December 2017 brought renewed focus on IBM Unified Governance and Integration ("UG&I"), with an emphasis on modernization, especially for DataStage. IBM showcased and later released Data Flow Designer, a web-based client for DataStage. It addressed a long-standing requirement from clients and from DataStage's large community of data engineers.
December 2018 was no different, with IBM bringing another addition to its UG&I portfolio, this time with a focus on Spark. IBM recently released support for Spark as an alternate runtime for Data Flow Designer (DataStage jobs), or, simply put, DataStage with Spark. Clients now have the choice of using the IBM proprietary PX engine or Spark as the runtime (engine) to process their data. PX has been discussed at length by many people over the years, including me, so this post focuses on what the current release (Spark support) brings to the table for DataStage developers and users.
Data Flow Designer supports a new job type, Spark, alongside the existing job types such as Parallel and Sequencer. The user can use Data Flow Designer to create a DataStage job targeting the Spark runtime, compile it, and even run it on Spark.
To use Spark, the user needs to provide Spark cluster details or YARN details.
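The source does not document the exact form these details take, but they would resemble a standard Spark configuration. The fragment below is purely illustrative; the property names mirror stock Spark settings, and the host names and paths are made up:

```properties
# Illustrative only -- actual property names in the product may differ.
# Pointing at a standalone Spark cluster:
spark.master=spark://spark-master.example.com:7077

# ...or at YARN, in which case the Hadoop client configuration
# directory must also be available:
spark.master=yarn
HADOOP_CONF_DIR=/etc/hadoop/conf
```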
DataStage jobs created for Spark are also visible, with the new job type, in the Jobs Dashboard.
In the initial release, DataStage on Spark supports connectors for Db2, Greenplum, Hive, the local file system, Netezza, Amazon S3, Oracle, SQL Server, and Teradata.
Beyond connectivity, DataStage jobs on Spark can currently use transformation stages such as Filter, Funnel, Join, Merge, Remove Duplicates, Sort, and Transformer.
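To make the stage names concrete, here is a toy, plain-Python illustration of the relational semantics a few of these stages carry. On the actual Spark runtime these would of course be expressed as Spark operations; the data, field names, and keys below are entirely made up:

```python
# Toy input rows -- all data and field names are invented for illustration.
orders = [
    {"id": 1, "cust": "A", "amt": 120},
    {"id": 2, "cust": "B", "amt": 80},
    {"id": 2, "cust": "B", "amt": 80},   # duplicate row
    {"id": 3, "cust": "C", "amt": 200},
]
customers = {"A": "Alice", "B": "Bob", "C": "Carol"}

# Filter stage: keep only rows satisfying a predicate.
big = [r for r in orders if r["amt"] >= 100]

# Remove Duplicates stage: drop repeated rows (keyed on "id" here).
seen, deduped = set(), []
for r in orders:
    if r["id"] not in seen:
        seen.add(r["id"])
        deduped.append(r)

# Join stage: enrich each order with the matching customer name.
joined = [{**r, "name": customers[r["cust"]]} for r in deduped]

# Sort stage: order rows by amount, descending.
ordered = sorted(joined, key=lambda r: r["amt"], reverse=True)

print([r["id"] for r in ordered])  # -> [3, 1, 2]
```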
When you submit a DataStage job to Spark via YARN, it is treated like any other application, and you can view its status and logs using the YARN UI or command line.
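For the command-line route, the standard YARN CLI applies; these require a YARN client and a running cluster, and the application ID below is just a placeholder:

```shell
# List running applications -- the submitted DataStage job shows up here.
yarn application -list

# Check the status of a specific application (placeholder ID).
yarn application -status application_1546300800000_0001

# Pull the aggregated logs once the application has finished.
yarn logs -applicationId application_1546300800000_0001
```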
Looking at the content that gets executed, Scala appears to be the driver behind the scenes, which is great news: it can help integrate these flows with the rest of a data science pipeline in Airflow or any other orchestration tool.
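As a hypothetical sketch of that integration, an Airflow DAG could shell out to `spark-submit` to run such a flow alongside other pipeline tasks. Everything here is an assumption: the DAG name, class, and jar path are invented, the imports assume Airflow 2.x, and the real DataStage Spark runtime may expose its compiled jobs quite differently:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="datastage_spark_flow",        # hypothetical DAG name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_flow = BashOperator(
        task_id="run_datastage_spark_job",
        # Placeholder artifact and class: in practice this would be
        # whatever the DataStage Spark runtime produces for the job.
        bash_command=(
            "spark-submit --master yarn "
            "--class com.example.GeneratedFlow "
            "/path/to/generated-flow.jar"
        ),
    )
```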
Currently, the user needs to decide on the runtime in advance and create DataStage jobs specifically targeting Spark.
A Hadoop deployment is not required to use DataStage with Spark. The user can leverage an existing deployment and provide the Spark cluster details. The Spark cluster can be anywhere, as long as it is reachable from the Services tier and the Spark client libraries are available there.
I'll share more details in an upcoming blog post, and maybe a YouTube video, after Think 2019, along with a few relevant use cases. I am sure IBM will soon release support for additional features on the DataStage Spark runtime. :-)
-Ritesh
Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”