Friday, April 6, 2018

ETL is dead: Are we crazy?

Over the last few quarters I have been discussing Data Integration and its future with different industry leaders. During these discussions I realised that a few of them have misconceptions around ETL. They ask me about the future of ETL and want to cross-check whether it is really dead or outdated.

I wasn't surprised. Their concerns are genuine, as they rely on ETL processing for their business reporting. The "ETL is dead or outdated" topic keeps coming up every few quarters, whenever a new product enters the market and needs to find a market for itself. The topic is not new; even in 2010-11 I wrote about it multiple times. This time I decided to put things into perspective and address some of the misconceptions. I am using "Kafka" as the example here, because different field experts and vendors use it as a reference to claim "ETL is dead" :-).
Let me provide some basic clarity on ETL. ETL is a concept; it has been around for decades and will remain as long as data processing in any form remains. ETL is simply Extract, Transform and Load. In practice it means Data Integration and Governance, and it is serviced by multiple vendors with their tools. Historically, ETL tools (not ETL itself) replaced hand-written SQL inside enterprises to streamline their processes. To reiterate: ETL tools haven't replaced SQL but complemented its extensive use with automation, governance and additional features. ETL became ELT, TELT, TETL, but the end result was still the same: Data Integration, or ETL.

Let me focus on the "T" of ETL before addressing E and L. The "T" contains everything about data enrichment. Simple, complex, continuous, dynamic, real-time: all are possible depending on how a vendor delivers the capability. Yes, it is manual, error prone, trial and learn, and it needs to evolve. Now, "E" can be from anywhere. Which sources a vendor supports depends on the capability of the tool, not on the ETL concept, and on what the vendor thinks is important for addressing its clients' requirements. The same is true for "L". For these tools Kafka is just another source or target, which is what Kafka claims to be anyway.

For the last five years people have been focusing on real-time and "process immediately" requirements. Real-time processing doesn't take ETL away. You still need to process the incoming data by transforming (enriching) it and finally loading the data that makes sense as per policy or future requirements. The target can now be anywhere.

If an enterprise has to calculate last quarter's profit for a region, you still need to pull specific data and process it. It is not real-time data, but the result can be real time thanks to real-time processing. Processing data and providing results "now" using in-memory processing still doesn't kill ETL: you still extracted the data and transformed it by applying the business logic of the calculation. Whether to use an ETL tool, SQL or Kafka is a choice the enterprise needs to make, as is how much manual work versus automation and governance it wants. That choice doesn't make an ETL tool batch, real-time or outdated; that depends on how its runtime works and processes data internally. Can it work as a continuous pipe which can always be updated and scaled, much like the DataStage PX Engine? The evolution of ETL tools is driven by industry usage, and it is happening.
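To illustrate the point in plain code, here is my own minimal sketch, not any vendor's tool; the file name, column names and region value are assumptions for illustration only:

    # Minimal batch "ETL" sketch: extract rows, apply the business logic, load a result.
    # Assumes a hypothetical sales.csv with columns: region, quarter, revenue, cost.
    import csv

    def quarterly_profit(path, region, quarter):
        profit = 0.0
        with open(path, newline="") as f:
            for row in csv.DictReader(f):                                  # Extract
                if row["region"] == region and row["quarter"] == quarter:
                    profit += float(row["revenue"]) - float(row["cost"])   # Transform (business logic)
        return profit                                                      # Load: here we simply return it

    if __name__ == "__main__":
        print(quarterly_profit("sales.csv", "EMEA", "2018Q1"))

Whether this logic lives in an ETL tool, a SQL statement or a streaming job, it is still extract plus transform plus load.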

Recently I saw multiple blogs about Kafka and how efficiently it is replacing ETL with real-time processing. I am not even sure the people declaring ETL dead have ever looked at an EDW, or at Cassandra or Hive data stores, or understood why a fact table is required and what problem it solves. I agree that many people, unaware of the real usage of ETL tools, continue to use SQL behind the scenes and maintain scripts, even though these tools were meant to remove manual processes and scripts.

Bringing Kafka into play doesn't make any processing real time. Once you receive an event from Kafka, it is a tool or code that processes the data for further consumption. If an enterprise hasn't moved away from script-based execution even within ETL tools, why would it do so for Kafka? And even if it does, the result will again be an ETL architecture; yes, it may be a cleaner-looking ETL architecture, which is exactly why ETL is not dead. You can say that in their current form existing ETL tools are outdated and need to evolve. DataStage Datasets, for example, have provided a mechanism for the last twenty years to store transformations for future use and load them to any target without re-applying the business logic. I am sure other tools have this as well.

ETL tools process terabytes of data on a daily basis, so Kafka processing billions of rows is not a surprise. Is it really Kafka scaling and transforming those billions of records, or is it the engines processing those records, providing insight about those events and storing them, that are driving the change? Kafka is trying to become another ETL tool by providing extract and load connectors through its Connect API, plus transform features for enrichment. So yes, it is another ETL tool, and it is the one a few people are pointing at to say "ETL is dead / outdated".

Even events arriving in Kafka need to be extracted from Kafka by some tool or connector. The data can be in any format (e.g. JSON) and needs a transformation layer to pick specific content for processing, join, look up, remove unwanted content (e.g. duplicates), update missing content intelligently, and finally push it to Hadoop, the Cloud, or even back to Kafka so another process can consume it. That is nothing but "ETL". Kafka is not storage but a pipe with an in and an out, a source and a sink. Even before Kafka, MQ, JMS and other tools provided this kind of processing.
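Here is a minimal sketch of such a pipeline using the kafka-python client, just to make the "it is nothing but ETL" point concrete; the topic names, the JSON fields and the naive de-duplication rule are my own assumptions for illustration:

    # Minimal streaming "ETL" sketch with kafka-python (pip install kafka-python).
    # Extract from one topic, transform (drop duplicates, fill a missing field), load to another topic.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("raw-events",                          # assumed source topic
                             bootstrap_servers="localhost:9092",
                             value_deserializer=lambda b: json.loads(b.decode("utf-8")))
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=lambda d: json.dumps(d).encode("utf-8"))

    seen_ids = set()                                                # naive in-memory duplicate filter
    for message in consumer:                                        # Extract
        event = message.value
        if event.get("id") in seen_ids:                             # Transform: remove duplicates
            continue
        seen_ids.add(event.get("id"))
        event.setdefault("country", "unknown")                      # Transform: fill missing content
        producer.send("enriched-events", event)                     # Load: push for downstream consumers

Call it a consumer, a connector or a pipeline; the E, the T and the L are all still there.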

Just as users in the past preferred to create SQL and maintain their own scripts instead of using ETL tools efficiently, engineers today are trying out new technologies. It is a welcome change. But the arrival of a new tool or language doesn't replace the existing one, as we see with Java and C (still going strong), Perl and other scripting languages. None of this makes ETL dead, old or outdated. Tools can be outdated or replaced with more advanced ones, provided the new ones scale well and address the problem. Existing ETL tools can even drive better Kafka usage than Kafka alone with its Connect API, by consuming and enriching data directly within their pipelines and pushing it to Kafka. Another pipeline can then take that message or event, perform further enrichment if required, and push it to the target.

You can't artificially create a problem just to solve it with a new tool :-)

I agree that vendors who own ETL tools need to drive architectural changes with more innovation to address new challenges and adapt to new kinds of processing requirements. But please stop using the term "ETL is dead"; instead say "Long live ETL", or Data Integration :-), because it integrates data from all available sources and will continue to do so as long as data processing is required.

The next one will be on RDBMS, Hadoop and Cloud :-) and maybe Spark / NiFi / SAM.

-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Saturday, December 23, 2017

DataStage Flow Designer - Introduction

Here is the new look of the Information Server 11.7 Launchpad. Looking closer at the launchpad, everything tagged "New" is a core new feature of this release. The focus of this blog is the DataStage Flow Designer.


Click on the DataStage Flow Designer icon under Integration and it will open the DataStage Flow Designer window. You can also open it directly using the URL https:///ibm/iis/dscdesigner/#/ . The initial screen shows a project selection prompt, much like the existing DataStage Designer. If the user is not logged in, it will prompt for an Information Server user. You can choose the project you want to work with, and yes, you can switch across projects anytime, in no time.

Once you select the appropriate project, you can see videos on how to create your first job and an overview of DataStage Flow Designer.

Click on "Creating your First Job" to learn how to create a DataStage job. DataStage Flow Designer keeps design time completely separate from runtime: the user is not connected to an engine until compiling or executing the job.

Backward Compatible: users can open any existing job within IBM DataStage Flow Designer; no migration is required. Users can also open any new job created in the new user interface inside the existing Windows-based Designer. You can see the same job opened in DataStage Flow Designer and in the existing Designer. DataStage Flow Designer provides design-time capabilities and keeps the flow independent of runtime: the user does not specify a runtime engine while designing a job.


DataStage Flow Runtime: the user specifies the runtime at compile and run time. Currently, based on the project selection, it detects the runtime and deploys the content for execution in a secure manner; communication between design time and runtime is completely secure. The view provides access to the compiled OSH content and the job logs, and detailed job logs are available via the Operational Console.



Quick Navigation: DataStage Flow Designer enables the user to mark specific jobs as favourites and access them easily (e.g. on the welcome page) later on. The user can even choose to show only bookmarked jobs. It also provides quick search capabilities (e.g. by timestamp, description, name, etc.), saving time when looking for a job instead of navigating the repository or folders. Virtual scrolling makes it possible to list thousands of jobs within a project.

Other Key Features: use the palette to drag and drop the available connectors and operators onto the canvas. Different stages can be linked via nodes and links, as visible in the job design. You can review and edit stage and column properties, and you can leverage the mini-map on the lower right to focus on a certain area of a large, complex job.

I'll share details of the different features via video blogs. You can click for the DataStage Flow Designer Overview.

Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

DataStage Flow Designer - Creating First Job

Here is a short video on how to create your first job using DataStage Flow Designer. I'll share another video on how to compile and execute the job.



Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Friday, December 22, 2017

DataStage Flow Designer - API driven Data Integration

In the last couple of years, the Hadoop and Cloud journey has pushed the Data Integration space to reinvent itself. Data consumption models are changing. Processing data requires these models to gel together under a cohesive strategy rather than as isolated Hadoop, Cloud and traditional data processing deployments.
I spent some time understanding this evolving space and have now decided to take up the pen and share my perspective in a series of blogs.

Let me start with the latest happening in this space: IBM recently released Information Server 11.7. Information Server consolidated the integration and governance space in 2007 and has provided new features every year.
After a decade, IBM decided to bring the next wave of innovation as part of IBM Information Server 11.7. Changes were already visible on the governance side with the Information Server 11.5 release (IA Thin Client); 11.7 introduces the next level of changes. It brings an API-based thin client for DataStage. "YES", you read that right. IBM has introduced a DataStage Flow Designer which can be used from any browser and allows you to create DataStage jobs. Here are a few interesting features:
  • The new client is backward compatible, allowing users to create or edit jobs both in the thin client and in the existing thick-client DataStage Designer.
  • Users can compile the flow from within the thin client (no need for a Windows client).
  • Users can execute the job from within the thin client and view logs in the existing Director, Designer or Operational Console.
  • Users can create DataStage jobs using REST APIs.
  • Users can integrate these APIs with their automated build pipeline (e.g. Jenkins); see the rough sketch below.
Of course, lots of new features are part of this release, both on the governance side and on the connectivity side. Here is the official feature summary: Information Server 11.7 Feature Summary, and a preview of DataStage Flow Designer.
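To give a feel for the build-pipeline point above, here is a rough sketch of what a call from a Jenkins stage could look like. The endpoint path, payload fields and credentials are placeholders I made up for illustration, not the documented API; please check the Information Server 11.7 documentation for the actual contract.

    # Rough sketch of invoking a DataStage Flow Designer REST endpoint from a build pipeline step.
    # The URL path, payload fields and credentials below are placeholders, not the documented API.
    import requests

    IIS_HOST = "https://iis-host:9443"          # placeholder Information Server host
    AUTH = ("isadmin", "password")              # placeholder credentials

    payload = {"projectName": "dstage1", "jobName": "my_first_flow"}      # illustrative fields only
    resp = requests.post(f"{IIS_HOST}/ibm/iis/dscdesigner/api/compile",   # placeholder endpoint
                         json=payload, auth=AUTH, verify=False)           # verify=False only for lab setups
    resp.raise_for_status()
    print(resp.json())

A Jenkins stage would simply run a script like this after the code is checked in, which is exactly what API-driven Data Integration is about.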

I'll write a series of blogs soon sharing the detailed features of Information Server 11.7, usage of DataStage Flow Designer, use of the APIs and many other interesting features.

Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Thursday, February 13, 2014

DevOps - Automation and Continuous Testing is the Key

While leading the organization to adopt continuous delivery and DevOps, I faced various challenges: the familiar problem of many "automated, but..." setups, and lengthy, sequential execution cycles caused by various factors, not counting the build size itself or network and geographical constraints. Test artifacts need to be copied across systems, components belong to different teams using varied test assets, and executing tests on frequent, say hourly, builds to identify failures immediately is quite a challenging task. Yet this holds the key to success for both continuous integration and continuous delivery. Developers should get feedback on whether their code is "ok" from a quick personal build and test run immediately, so they can resolve issues right away rather than after moving on to something else (a new function or piece of logic); every line written impacts the system. On success, you can either run the complete set of tests before integrating the tested changes, or integrate the changes and then run the full set of tests, depending on the time required. In either case a quick turnaround is required, and that is not possible to achieve manually.

Now, everyone does automation, so what is so special? The key is complete automation, with no "but" or "almost": the system performing every task on its own. Various tools, including IBM's, hold the key here, providing seamless automation capabilities. It is up to us to deliver an integrated testing mechanism which not only performs tasks automatically but also reduces the time required for testing, even as the number of tests to be executed increases.
 

We tend to run everything, but tests should be executed in a deliberate order: run a sanity test before the full suite for individual components, to ensure the build, infrastructure, configuration and other assets are working at a minimal level of acceptance. Prioritize frequently failed tests to run right after sanity, to confirm the current build is good for formal testing. This leads to an "always working" test bucket with very little chance of failure that covers the majority of the tests. That leaves only a subset of tests which either do not provide any major benefit or are really time consuming and not required on every run; reduce their execution frequency. DevOps also improves the process: we can build and test components separately and then perform integration testing, enabling frequent testing of an individual feature or component.
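As a loose sketch of that ordering idea (the test functions and the failure-history numbers below are made-up placeholders; in practice the history would come from your test-result store):

    # Illustrative test-ordering sketch: sanity suite first, then tests ranked by past failure rate.
    def order_tests(sanity, others, failure_rate):
        # Highest historical failure rate runs first so regressions surface early.
        ranked = sorted(others, key=lambda t: failure_rate.get(t.__name__, 0.0), reverse=True)
        return list(sanity) + ranked

    def run(tests):
        for t in tests:
            try:
                t()
                print("PASS", t.__name__)
            except AssertionError:
                print("FAIL", t.__name__)

    def sanity_build_exists():      assert True     # placeholder sanity check
    def test_frequently_failing():  assert True     # placeholder functional tests
    def test_stable_feature():      assert True

    failure_history = {"test_frequently_failing": 0.4, "test_stable_feature": 0.01}
    run(order_tests([sanity_build_exists], [test_stable_feature, test_frequently_failing], failure_history))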


Concurrency, or parallelism, is another step: by automating the deployment of builds following the DevOps model, with automated provisioning and configuration, we can run tests across systems in parallel and get results much faster. Automated provisioning also enables a template-based approach, taking away install time, and we can snapshot areas where no new pieces need to be installed, such as databases and source systems.
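A rough illustration of that parallel fan-out, assuming each entry is simply a command that launches one component's suite on an already provisioned machine (the commands here are placeholders):

    # Rough sketch: run independent component test suites concurrently instead of sequentially.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    suites = {
        "connector-tests": ["echo", "running connector suite"],   # placeholder commands
        "engine-tests":    ["echo", "running engine suite"],
        "ui-tests":        ["echo", "running ui suite"],
    }

    def run_suite(name, cmd):
        result = subprocess.run(cmd, capture_output=True, text=True)
        return name, result.returncode

    with ThreadPoolExecutor(max_workers=len(suites)) as pool:
        for name, rc in pool.map(lambda item: run_suite(*item), suites.items()):
            print(name, "PASS" if rc == 0 else "FAIL")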

Stubs hold the key, rather than connecting to external systems for every small change. This saves a lot of time, and tools like IBM Green Hat are great simulation tools. Consolidating infrastructure using cloud-based deployment (Softlayer) lets us use local virtual systems where the other test assets reside, removing network contention and improving productivity.

I will cover more testing aspects and other DevOps-related content, including uDeploy, Chef and Softlayer, in the coming series.


-Ritesh 
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Tuesday, February 11, 2014

DevOps - Key to Continuous Delivery

While exploring Jenkins and the continuous integration pipeline, I thought of sharing some thoughts on continuous delivery. Gone are the days of the waterfall model, where companies spent years building products and clients spent further time consuming them before moving to production. In the era of continuous change and the need for something new every day, companies are moving towards quick release cycles. It started with mobile applications, where new features get delivered on a regular basis on top of an existing framework and can be seamlessly upgraded as well. With frequent release cycle requirements, an established mechanism needs to be in place to avoid stress on resources and quality standards. This leads to a concept guided by open communication and collaboration between software engineering and IT operations experts.

DevOps is a model which enables an enterprise to continuously deliver software to the market and seize market opportunities by responding immediately to customer feedback or requirements, with all quality checks in place. Following agile principles within the software delivery life cycle, DevOps enables organizations to achieve quicker time to value and provides scope for more innovation, leading to easier maintenance cycles. The goal of the DevOps model is to move towards automation without the need to enter anything manually; automation can be triggered by non-operations resources, perhaps by the system itself. It gives developers more control over the environment so they can focus on research. DevOps leads to predictability, efficiency and maintainability of operational processes thanks to automation.

Different tools are available in the market to complete the DevOps model, including a full set of DevOps offerings from IBM which help achieve process optimization and continuous development. IBM also offers DevOps as a Service, from development to deployment, where developers can collaborate with others to develop, track, plan and deploy software. It is completely cloud based, with community-based development via IBM BlueMix, where you can create applications efficiently and deploy them across domains. From planning to development, and testing to release and deployment, everything is in one place, following a series of automation pieces to achieve machine-driven product delivery.

I will share more on the steps, how to create apps on BlueMix following the DevOps model, and the Jenkins series, in upcoming blogs.


-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Thursday, January 23, 2014

Jenkins Series 2 - How to Create my First Job in Jenkins

Before sharing how to use Jenkins to build complex pipelines and create build trends, let me show how to create your first job in it. Click on New Job; it will open a new window with different options.

Create a New Job




In the new window you will see a screen like the one below; add the name of the job you want to create.
Provide the Name of the Job
Once the name is given, the next step is to choose Add a Build Step. This is where you define what steps your Jenkins job should perform. You can pick Windows batch or Unix shell depending on where you want to execute your commands or scripts; in my example I have chosen Windows. There are various other options available which you can play around with once you are comfortable with Jenkins and really need them.

Choose the Shell You Want for the Build Step

Once the shell is available, add the command, batch file, shell script or any other steps you want to perform. I have given only the ls -l command and saved it. This is what really gets executed on the slave (we will discuss this in another blog), a kind of compute node where the process is intended to be executed. In this case, since you do not have anything registered other than the master, it will run on the same machine where Jenkins is deployed and will give you a listing of the workspace directory. That is the folder where Jenkins performs activities by default; you can use a custom workspace as well (we will discuss this in another blog).




Once you click Save, it leads to the following screen and provides the Build Now option.
Click on Build Now to execute the above step
On successful execution you will see the results.
You can click on the Console Output to see the details of the logs. Since we haven't selected a specific framework or reporting, you will see the console log by default for the steps executed above, as Jenkins opens a shell or command prompt and executes the steps.

Console Output
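Once the job exists, you do not have to click Build Now by hand every time; Jenkins also exposes a remote build trigger over HTTP. Here is a minimal sketch (the host, job name, user and API token are placeholders for your own setup, and depending on your security configuration a CSRF crumb may also be needed):

    # Minimal sketch: queue the job remotely instead of clicking "Build Now".
    import requests

    JENKINS = "http://localhost:8080"       # placeholder Jenkins URL
    JOB = "MyFirstJob"                      # placeholder job name

    resp = requests.post(f"{JENKINS}/job/{JOB}/build",
                         auth=("myuser", "my-api-token"))   # Jenkins user + API token
    print(resp.status_code)                 # 201 means the build was queued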

I hope you have now created your first job in Jenkins and are ready for an interesting journey. In coming blogs I will discuss how to register a slave from the command line, what a slave in fact is, how to consolidate results, and how to run different kinds of frameworks. So stay tuned.

-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions