Over the last few quarters I have been having discussions with different industry leaders about Data Integration and its future. During these discussions I realised that a few of them have misconceptions about ETL. They ask me about the future of ETL and want to cross-check: is it really dead or outdated?
I wasn't surprised. Their concerns are genuine, as they rely on ETL processing for their business reporting. The "ETL is dead or outdated" topic comes up every few quarters, whenever a new product comes to market and needs to find a market for itself. The topic is not new; even in 2010-11 I wrote about it multiple times. This time I decided to put things into perspective and address some of the misconceptions. I am using “Kafka” as the example, since various field experts and vendors use it as their reference when claiming “ETL is dead” :-).
Let me provide some basic clarity on ETL. ETL is a concept. It has been around for decades and will remain as long as data processing in any form remains. ETL is nothing but Extract, Transform and Load. In other words, it is Data Integration and Governance, and it is served by multiple vendors with their tools. Historically, ETL tools (not ETL) took over much of the SQL work inside enterprises to streamline their processes. To reiterate: ETL tools haven’t replaced SQL, but complemented its extensive use with automation, governance and additional features. ETL became ELT, TELT, TETL, but the end result was still the same: Data Integration, or ETL.
Let me focus on the "T" of ETL before addressing E and L. The “T” contains everything about data enrichment. Simple, complex, continuous, dynamic, real-time: all are possible, depending on how the vendor delivers the capability. Yes, it is manual, error-prone, trial-and-learn, and needs to evolve. Now, the “E” can be from anywhere. Which sources are supported depends on the capability of the tool, not on the ETL concept: it is whatever the vendor thinks is important to support in order to address its clients’ requirements. The same is true for “L”. For these tools Kafka is just another source or target, and that is what Kafka claims to be anyway.
For the last 5 years people have been focusing on real-time and “process immediately” requirements. Real-time processing doesn’t take ETL away. You still need to process incoming data by transforming it (enriching it) and finally loading whatever data makes sense as per policy requirements or future requirements. And the target can be anywhere.
If an enterprise has to calculate last quarter's profit for a region, it still needs to pull specific data and process it. The data is not real-time, but the result can be, based on real-time processing. Processing data and providing results “now” using in-memory processing still doesn't kill ETL, as you still extracted the data and transformed it by applying the business logic of the calculation. Whether to use an ETL tool, SQL or Kafka is the choice the enterprise needs to make. How much manual work, automation and governance it wants is also its choice. That choice doesn’t make an ETL tool batch, real-time or outdated. What matters is how its runtime works and processes data internally. Can it work as a continuous pipe which can always be updated and scaled, like the DataStage PX Engine? The evolution of ETL tools is driven by industry usage, and it is happening.
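To make the quarterly profit example concrete, here is a minimal sketch in Python. The sample rows, field names and function names are all hypothetical, and an in-memory list stands in for whatever source and target an enterprise actually uses. The point is only that even an "instant" in-memory calculation still has an extract, a transform and a load step.

```python
from collections import defaultdict

# Hypothetical source rows; in practice these would come from a database, file or topic.
SALES = [
    {"region": "EMEA", "quarter": "2017Q4", "revenue": 1200.0, "cost": 800.0},
    {"region": "EMEA", "quarter": "2017Q4", "revenue": 500.0, "cost": 450.0},
    {"region": "APAC", "quarter": "2017Q4", "revenue": 900.0, "cost": 300.0},
    {"region": "EMEA", "quarter": "2017Q3", "revenue": 700.0, "cost": 650.0},
]


def extract(rows, region, quarter):
    """E: pull only the rows needed for the requested region and quarter."""
    return [r for r in rows if r["region"] == region and r["quarter"] == quarter]


def transform(rows):
    """T: apply the business logic, here profit = revenue - cost, summed per region and quarter."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["region"], r["quarter"])] += r["revenue"] - r["cost"]
    return [
        {"region": region, "quarter": quarter, "profit": profit}
        for (region, quarter), profit in totals.items()
    ]


def load(results, target):
    """L: write the results to the target (a plain list stands in for a real store)."""
    target.extend(results)
    return target


report = load(transform(extract(SALES, "EMEA", "2017Q4")), [])
print(report)
```

Whether these three steps run inside an ETL tool, as SQL, or as code consuming from Kafka, the shape of the work stays the same.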
Recently I saw multiple blogs about Kafka and how efficiently it is replacing ETL with real-time processing. I am not even sure the people declaring ETL dead have ever looked at an EDW, or at Cassandra or Hive data stores, or even understood why a fact table is required and what problem it solves. I do agree that many people, unaware of how ETL tools are really meant to be used, continue to use SQL behind the scenes and maintain scripts, even though these tools were meant to remove manual processes and scripts.
Bringing Kafka into play doesn’t make any processing real-time. Once you consume an event from Kafka, it is a tool or code that processes the data for further consumption. Now, if an enterprise hasn’t moved away from script-based execution even within its ETL tools, why would it do so for Kafka? And even if it does, the result will again be an ETL architecture, a cleaner-looking one perhaps, and hence ETL is not dead. Yes, you can say that existing ETL tools in their current form are outdated and need to evolve. For the last twenty years DataStage Datasets have provided a mechanism to store the results of transformations for future use and load them to any target without any need to re-apply business logic. I am sure other tools have it as well.
ETL tools process terabytes of data on a daily basis, so Kafka processing billions of rows is not a surprise. Is it really Kafka that scales and transforms these billions of records, or is the change driven by the engines that process these records, provide insight about those events and store them? Kafka is trying to become another ETL tool by providing extract and load connectors through its Connect API, and transform features for enrichment. So yes, it is another ETL tool, which a few are citing to say “ETL is dead / outdated”.
Even events arriving in Kafka have to be loaded there by some tool, and need a tool or connector to extract them again. They can be in any format (e.g. JSON) and need some transformation layer to pick specific content for processing, to join or look up, to remove unwanted content (e.g. duplicates), to fill in missing content intelligently, and finally to push the result to Hadoop or the cloud, or even back to Kafka so another process can consume it. This is nothing but “ETL”. Kafka is not storage but a pipe with an in and an out, or a source and a sink. Prior to Kafka, MQ, JMS and other tools provided this kind of processing.
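The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain lists stand in for the Kafka source and sink topics (no Kafka client is used), and the event fields, the customer lookup table and all function names are hypothetical.

```python
import json


def extract(raw_messages):
    """E: parse raw JSON payloads, skipping anything that is not a JSON object."""
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            yield event


def transform(events, customers, default_country="UNKNOWN"):
    """T: pick specific fields, drop duplicates, look up reference data, fill missing values."""
    seen = set()
    for event in events:
        event_id = event.get("id")
        if event_id is None or event_id in seen:
            continue  # remove unwanted content: duplicates and events without an id
        seen.add(event_id)
        customer = customers.get(event.get("customer_id"), {})  # lookup / join
        yield {
            "id": event_id,
            "customer_id": event.get("customer_id"),
            "amount": float(event.get("amount", 0.0)),
            "customer_name": customer.get("name"),
            # fill a missing country from the lookup table, else use a default
            "country": event.get("country") or customer.get("country") or default_country,
        }


def load(records, sink):
    """L: push enriched records to the sink (a list standing in for a topic or store)."""
    for record in records:
        sink.append(json.dumps(record, sort_keys=True))
    return sink


# Hypothetical incoming messages and reference data.
incoming = [
    '{"id": 1, "customer_id": "c1", "amount": 10, "country": "IN", "debug": "x"}',
    '{"id": 1, "customer_id": "c1", "amount": 10, "country": "IN"}',
    '{"id": 2, "customer_id": "c2", "amount": 5.5}',
    'not json',
]
CUSTOMERS = {
    "c1": {"name": "Acme", "country": "IN"},
    "c2": {"name": "Globex", "country": "US"},
}

outgoing = load(transform(extract(incoming), CUSTOMERS), [])
for message in outgoing:
    print(message)
```

Swap the lists for real topics, files or tables and the structure does not change: something extracts, something transforms, something loads.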
Just as users in the past preferred to write SQL and maintain their scripts instead of using ETL tools efficiently, engineers today are trying to learn new technologies. It is a welcome change. But the arrival of a new tool or language doesn’t replace the existing one, as we saw with Java and C (still going strong), or with Perl and other scripting languages. And it doesn’t make ETL dead, old or outdated. Tools can become outdated or be replaced by more advanced ones, provided those scale well and address the problem. Existing ETL tools even drive better Kafka usage than Kafka with its Connect API alone, by consuming and enriching data directly within their pipelines and pushing it to Kafka. Another pipeline can then take this message / event, perform further enrichment if required, and push it to the target.
You can’t artificially create a problem to solve with a new tool :-)
I agree that vendors owning ETL tools need to drive architectural changes with more innovation, to address new challenges and adapt to new kinds of processing requirements. But please stop using the term “ETL is Dead” and instead say “Long Live ETL”, or Data Integration :-), as it integrates data from all available sources and will continue to do so as long as data processing is required.
The next one will be on RDBMS, Hadoop and Cloud :-) and maybe Spark / NiFi / SAM.