"ETL is dead; long-live streams"​ is a false statement

If someone in your organization is pushing for real-time processing for everything, use this analogy: "We humans eat, then process food, and then sh.. Everything is done in batches. Do not even try to imagine this as real-time processing (it will look ugly)." Real-time processing is great; however, some tasks are inherently batch work, and pushing real-time processing onto them would be wasteful or "ugly".

Some good examples of real-time processing

  1. Personalizing the shopping experience: By analyzing data in real time, e-commerce websites can recommend products or show targeted ads to individual customers based on their past purchases and browsing history.
  2. Fraud detection: Real-time data processing can help e-commerce companies detect and prevent fraudulent transactions as they happen.
  3. Inventory management: Real-time data processing can help e-commerce companies track and manage their inventory in real time, ensuring that products are in stock and available for purchase.

Some good examples of batch processing

  1. Financial reporting: E-commerce companies can use batch processing to generate financial reports on a regular basis, such as daily, weekly, or monthly sales reports.
  2. Data analysis: Batch processing can be used to analyze large amounts of data, such as customer purchase histories or website traffic data, to identify trends and patterns.
  3. Customer segmentation: Batch processing can be used to segment customers based on their past purchases or other characteristics, allowing e-commerce companies to target their marketing efforts more effectively.

Batch data processing and real-time data processing are two methods of processing data that should be used together. Combining them helps a business make more informed decisions and respond more quickly to changing circumstances.
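To make the split concrete, here is a minimal sketch of how the same order data might feed both paths. Everything in it is an assumption made for illustration: the `Order` record, the `FRAUD_AMOUNT_LIMIT` threshold, and the two function names are hypothetical, and a real fraud check would be far more involved than a single amount rule.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Order:
    """Hypothetical order record, used only for this sketch."""
    order_id: str
    customer_id: str
    amount: float
    day: date


# Hypothetical threshold, chosen only for this illustration.
FRAUD_AMOUNT_LIMIT = 5_000.0


def looks_fraudulent(order: Order) -> bool:
    """Real-time path: decide per event, as soon as the order arrives."""
    return order.amount > FRAUD_AMOUNT_LIMIT


def daily_sales_report(orders: list[Order]) -> dict[date, float]:
    """Batch path: aggregate a complete set of orders in one pass."""
    totals: dict[date, float] = defaultdict(float)
    for order in orders:
        totals[order.day] += order.amount
    return dict(totals)


orders = [
    Order("o1", "c1", 120.0, date(2023, 1, 5)),
    Order("o2", "c2", 9_800.0, date(2023, 1, 5)),
    Order("o3", "c1", 40.0, date(2023, 1, 6)),
]

# Real-time: each order is checked on arrival, before it ships.
flagged = [o.order_id for o in orders if looks_fraudulent(o)]

# Batch: the report runs once, after the period's data is complete.
report = daily_sales_report(orders)
```

The fraud check only needs the single event in front of it, so it gains from running immediately. The sales report needs the whole day's orders before its totals mean anything, so running it per event would add cost without adding information.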

Going to extremes will be wasteful and ugly. If you hear someone in your organization pushing one approach to the exclusion of the other, you might have a problem or a lack of understanding.

Update 2023-01-06

I have received more attention than I expected, with both positive and negative feedback.

I would like to clarify one thing: when talking about batch processing and real-time processing, I have a very particular example in mind:

"A manager comes and asks you to build a quarterly report, but in real time." Here we as data engineers must stand our ground and explain that chasing keywords like "real time" is not effective, because some reports can't be built in real time.

I do understand that my initial text didn't specify this very important piece, and I alone am responsible for that.

Christine Matousek

Chapter Lead - Data and AI Architecture (AI/ML | Modern Data Architecture | Integration)

2y

The root cause issue is… companies struggle to understand their own use cases, what they are trying to solve primarily. If they fully understood, they would see that most use cases are batch related, not real time needs.

Werner Daehn

BSc Digital Transformation, Chief Software Architect for Data Integration and Big Data

2y

Point 3: The question for your business is: once you know something, how long does the business process take to make a counter action? If the answer today is a day or more, the follow-up question is whether that is desired. Shouldn't the process be optimized instead? A new record is entered in the healthcare system: a toxicological report shows an emergency condition. This data is processed in batch, 4 hours later. A day later. 15 minutes later. The patient has died in the meantime. A customer ordered 100 l of cooking oil. Very unlikely for a private person. The DWH shows that tomorrow, but the barrel has already been shipped. The sales system marked an order as potentially fraudulent. Tomorrow somebody will decide that it would have been better not to ship. A little bit late.

Werner Daehn

BSc Digital Transformation, Chief Software Architect for Data Integration and Big Data

2y

Point 2: Can realtime tools be as efficient as batch? That is a tough one. A join is a good example of why realtime is problematic. Two line items are changed for a sales order. The stream hence gets two records and should output the joined result. Processing the join once per line item will be expensive. Waiting for 15 minutes and micro-batching it is realtime enough. If both line items are changed in the same database transaction, then the stream should join them as one. That is the optimal solution: efficient and low latency. Event streaming is a good example of why realtime is way more efficient than batch. In event streaming you get the changes. In batch you have to ask the source system for the changes. How often do you read the entire source table just to figure out there was no change at all? How long does your longest delta job run? Multiple hours? That should be called efficient?

Werner Daehn

BSc Digital Transformation, Chief Software Architect for Data Integration and Big Data

2y

Point 1: If realtime were as easy to build and even more efficient, would you still go for batch processing? According to your analogy, the answer would be no. Please explain the business reasons why you still want to use batch processing.

Uli Bethke

Follow me for SQL Data Pipelines, Snowflake, Data Engineering, XML Conversion

2y

Real time is necessary for some edge cases where you need to make automated, well-defined and predefined decisions in real time. Most scenarios and use cases just require batch processing, though. Implementing real time for everything is nuts. It is complex, brittle and, as a result, expensive for no added benefit.

Article by Arturas Tutkus