Batch & Stream processing
One of the most important aspects of data processing in Data Engineering is the Lambda architecture. This concept is based on 2 methods. The choice of one or the other will depend on the insights you need.
Stream processing
Back to our bakery. Throughout the day, customers come into the store and buy our products.
From time to time, the boss drops by. He wants to have an (almost) “real time” overview of the store’s performance.
To meet his need, and since the visits are unplanned, the employees write down each of their sales on a whiteboard. The data arrives in real time (after each sale) and the analysis is instant.
The analysis of the data and its visualization (“dataviz”) is of course based on the data collected. It can be simply the number of sales, but also by which employee, which type of product, at which period or time of the day, which type of customer (socio-demographic data), which type of payment…
Batch processing
While stream processing can provide a real-time overview, it does not easily give a clear, in-depth perspective on the data.
It is difficult to imagine a gigantic whiteboard on which all the transactions of the year would be recorded in detail.
So we store the data and analyze it in “chunk”. A batch of a week, a month, a year of sales… We can make analysis on larger data and have a global vision, see the “big picture”.
Of course, nothing prevents stream processing over long periods of time, assuming you have the necessary resources and adapt your workflow with the appropriate tools.
Stream or batch ?
It is often recommended to start with batch processing, which is considered as a solid base for a performing big data platform.
The process is simple, easy to set up and pretty low cost, allowing to get a global view on data, customers…
As the business grows, and if the need for real-time analysis arises, we can then add a stream pipeline to our big data platform.