Sylvain Zyssman Head Of Engineering - DataOps - Team leader - Challenge seeker | I help companies build scalable, end-to-end, software & data solutions

ETL & ELT

Disclaimer : this website offers a metaphorical approach to big data, data science and data engineering concepts. They are purposely simplified and some of them may be eluded or approximative.

ETL and ELT are two processes based on the same steps, although executed in a different order.

E: Extract

T: Transform

L: Load

Extract

The first step of our process is to extract data. In the data world, we extract and gather data from different sources : APIs, csv files, databases dumps… for a later use.

In our shop, data is made of our ingredients. When we go shopping to buy them, we may go from store to store to get what we need. We are extracting ingredients from shops and gathering them for a later use.

In the data world, the Extract phase is brutal. We extract all what our sources have to offer. No time to deal with cleaning data, for now.

Messy shelves to represent raw data — Raw data is like these shelves. They contain all we need but are messy and unpractical.

Same goes with our shopping. Let’s say we forgot our ingredients list, we’ll then buy some stuff we may not need, or gather more ingredients that we would actually need. (who uses 1kg of sugar in a cake anyway ^^)

Now that we have out data, let’s head for the next step. But should it be a T or a L?

Transform or Load?

Back in the time, storage was expensive. We had to process (Transform) the data beforehand to only store high quality data. But with the increasing popularity of cloud solutions, which started in the 2000’s, storage became less and less of a problem. We can even have unlimited storage, for a few cents a Go / transaction.

Loading data before transforming it became the standard, in the majority of the scenarios.

Load

Now that we have retrieved our data (or our ingredients for that matter), we have to load them to their point of storage. For real data, cloud providers offer a lot of solutions, most famous being Amazon S3, Google Buckets and Azure storage.

Our data is still raw, just like our ingredients (remember at this time we just bought flour, sugar, chocolate, cream etc.) We have everything in its own packaging, coming from different sources.

We take them back home an store them “as is” in our shelves.

Transform

OK, time to cook. This phase depends on what we want to do with our data / ingredients. A lot of recipes use the same ingredients for different results.

An egg, for exemple, can be cooked, boiled, whipped… Same ingredient, different transformation…

Same goes with data. A JSON file coming from an API can be read and loaded into a dataframe to train Machine Learning models, or parsed to go into a database. We can transform it with aggregation, do some maths with it to create new data… (just like cream can become butter, with the relevant transformation).

It all depends on what our final point is. We can do cupcakes, croissants, wedding cakes, donuts… all with the “same” ingredients.

Well, with transformed text for instance, we can use our data to build a Business Intelligence platform for our users, train a sentiment analysis model, develop a brand new API etc.

Consuming the data

Quick recap. We Extracted our data (bought the ingredients), Loaded it (took the ingredients at home and stored them) and Transformed it (baked some cakes).

Now our users/customers are able to consume our data.

Pastry showcase, like our qualified data — Transformed data can now be showcased and consumed by our users

As our customers would enjoy our fresh and fancy cakes, we can provide an API, create custom analytics, or use Business Intelligence tools (such as Power BI, Tableau, Looker…) etc for our end-users’ needs.

Images from Unsplash (Shayna Douglas, Annie Spratt, Evonne Yuwen)

01 Apr 2022

Batch & Stream processing »