ETL is a very basic process to collect data for analysis. It doesn’t matter which role you want to take, either as a scientist, analyst or engineer. Every role needs to be able to perform basic ETL processes in any company.
The only difference is the depth by which you need to collect data.
As a data engineer, you need to ace ETL to the point of building production pipelines for live services. We aren’t covering some of the depths simply because there is a alot of engineering involved, and in this course, we try to get to the value of the data rather than be too distracted with engineering.
As a data analyst or scientist, you need to do ETL to conduct exploratory data analysis (EDA) for your work.
What is ETL?
As you can clearly see, each letter of ETL is a process by which we perform data collection:
- Extract – To gather the data from sources into your process.
- Transform – To mold them into a form that is ready for ingestion
- Load – To ingest the data into a database
There are many variants of data collection, and ETL is one of them. Others could be Extract-Load-Transform (ELT), because the speed where data is generating far outweighs the importance of transforming the data first, and hence we would extract and load them into a data warehouse or a NoSQL database.
What the school teaches is a simple ETL process that you can do on your local machine. At work, you might use more sophisticated tools such as Airflow or services like Databricks to perform ETLs, but its concept does not change from what you can do in your local machine.
Do you know you’re doing ETLs already?
Data Transformation and Cleaning
Let’s open up the file(s) in the 01-Ins_Data_Transformation_and_Cleaning
to get started.
We have been used to ingesting data from CSVs, but Pandas is able to ingest data through Excel. This is outside of the norm though, unless the company you work with has a lot of legacy systems.
Students Do: Transform and Clean Online Orders
Let’s open up the file(s) in the 02-Stu_Transform_and_Clean_Online_Orders
to get started.
List Comprehensions Review
Let’s open up the file(s) in the 03-Ins_List_Comprehensions
to get started.
In this example, we are transforming each row with a simple logic through a for loop.
This might not be the most efficient way of transformation though, especially if you have millions of records. We would optimize it with vector math, but that’s outside the scope of the course.
Students Do: List Comprehensions Review
Let’s open up the file(s) in the 04-Stu_List_Comprehensions
to get started.
Transform and Clean Grocery Orders
Let’s open up the file(s) in the 05-Evr_Transform_and_Clean_Grocery_Orders
to get started.
Basic Regex Pattern Matching
Let’s open up the file(s) in the 06-Ins_BasicRegex_PatternMatching
to get started.
Regular expressions, Regex for short, is all about manipulating string through finding patterns.
It is particularly powerful and important especially if you need to do a lot of text analysis and clean up. However, it takes a lot of getting used to, and even if you do know it, you can get rusty with it if you don’t use it often.
However, it has little bearings on data analysis if you don’t deal with a lot of text manipulation. I wouldn’t sweat it, and I only dive into it when I need to.
Regex Matching with Pandas
Let’s open up the file(s) in the 07-Evr_Regex_Matching_Pandas
to get started.
ETL Mini Project Overview
This is to be done in pairs or threes and it covers Project 2 of your course.