This tutorial is for:
- folks who have basic Python skills
- folks who know basic MySQL
Extract-Transform-Load (ETL) is a common term to ingest data. This is important to automate menial tasks and get things done while you’re sleeping. This is also a basic skill that I believe any data engineer or scientist should know.
Failure to know this is catastrophic. You don’t know how to create pipelines to harvest data for your analytical work. As I manage a Data Science team, anybody who does not master this in my team will lose their job. It’s too simple not to know this skill well.
As the name suggests, it has 3 processes:
- Extract: To get data from a source. It could be from an API, or from scraping websites, or from files.
- Transform: To manipulate the data such that you can store it in a database.
- Load: To effectively send data into a database.
For this tutorial, the goal is to create a Python script that would:
- Extract data from a source
- Transform the data which will fit into a table schema
- Our context here is a relational database, and the system we are using is MySQL.
- Before we talk about a table schema, we also want to touch upon some basic knowledge of how we create a table schema.
- Load the data into the above table within a MySQL database.
However, just having a Python script that could do ETL for you isn’t enough. The goal is to make sure your servers are working while you’re sleeping. Thus we are going to use a software utility called ‘cron’ to automate the process.
What is Cron?
Cron helps you to run commands on your system at set intervals of time, or a specific time that you want. These scheduled commands or tasks are known as ‘cron jobs’. Its application is unlimited; you can run processes to backup systems, update or delete files periodically when required. It doesn’t have to apply itself to only doing ETL jobs.
For this tutorial, we’re going to use a Linux operating system called Ubuntu. It’s too much information to deal with what is Linux, but basically Ubuntu is like Windows or macOS. It’s an open-source operating system that allows users to interface with a server easily. And because it’s open-source, it means we don’t have to pay licensing fees to use the system.
Your company may have different requirements, and they may use other Linux distributions such as CentOS or Fedora. Windows 10 has an Ubuntu shell if you enable it.
Ubuntu has a Graphical User Interface (GUI), but because I’m trying to educate budding data engineers or scientists, we are going to stick to the terminal.
In the following tutorials, we will do a walkthrough of how you can automate menial tasks and get data into a relational database with cron. It will be broken up into several posts, and so stay tuned for more updates.