This week’s module is probably the reason you signed up for a data analytics course in the first place. It is directly relevant to today’s data analytics work, and it blends analysis with learning a new toolkit.
Introduction to Jupyter Notebook
Let’s open up the README.md file in the 01-Ins_Jupyter_Intro folder to get started.
Jupyter Notebook is a playground for data analysts and data scientists rather than an engineering tool. It is a preferred tool because:
- The user interface (UI) is intuitive: blocks of code can be run and analyzed individually with ease.
- Variables declared in an earlier code block can be referenced in later blocks of the same notebook.
- Each code block can have a specific objective, which keeps the notebook well organized.
- We regularly perform Exploratory Data Analysis (EDA) and Data Gap Analysis (DGA), and prototype rapidly, before building a machine learning (ML) pipeline.
- Prototyping first reduces the risk of failure, as building ML pipelines is expensive in time and resources.
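The point about variables carrying over between code blocks can be sketched in plain Python. This is only an illustration: the `# %%` comments stand in for notebook cells, and the `prices` values are made-up sample data.

```python
# %% Cell 1: define the data once (hypothetical sample values)
prices = [4.99, 12.50, 7.25]

# %% Cell 2: reuse the variable from Cell 1 without redefining it
total = sum(prices)

# %% Cell 3: build on the result of Cell 2
average = total / len(prices)
print(f"total={total:.2f}, average={average:.2f}")
```

In a real notebook each of these sections would be its own cell, and you could re-run Cell 3 on its own as long as Cells 1 and 2 had already been run in the same session.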
Jupyter notebooks are not how data engineering is usually done because:
- The code cannot be tested easily, although tooling for testing code in Jupyter notebooks has appeared in recent years.
- You need a running Jupyter server to use notebooks and benefit from the UI.
- There are exceptions where notebooks are used for engineering, especially with specific tools such as Databricks on AWS, but generally we stick to plain Python scripts due to cost.
All in all, Jupyter Notebook runs primarily on Python, and anything you can run in a notebook can be converted into a Python script.
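To make the notebook-to-script idea concrete, here is a minimal sketch that treats a notebook as what it is on disk: a JSON document with a list of cells. The tiny notebook below and the helper name `notebook_to_script` are hypothetical, written only for this illustration.

```python
import json

# A tiny hand-written notebook in the nbformat 4 JSON layout (hypothetical content).
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["x = 2\n", "y = x * 21"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["print(y)"]},
    ],
})


def notebook_to_script(raw: str) -> str:
    """Join the source of every code cell into one Python script."""
    notebook = json.loads(raw)
    chunks = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue  # markdown cells are skipped in this sketch
        source = cell["source"]
        # The source may be stored as one string or as a list of lines.
        chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(chunks) + "\n"


script = notebook_to_script(notebook_json)
print(script)
```

In day-to-day work you would normally not write this yourself: running `jupyter nbconvert --to script` on a notebook file produces the script for you.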
To start Jupyter Notebook:
- You will need to activate your conda dev environment.
- If you have not created it yet, run:
conda create -n dev python=3.10 anaconda -y
- To activate your dev environment, run:
conda activate dev
- To deactivate your environment, run:
conda deactivate
- If you don’t have the virtual environment set up and activated, you may encounter this error:
Jupyter command `jupyter-notebook` not found.
- In your Terminal/Git Bash, go to the Activities folder for lesson 4.1 in your GitLab repo.
- Run:
jupyter notebook
Students Do: Comic Book Remix
Let’s open up the file in the 02-Stu_Comics_Remix_Jupyter folder to get started.
Introduction to Pandas
Let’s open up the file in the 03-Ins_Pandas_Intro folder to get started.
Pandas is a Python library that we use extensively for data analysis and data science. In layman’s terms, it stores data in tables called data frames, and its ease of use is why it is a staple for data analysis and data science in Python.
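As a first look at what such a table is, the sketch below builds a small DataFrame from a dictionary. The column names and values are invented for illustration and are not taken from the class activities.

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column,
# and each list holds that column's values row by row.
comics = pd.DataFrame({
    "title": ["Comic A", "Comic B", "Comic C"],
    "year": [1991, 2003, 2015],
    "price": [2.50, 3.99, 4.25],
})

print(comics.head())    # first rows of the table
print(comics.shape)     # (number of rows, number of columns)
print(comics["price"])  # a single column, selected by name
```

Selecting a column by name, as in the last line, is the pattern most of the later activities build on.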
Students Do: DataFrame Shop
Let’s open up the README.md file in the 04-Stu_DataFrameShop_Pandas folder to get started.
Instructor Do: DataFrame Functions
Let’s open up the file in the 05-Ins_Data_Functions folder to get started.
From here, you’ll start to appreciate why adequate coding skills are needed to perform data analysis. You don’t need to be a software engineer who builds applications, but you do need enough skill to manipulate data.
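To give a feel for the kind of data manipulation meant here, the following sketch runs a few common DataFrame functions on a made-up table. The `sales` data is hypothetical, and the functions shown are a small sample rather than the exact set covered in the activity.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "units": [10, 4, 6, 8],
})

summary = sales.describe()               # count, mean, std, min, max, quartiles
regions = sales["region"].unique()       # distinct values in a column
counts = sales["region"].value_counts()  # how often each value appears
total_units = sales["units"].sum()       # column total
mean_units = sales["units"].mean()       # column average

print(summary)
print(regions)
print(counts)
print(total_units, mean_units)
```

Each of these is a one-line call, which is the level of coding the lesson is asking for: enough to question a table, not to build an application.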
Students Do: Training Grounds
Let’s open up the file in the 06-Stu_TrainingGrounds_DataFunctions folder to get started.
Instructor Do: Modifying Columns
Let’s open up the file in the 07-Ins_Column_Manipulation folder to get started.
Students Do: Hey Arnold!
Look at the file(s) in the 08-Stu_Hey_Arnold_DataFrame_Formatting folder.
Reading and Writing CSV Files
Let’s open up the file in the 09-Ins_Reading_Writing_CSV folder to get started.
We have already learned how to read and write files in Python, so why are we using Pandas instead?
- Python’s read and write libraries are general-purpose, while Pandas is built specifically for tabular data.
- Thus, you will find Pandas much easier to use for tables than a general-purpose read/write library.
- You might still need the general-purpose read and write libraries for other business cases.
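The difference between the two approaches can be sketched side by side. The CSV text below is made-up sample data, and `io.StringIO` is used only as a stand-in for a real file so the example runs without touching disk; with a real file you would pass its path instead.

```python
import csv
import io

import pandas as pd

# Hypothetical CSV content for illustration.
raw = "title,price\nComic A,2.50\nComic B,3.99\n"

# General-purpose approach: the csv module hands back strings,
# so numeric columns must be converted by hand.
rows = list(csv.DictReader(io.StringIO(raw)))
csv_total = sum(float(row["price"]) for row in rows)

# Pandas approach: the table is parsed and numeric columns are typed for us.
df = pd.read_csv(io.StringIO(raw))
pandas_total = df["price"].sum()

# Writing the table back out as CSV, without the row index column.
out = io.StringIO()
df.to_csv(out, index=False)

print(csv_total, pandas_total)
print(out.getvalue())
```

Both routes reach the same total, but the Pandas version needs no manual type conversion and gives you a table you can keep analyzing.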
Students Do: Comic Books Part 1
Look at the README.md in the 10-Stu_Comic_Books_CSV folder to start your activity.
Students Do: Comic Books Part 2
Look at the README.md in the 11-Stu_Comic_Books_Summary folder to start your activity.