This week’s module is probably why you took up a data analytics course in the first place. It is relevant to today’s data analytics world, and it blends analysis with hands-on experience of a new toolkit for work.

Introduction to Jupyter Notebook

Let’s open up the README.md file in the 01-Ins_Jupyter_Intro folder to get started.

Jupyter Notebook is a playground for data analysts and data scientists, and less so for engineers. It is a preferred tool because:

  • The user interface (UI) is intuitive: blocks of code can be run and analyzed with ease.
    • Variables that you declared in an earlier code block can be referenced in later blocks of the same Jupyter Notebook.
    • Each code block can have a specific objective, which keeps the notebook organized.
  • We regularly perform Exploratory Data Analysis (EDA) and Data Gap Analysis (DGA), and do rapid prototyping before building a machine learning (ML) pipeline.
    • This reduces the risk of failure, as building ML pipelines is expensive in time and resources.
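
To make the point about shared variables concrete, here is a minimal sketch. The `# %%` comments are only markers standing in for separate notebook cells, and the prices are made-up values for illustration.

```python
# %% Cell 1: define the data once
prices = [4.99, 12.50, 7.25]  # made-up values for illustration

# %% Cell 2: reuse the variable from Cell 1 without redefining it
total = sum(prices)

# %% Cell 3: build on the result of Cell 2
average = total / len(prices)
print(f"total={total:.2f}, average={average:.2f}")
```

In a real notebook you could re-run Cell 3 on its own after tweaking it, and `prices` and `total` would still be available from the earlier cells.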

Jupyter Notebooks aren’t how data engineering is usually done because:

  • The code cannot be tested easily, although support for testing code in Jupyter Notebooks has improved in recent years.
  • You need a dedicated server to run Jupyter Notebooks before you can benefit from the UI.
  • There are exceptions where Jupyter Notebooks are used for engineering, especially if you’re using specific tools such as Databricks on AWS, but generally we stick to Python scripts due to cost.

All in all, Jupyter Notebook runs primarily on Python, and if your code runs in a Jupyter Notebook, you can convert it into a Python script.
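
To see why that conversion is possible, it helps to know that a notebook file (.ipynb) is just JSON containing a list of cells. The sketch below is a simplified illustration, not the official converter: it builds a tiny hand-written notebook and joins its code cells into one script. The helper name `notebook_to_script` is hypothetical. In practice you would normally run `jupyter nbconvert --to script` on your notebook file instead.

```python
import json

# A tiny hand-written notebook, standing in for a real .ipynb file.
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["x = 2\n", "y = x * 3\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["print(y)\n"]},
    ],
})


def notebook_to_script(raw: str) -> str:
    """Join the source of every code cell into one script (hypothetical helper)."""
    notebook = json.loads(raw)
    chunks = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue  # skip markdown and raw cells
        source = cell["source"]
        # "source" may be a single string or a list of lines
        chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n".join(chunks)


script = notebook_to_script(notebook_json)
print(script)
```

The markdown cell is dropped and only the Python code survives, which is essentially what ends up in the converted script.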

To launch Jupyter Notebook

  1. You will need to activate your conda dev environment.
    • If you have not installed it, run: conda create -n dev python=3.10 anaconda -y
    • To activate your dev environment, run: conda activate dev
    • To deactivate your environment, run: conda deactivate
    • If you have not activated an environment with Jupyter installed, you will encounter this error: Jupyter command `jupyter-notebook` not found.
  2. In your Terminal/Git Bash, go to the Activities folder for lesson 4.1 in your GitLab repo.
  3. Run: jupyter notebook

Students Do: Comic Book Remix

Introduction to Pandas

Let’s open up the file in the 03-Ins_Pandas_Intro folder to get started.

Pandas is a Python library that we use extensively for data analysis and data science. In layman’s terms, it stores data in tables called DataFrames, and its ease of use is why it is a staple for data analysis and data science in Python.
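
As a quick taste of what a DataFrame looks like, here is a minimal sketch. The column names and values are made up for illustration and are not taken from the class activity files.

```python
import pandas as pd

# Made-up sample data: each key becomes a column, each list holds that column's rows
comics = pd.DataFrame({
    "title": ["Comic A", "Comic B", "Comic C"],
    "price": [3.99, 4.50, 2.75],
    "in_stock": [True, False, True],
})

print(comics)           # the whole table
print(comics.shape)     # (number of rows, number of columns)
print(comics["price"])  # a single column, returned as a Series
```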

Students Do: DataFrame Shop

Instructor Do: DataFrame Functions

Let’s open up the file in the 05-Ins_Data_Functions folder to get started.

From here, you’ll start to appreciate why adequate coding skills are needed to perform data analysis. You don’t need to be a software engineer building applications, but you do need enough skill to manipulate data.
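
To give a feel for the kind of data manipulation this involves, here is a small sketch using a few common DataFrame and Series methods. The data is made up, and the methods shown are only a sample and not necessarily the exact ones covered in the activity file.

```python
import pandas as pd

# Made-up gym data for illustration
sessions = pd.DataFrame({
    "trainer": ["Ana", "Ben", "Ana", "Cy"],
    "session": ["Yoga", "Spin", "Spin", "Yoga"],
    "members": [12, 20, 15, 8],
})

first_rows = sessions.head(2)                             # first two rows of the table
total_members = sessions["members"].sum()                 # add up a numeric column
average_members = sessions["members"].mean()              # average of a numeric column
sessions_per_trainer = sessions["trainer"].value_counts() # how often each trainer appears
unique_sessions = sessions["session"].unique()            # distinct values in a column

print(first_rows)
print(total_members, average_members)
print(sessions_per_trainer)
print(unique_sessions)
```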

Students Do: Training Grounds

Instructor Do: Modifying Columns

Let’s open up the file in the 07-Ins_Column_Manipulation folder to get started.
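
As a rough preview of column manipulation, the sketch below renames, reorders, and adds columns. The data and column names are made up for illustration, and the activity file may take a different route.

```python
import pandas as pd

# Made-up data with inconsistent column names
people = pd.DataFrame({
    "Full Name": ["Ana Lee", "Ben Ong"],
    "yrs": [31, 27],
    "city": ["Perth", "Sydney"],
})

# Rename columns: the dictionary maps old names to new names
renamed = people.rename(columns={"Full Name": "name", "yrs": "age"})

# Reorder columns by selecting them in the order you want
reordered = renamed[["name", "city", "age"]]

# Add a new column derived from an existing one; assign returns a new DataFrame
with_months = reordered.assign(age_in_months=reordered["age"] * 12)

print(with_months)
```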

Students Do: Hey Arnold!

Reading and Writing CSV Files

Let’s open up the file in the 09-Ins_Reading_Writing_CSV folder to get started.

We have already learned how to read from and write to files, so why are we using Pandas instead?

  • Python’s built-in read and write libraries are general-purpose, while Pandas is designed specifically for tabular data.
    • Thus, for tabular data you will find Pandas much easier to use than a general-purpose read/write library.
    • You might still need the general-purpose read and write libraries for other business cases.
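
The difference is easiest to see side by side. The sketch below writes a small made-up table to a temporary CSV file, then reads it back twice: once with Python’s built-in `csv` module and once with Pandas. The file name and data are assumptions for illustration only.

```python
import csv
import tempfile
from pathlib import Path

import pandas as pd

# Work in a temporary folder so the example does not touch your own files
csv_path = Path(tempfile.mkdtemp()) / "comics.csv"

# Write a small made-up table with Pandas; index=False leaves out the row labels
pd.DataFrame({
    "title": ["Comic A", "Comic B"],
    "price": [3.99, 4.50],
}).to_csv(csv_path, index=False)

# General-purpose approach: every value comes back as a string,
# so numbers have to be converted by hand
with open(csv_path, newline="") as handle:
    rows = list(csv.DictReader(handle))
generic_total = sum(float(row["price"]) for row in rows)

# Pandas approach: one call, and numeric columns are detected for you
comics = pd.read_csv(csv_path)
pandas_total = comics["price"].sum()

print(rows)
print(comics)
print(generic_total, pandas_total)
```

Both routes reach the same total, but the Pandas version hands you a ready-made table with typed columns.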

Students Do: Comic Books Part 1

Students Do: Comic Books Part 2