In the previous lesson, we focused largely on concepts. Today, we will practise what we have learned in greater depth.
Elbow Warm Up
Let’s open up the file(s) in the 01-Ins_Elbow_Warm_Up folder to get started.
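If you’d like a reference while you work, here’s a minimal elbow-method sketch. The DataFrame, columns, and k range are made up for illustration; the activity’s dataset will differ.

```python
# Minimal elbow-method sketch (hypothetical data).
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "x": [1, 2, 1, 8, 9, 8],
    "y": [1, 2, 2, 8, 9, 8],
})

# Fit KMeans for a range of k values and record each model's inertia.
inertia = []
k_values = list(range(1, 6))
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    model.fit(df)
    inertia.append(model.inertia_)

# The "elbow" is the k where the inertia curve flattens out.
elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
print(elbow_df)  # or, after `import hvplot.pandas`: elbow_df.hvplot.line(x="k", y="inertia")
```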
Students Do: Warm Up
Let’s open up the file(s) in the 02-Stu_Warm_Up folder to get started.
You might need to install these libraries to get your visualizations to work:
- Activate your conda dev environment:
conda activate dev
- Install hvPlot:
conda install -c pyviz hvplot
- Then launch your Jupyter Notebook:
jupyter notebook
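Once that’s installed, a quick smoke test like this (with made-up data) confirms hvPlot is wired up in your notebook:

```python
# Smoke test: confirm hvPlot renders inside Jupyter (hypothetical data).
import pandas as pd
import hvplot.pandas  # registers the .hvplot accessor on DataFrames

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})
df.hvplot.scatter(x="x", y="y")
```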
Scaling Data
Let’s open up the file(s) in the 03-Ins_Scaling_Data folder to get started.
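As a quick reference, here’s a minimal scaling sketch using scikit-learn’s StandardScaler; the columns here are hypothetical.

```python
# Minimal StandardScaler sketch (hypothetical columns).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "volume": [100.0, 50.0, 25.0]})

# Rescale each column to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
print(scaled_df)
```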
What is Principal Component Analysis (PCA)?
PCA is a dimensionality reduction method used to reduce the number of input variables in a dataset before feeding it to an ML model.
Why would we want to reduce the number of dimensions (input variables) in an ML model?
Suppose you’re building a model to predict which of two basketball teams will win. There are hundreds of parameters to consider in a game, such as field goal attempts, pass completions, and so forth.
We know that to win a basketball game, the team stats must contribute to the team’s points. However, not all stats contribute equally to points.
- For example, defensive rebounds matter less than three-pointers made when predicting which team will win on points.
Treating every parameter as equally important is a serious mistake in ML, so we prefer to reduce the number of dimensions (input variables) in our model to:
- Reduce noise from the data
- Prevent overfitting
- Overfitting is a phenomenon where a model is highly accurate on its training data but cannot replicate that accuracy when used in practice.
- Overfitting happens when the model is fitted too closely to the training data; reducing the number of input variables leaves room for the model to generalize properly.
- For example, consider NFL Hail Mary plays.
- A Hail Mary is a desperation play, usually attempted by a losing team: a very long touchdown pass attempt that carries a high risk of turning the ball over.
- Between 2009 and 2020, there were 193 attempts, of which 16 produced touchdowns.
- If I’m predicting a future NFL game, should I consider Hail Mary plays as part of my model?
- Typically not: Hail Mary plays are outside the norm of a regular game, and you can’t predict with today’s data whether one will happen in tomorrow’s game. They will not be a principal component within my model.
- Including them can only work if you’re doing real-time predictions for the next play.
PCA will be covered heavily in the next lesson.
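As a small preview, here’s a minimal scikit-learn PCA sketch using made-up team stats; the column names are hypothetical.

```python
# Preview sketch: projecting hypothetical team stats onto 2 principal components.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

stats = pd.DataFrame({
    "fg_attempts":  [80, 85, 78, 90, 82],
    "pass_pct":     [0.62, 0.55, 0.70, 0.48, 0.66],
    "def_rebounds": [30, 35, 28, 40, 33],
})

# Scale first so no single stat dominates, then reduce 3 inputs to 2 components.
scaled = StandardScaler().fit_transform(stats)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# How much of the original variance each component captures.
print(pca.explained_variance_ratio_)
```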
Preprocessing Data
Let’s open up the file(s) in the 04-Evr_Preprocessing folder to get started.
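The activity’s exact steps may differ, but a typical preprocessing pass looks something like this sketch (hypothetical columns):

```python
# Typical preprocessing sketch: encode categoricals, then scale (hypothetical data).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [45000, 82000, 61000],
    "region": ["east", "west", "east"],
})

# Encode the categorical column as dummy/indicator variables.
encoded = pd.get_dummies(df, columns=["region"])

# Scale all features onto comparable ranges.
scaled = StandardScaler().fit_transform(encoded)
print(scaled)
```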
Students Do: Standardizing Stock Data
Let’s open up the file(s) in the 05-Stu-Standardizing_Stock_Data folder to get started.
Clustering Complex Data
Let’s open up the file(s) in the 06-Ins-Complex-Data folder to get started.
In ML, there are many ways to do clustering. We will look at BIRCH and Agglomerative Clustering.
What is hierarchical clustering?
- Hierarchical clustering organizes data into a tree, where each level of the hierarchy groups together the points (or clusters) that are most similar to one another.
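One way to see that tree is a dendrogram. Here’s a minimal SciPy sketch with made-up points; SciPy isn’t required by the activity, this is just for intuition.

```python
# Minimal dendrogram sketch (hypothetical points).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = np.array([[1, 1], [2, 1], [8, 8], [9, 8], [5, 5]])

# Build the merge tree with Ward linkage, then draw it.
tree = linkage(points, method="ward")
dendrogram(tree)
plt.show()
```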
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html
BIRCH incrementally builds a height-balanced Clustering Feature (CF) tree that summarizes the dataset, then groups the tree’s leaf entries into the final k clusters.
BIRCH does not have an inertia score. Here’s a reference to clustering performance evaluation: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
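Here’s a minimal BIRCH sketch with scikit-learn; the points are made up.

```python
# Minimal BIRCH sketch (hypothetical points).
import numpy as np
from sklearn.cluster import Birch

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])

# n_clusters sets the number of final clusters read off the CF tree.
model = Birch(n_clusters=2)
labels = model.fit_predict(X)
print(labels)
```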
Agglomerative Clustering
Agglomerative clustering is a bottom-up approach: each point starts as its own cluster, and the closest pair of clusters is merged repeatedly until k clusters are left.
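And the agglomerative equivalent, again with made-up points:

```python
# Minimal agglomerative clustering sketch (hypothetical points).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])

# Each point starts as its own cluster; closest pairs merge until 2 remain.
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
print(labels)
```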
Calinski-Harabasz Index (Variance Ratio Criterion)
Ref: https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index
- A higher Calinski-Harabasz score relates to a model with better defined clusters.
- The index is the ratio of the sum of between-cluster dispersion to the within-cluster dispersion for all clusters, where dispersion is defined as the sum of squared distances.
- Advantages:
- Score is higher when clusters are dense and well separated.
- Score is fast to compute.
- Drawbacks:
- The score is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters.
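Here’s a minimal sketch of computing the score with scikit-learn, using made-up points and BIRCH (any clusterer’s labels would work):

```python
# Minimal Calinski-Harabasz sketch (hypothetical points).
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import calinski_harabasz_score

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])

labels = Birch(n_clusters=2).fit_predict(X)

# Higher score = denser, better-separated clusters.
print(calinski_harabasz_score(X, labels))
```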
Students Do: Segmenting Customer Data
Let’s open up the file(s) in the 07-Stu_Segmenting_Customers
folder to get started.
In my notes, I created visualizations showing how you can use the Calinski-Harabasz score to determine the optimal number of clusters; a simple version of the idea is sketched below.
- In truth, data scientists use a variety of methods to determine the optimal number of clusters in unsupervised learning.
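For example, one common approach (sketched here with made-up data and KMeans, which the activity may or may not use) is to score several candidate values of k and look for a peak:

```python
# Sketch: compare Calinski-Harabasz scores across candidate k values (hypothetical data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9], [4, 5], [5, 4]])

# A peak in the score (not necessarily the largest k) suggests a good choice.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))
```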