In the previous lesson, we focused largely on concepts. Today, we will practise what we have learned in greater depth.
Elbow Warm Up
Let’s open up the file(s) in the 01-Ins_Elbow_Warm_Up folder to get started.
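If you’d like a reference while you work, here’s a minimal elbow-method sketch. The DataFrame, columns, and k range are made up for illustration; the activity’s dataset will differ.

```python
# Minimal elbow-method sketch (hypothetical data).
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "x": [1, 2, 1, 8, 9, 8],
    "y": [1, 2, 2, 8, 9, 8],
})

# Fit KMeans for a range of k values and record each model's inertia.
inertia = []
k_values = list(range(1, 6))
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    model.fit(df)
    inertia.append(model.inertia_)

# The "elbow" is the k where the inertia curve flattens out.
elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
print(elbow_df)  # or, after `import hvplot.pandas`: elbow_df.hvplot.line(x="k", y="inertia")
```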
Students Do: Warm Up
Let’s open up the file(s) in the 02-Stu_Warm_Up folder to get started.
You might need to install these libraries to get your visualizations to work:
- Activate your conda dev environment:
conda activate dev
- Install hvPlot:
conda install -c pyviz hvplot
- Then launch your Jupyter Notebook:
jupyter notebook
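Once that’s installed, a quick smoke test like this (with made-up data) confirms hvPlot is wired up in your notebook:

```python
# Smoke test: confirm hvPlot renders inside Jupyter (hypothetical data).
import pandas as pd
import hvplot.pandas  # registers the .hvplot accessor on DataFrames

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})
df.hvplot.scatter(x="x", y="y")
```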
Scaling Data
Let’s open up the file(s) in the 03-Ins_Scaling_Data folder to get started.
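As a quick reference, here’s a minimal scaling sketch using scikit-learn’s StandardScaler; the columns here are hypothetical.

```python
# Minimal StandardScaler sketch (hypothetical columns).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "volume": [100.0, 50.0, 25.0]})

# Rescale each column to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
print(scaled_df)
```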
What is Principal Component Analysis (PCA)?
PCA is a dimensionality reduction method used to reduce the number of input variables in a dataset before feeding it to an ML model.
Why would we want to reduce the number of dimensions (input variables) in an ML model?
Suppose you’re building a model to predict which of two basketball teams will win. There are hundreds of parameters to consider in a game, such as field goal attempts, pass completions, and so forth.
We know that to win a basketball game, the team stats must contribute to the team’s points. However, not all stats contribute equally to points.
- For example, defensive rebounds matter less than three-pointers made when predicting which team will win on points.
Treating every parameter as equally important is a serious mistake in ML, so we prefer to reduce the number of dimensions (input variables) in our model to:
- Reduce noise from the data
- Prevent overfitting
- Overfitting is a phenomenon where a model is highly accurate on its training data but cannot replicate that accuracy when used in practice.
- Overfitting happens when the model is fitted too closely to the training data; reducing the number of input variables leaves room for the model to generalize properly.
- For example, consider NFL Hail Mary plays.
- A Hail Mary is a desperation play, usually attempted by a losing team: a very long touchdown pass attempt that carries a high risk of turning the ball over.
- Between 2009 and 2020, there were 193 attempts, of which 16 produced touchdowns.
- If I’m predicting a future NFL game, should I consider Hail Mary plays as part of my model?
- Typically not: Hail Mary plays are outside the norm of a regular game, and you can’t predict with today’s data whether one will happen in tomorrow’s game. They will not be a principal component within my model.
- Including them can only work if you’re doing real-time predictions for the next play.
PCA will be covered heavily in the next lesson.
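As a small preview, here’s a minimal scikit-learn PCA sketch using made-up team stats; the column names are hypothetical.

```python
# Preview sketch: projecting hypothetical team stats onto 2 principal components.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

stats = pd.DataFrame({
    "fg_attempts":  [80, 85, 78, 90, 82],
    "pass_pct":     [0.62, 0.55, 0.70, 0.48, 0.66],
    "def_rebounds": [30, 35, 28, 40, 33],
})

# Scale first so no single stat dominates, then reduce 3 inputs to 2 components.
scaled = StandardScaler().fit_transform(stats)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# How much of the original variance each component captures.
print(pca.explained_variance_ratio_)
```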
Preprocessing Data
Let’s open up the file(s) in the 04-Evr_Preprocessing folder to get started.
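The activity’s exact steps may differ, but a typical preprocessing pass looks something like this sketch (hypothetical columns):

```python
# Typical preprocessing sketch: encode categoricals, then scale (hypothetical data).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [45000, 82000, 61000],
    "region": ["east", "west", "east"],
})

# Encode the categorical column as dummy/indicator variables.
encoded = pd.get_dummies(df, columns=["region"])

# Scale all features onto comparable ranges.
scaled = StandardScaler().fit_transform(encoded)
print(scaled)
```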
Students Do: Standardizing Stock Data
Let’s open up the file(s) in the 05-Stu-Standardizing_Stock_Data folder to get started.
Clustering Complex Data
Let’s open up the file(s) in the 06-Ins-Complex-Data folder to get started.
In ML, there are many ways to do clustering. We will look at BIRCH and Agglomerative Clustering.
What is hierarchical clustering?
- Hierarchical clustering organizes data into a tree, where each level of the hierarchy groups together the points (or clusters) that are most similar to one another.
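One way to see that tree is a dendrogram. Here’s a minimal SciPy sketch with made-up points; SciPy isn’t required by the activity, this is just for intuition.

```python
# Minimal dendrogram sketch (hypothetical points).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = np.array([[1, 1], [2, 1], [8, 8], [9, 8], [5, 5]])

# Build the merge tree with Ward linkage, then draw it.
tree = linkage(points, method="ward")
dendrogram(tree)
plt.show()
```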
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)
Ref: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html
BIRCH incrementally builds a height-balanced Clustering Feature (CF) tree that summarizes the dataset, then groups the tree’s leaf entries into the final k clusters.
BIRCH does not have an inertia score. Here’s a reference to clustering performance evaluation: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
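Here’s a minimal BIRCH sketch with scikit-learn; the points are made up.

```python
# Minimal BIRCH sketch (hypothetical points).
import numpy as np
from sklearn.cluster import Birch

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])

# n_clusters sets the number of final clusters read off the CF tree.
model = Birch(n_clusters=2)
labels = model.fit_predict(X)
print(labels)
```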
Agglomerative Clustering
Agglomerative clustering is a bottom-up approach: each point starts as its own cluster, and the closest pair of clusters is merged repeatedly until k clusters are left.
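And the agglomerative equivalent, again with made-up points:

```python
# Minimal agglomerative clustering sketch (hypothetical points).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])

# Each point starts as its own cluster; closest pairs merge until 2 remain.
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
print(labels)
```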
Calinski-Harabasz Index (Variance Ratio Criterion)
Ref: https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index
- A higher Calinski-Harabasz score relates to a model with better defined clusters.
- The index is the ratio of the sum of between-cluster dispersion to the within-cluster dispersion for all clusters, where dispersion is defined as the sum of squared distances.
- Advantages:
- Score is higher when clusters are dense and well separated.
- Score is fast to compute.
- Drawbacks:
- The score is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters.
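Here’s a minimal sketch of computing the score with scikit-learn, using made-up points and BIRCH (any clusterer’s labels would work):

```python
# Minimal Calinski-Harabasz sketch (hypothetical points).
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import calinski_harabasz_score

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])

labels = Birch(n_clusters=2).fit_predict(X)

# Higher score = denser, better-separated clusters.
print(calinski_harabasz_score(X, labels))
```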
Students Do: Segmenting Customer Data
Let’s open up the file(s) in the 07-Stu_Segmenting_Customers
folder to get started.
In my notes, I created visualizations showing how you can use the Calinski-Harabasz score to determine the optimal number of clusters; a simple version of the idea is sketched below.
- In truth, data scientists use a variety of methods to determine the optimal number of clusters in unsupervised learning.
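For example, one common approach (sketched here with made-up data and KMeans, which the activity may or may not use) is to score several candidate values of k and look for a peak:

```python
# Sketch: compare Calinski-Harabasz scores across candidate k values (hypothetical data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9], [4, 5], [5, 4]])

# A peak in the score (not necessarily the largest k) suggests a good choice.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))
```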