In the previous lesson we focused largely on concepts. Today, we will practice what we have learned in more depth.

Elbow Warm Up

Let’s open up the file(s) in the 01-Ins_Elbow_Warm_Up folder to get started.

You might need to install these libraries to get your visualization to work:

  1. Activate your conda dev environment: conda activate dev
  2. Run: conda install -c pyviz hvplot
  3. Then run your Jupyter Notebook: jupyter notebook
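
As a rough sketch of what the warm-up builds, here is a minimal elbow curve using KMeans and hvplot. The DataFrame name df_scaled is a hypothetical placeholder for your scaled numeric features, not the variable used in the activity file:

  # Minimal elbow-method sketch (df_scaled is a hypothetical scaled feature DataFrame)
  import pandas as pd
  import hvplot.pandas  # registers the .hvplot accessor on DataFrames
  from sklearn.cluster import KMeans

  k_values = list(range(1, 11))
  inertia = []

  # Fit KMeans for each candidate k and record the inertia
  # (sum of squared distances from each point to its cluster center)
  for k in k_values:
      model = KMeans(n_clusters=k, random_state=1)
      model.fit(df_scaled)
      inertia.append(model.inertia_)

  # Plot inertia vs. k; the "elbow" where the curve flattens suggests a good k
  elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
  elbow_df.hvplot.line(x="k", y="inertia", title="Elbow Curve")

The last line only renders inside Jupyter, which is why hvplot and the notebook need to be installed first.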

Scaling Data

Let’s open up the file(s) in the 03-Ins_Scaling_Data folder to get started.
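
As a rough sketch of what scaling usually looks like here (df is a hypothetical numeric feature DataFrame, not the one in the activity file):

  # Minimal scaling sketch (df is a hypothetical numeric feature DataFrame)
  import pandas as pd
  from sklearn.preprocessing import StandardScaler

  # StandardScaler rescales each column to mean 0 and standard deviation 1,
  # so features on different scales contribute comparably to distance-based models
  scaled_values = StandardScaler().fit_transform(df)

  # Rebuild a DataFrame so the scaled values keep their original column names
  df_scaled = pd.DataFrame(scaled_values, columns=df.columns)

Scaling matters for K-means and the other distance-based methods below, because unscaled features with large ranges would otherwise dominate the distance calculations.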

What is Principal Component Analysis (PCA)?

PCA is a dimensionality reduction method used to reduce the number of input variables from a dataset for an ML model.

Why would we want to reduce the number of dimensions (input variables) in a ML model?

Suppose you’re building a model to predict which of two basketball teams would win. There are hundreds of parameters we could consider in a game, such as field goal attempts, pass completions, and so forth.

We know that to win a basketball game, the team’s stats must translate into the team’s points. However, not all stats contribute equally to points.

  • Defensive rebounds will carry less weight than a team’s 3-pointers made if we want to create a model that predicts who would win by points.

Treating every parameter with equal weight is a massive mistake in ML, and we would prefer to reduce the number of dimensions (input variables) in our model to:

  • Reduce noise from the data
  • Prevent overfitting
    • Overfitting is a phenomenon where the model is highly accurate on the training data but cannot replicate that accuracy when it is used in practice.
    • Overfitting happens when the model is fitted too closely to our training data, and reducing the number of input variables gives the model room to generalize properly.
    • For example, NFL hail mary plays.
      • A Hail Mary is a desperation play, usually attempted by the losing team, that tries for a long touchdown pass and carries a very high risk of turning the ball over.
        • Between 2009 and 2020, there were 193 attempts, of which 16 produced touchdowns
      • If I’m predicting a future NFL game, should I consider hail mary plays as part of my model?
        • Typically not, because Hail Mary plays are outside the norm of regular games, and you can’t predict whether one will happen in tomorrow’s game using today’s data. It will not be a principal component in my model.
        • This can only work if you’re doing real-time predictions for the next play.

PCA will be covered heavily in the next lesson.
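
Until then, here is a minimal sketch of what dimensionality reduction with scikit-learn’s PCA looks like (df_scaled is again a hypothetical scaled feature DataFrame):

  # Minimal PCA sketch (df_scaled is a hypothetical scaled feature DataFrame)
  import pandas as pd
  from sklearn.decomposition import PCA

  # Reduce the dataset to 2 principal components
  pca = PCA(n_components=2)
  components = pca.fit_transform(df_scaled)
  df_pca = pd.DataFrame(components, columns=["PC1", "PC2"])

  # explained_variance_ratio_ shows how much of the original variance each component keeps
  print(pca.explained_variance_ratio_)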

Preprocessing Data

Let’s open up the file(s) in the 04-Evr_Preprocessing folder to get started.
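
As a rough sketch of the kind of preprocessing this activity covers (the DataFrame df and the categorical column "position" are hypothetical placeholders):

  # Minimal preprocessing sketch (df and the "position" column are hypothetical)
  import pandas as pd
  from sklearn.preprocessing import StandardScaler

  # Convert a categorical column into numeric dummy/indicator columns
  df_encoded = pd.get_dummies(df, columns=["position"])

  # Scale everything so no single feature dominates the distance calculations
  df_preprocessed = pd.DataFrame(
      StandardScaler().fit_transform(df_encoded),
      columns=df_encoded.columns,
  )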

Clustering Complex Data

Let’s open up the file(s) in the 06-Ins-Complex-Data folder to get started.

In ML, there are many ways to do clustering. We will look at BIRCH and Agglomerative Clustering.

What is hierarchical clustering?

  • Hierarchical clustering categorizes data using a tree. The hierarchy shows how similar the data points (and clusters) are to each other.
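
A quick way to see that tree is a dendrogram. The sketch below uses SciPy (df_scaled is a hypothetical scaled feature DataFrame):

  # Minimal dendrogram sketch (df_scaled is a hypothetical scaled feature DataFrame)
  import matplotlib.pyplot as plt
  from scipy.cluster.hierarchy import dendrogram, linkage

  # linkage() builds the merge tree; "ward" merges the pair of clusters that
  # gives the smallest increase in within-cluster variance at each step
  merge_tree = linkage(df_scaled, method="ward")

  # The height of each join in the dendrogram reflects how dissimilar
  # the two merged clusters are
  dendrogram(merge_tree)
  plt.show()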

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html

BIRCH incrementally builds a Clustering Feature (CF) tree that summarizes the data, then groups the leaf entries of that tree into k clusters, which makes it efficient on large datasets.

BIRCH does not have an inertia score. Here’s a reference to clustering performance evaluation: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
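
A minimal sketch of fitting BIRCH with scikit-learn (df_scaled and n_clusters=3 are assumptions for illustration):

  # Minimal BIRCH sketch (df_scaled and n_clusters=3 are hypothetical)
  from sklearn.cluster import Birch

  # n_clusters sets the final number of clusters produced from the CF tree
  model = Birch(n_clusters=3)
  model.fit(df_scaled)

  # Cluster label predicted for each row
  birch_labels = model.predict(df_scaled)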

Agglomerative Clustering

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering

Agglomerative clustering is a bottom-up approach: each point starts as its own cluster, and the closest pairs of clusters are merged until k clusters remain.
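
A minimal sketch with scikit-learn (again assuming df_scaled and an illustrative k of 3):

  # Minimal agglomerative clustering sketch (df_scaled and n_clusters=3 are hypothetical)
  from sklearn.cluster import AgglomerativeClustering

  # Merging stops once n_clusters clusters remain; linkage="ward" merges the pair
  # of clusters that minimizes the increase in within-cluster variance
  model = AgglomerativeClustering(n_clusters=3, linkage="ward")

  # There is no separate predict(); fit_predict returns the label for each row
  agglo_labels = model.fit_predict(df_scaled)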

Calinski-Harabasz Index (Variance Ratio Criterion)

Ref: https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index

  • A higher Calinski-Harabasz score relates to a model with better defined clusters.
  • The index is the ratio of the sum of between-cluster dispersion and of within-cluster dispersion for all clusters, where dispersion is defined as the sum of squared distances.
  • Advantages:
    • Score is higher when clusters are dense and well separated.
    • Score is fast to compute.
  • Drawbacks:
    • The score is generally higher for convex clusters than for other cluster concepts, such as density-based clusters.

In my notes, I created visualizations on how you can use the Calinski-Harabasz score to determine the optimal number of clusters.

  • In truth, data scientists use a variety of methods to determine the optimal number of clusters in unsupervised learning.
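
One such check, as a rough sketch with the Calinski-Harabasz score (df_scaled and the range of k values are assumptions for illustration):

  # Minimal sketch: compare candidate cluster counts with the Calinski-Harabasz score
  # (df_scaled is a hypothetical scaled feature DataFrame)
  import pandas as pd
  from sklearn.cluster import AgglomerativeClustering
  from sklearn.metrics import calinski_harabasz_score

  k_values = list(range(2, 11))  # the score needs at least 2 clusters
  scores = []

  for k in k_values:
      labels = AgglomerativeClustering(n_clusters=k).fit_predict(df_scaled)
      # Higher score = denser, better-separated clusters
      scores.append(calinski_harabasz_score(df_scaled, labels))

  print(pd.DataFrame({"k": k_values, "calinski_harabasz": scores}))

A common pick is the k with the highest score, but as noted above, it is best read alongside other measures such as the elbow curve rather than trusted on its own.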