We’ll be diving deeper into PCA and how you can apply it to your ML work.

Let’s recap what unsupervised learning is.

  • The data comes in unlabeled, which means you do not know the outcomes for these data sets.
  • Your task as a data analyst is to:
    • Identify patterns and relationships by clustering and categorizing the data.
    • Analyze why and how these patterns and relationships contribute to the business outcomes.
    • Determine action items for your cause from the segmentation and cluster analysis.

Introduction to PCA

Let’s open up the file(s) in the 02-Ins_PCA folder to get started.

To start, we are going to review the video at this link: https://experiments.withgoogle.com/visualizing-high-dimensional-space

This is the scikit-learn class we’ll be using in Python: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

As you saw in the video, it takes a lot of data points to represent the meaning of sentences and paragraphs in a text.

The purpose of PCA is to reduce the number of dimensions by dropping the parts of the data that carry little value (noise) while preserving the parts that significantly impact our ML model.
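To make the idea concrete, here is a minimal sketch of reducing a data set from 4 dimensions to 2 with scikit-learn. The Iris data set and the choice of 2 components are assumptions made for illustration only; they are not part of the class files.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Iris ships with scikit-learn: 150 samples, 4 numeric features.
X = load_iris().data

# Put every feature on the same scale before applying PCA.
X_scaled = StandardScaler().fit_transform(X)

# Keep only the 2 components that carry the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print("Share of variance kept by each component:", pca.explained_variance_ratio_)
```

The two values printed at the end show how much of the original variability each kept component preserves.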

Here’s a video link to explain the math behind PCA:

It is too long to cover in class, and we don’t have to calculate the values by hand to make it work.

How does PCA work in a nutshell?

PCA works by computing a variance ratio score for each principal component, so we can find the directions that contribute most to the variability of the data set. We do this by:

  • Finding the center of the data set by taking the mean of each dimension, and shifting the data so that this center becomes the new origin.
  • Finding the best-fitting line through the new origin.
    • The direction of the best-fitting line is called the eigenvector of PC1, and the variation along it (its eigenvalue) determines PC1’s share of the variance ratio. We start off with PC1.
  • Drawing a line through the origin perpendicular to PC1 to create the next axis; this is the eigenvector of PC2.
  • The sum of squared distances between the new origin and the data points projected onto each component gives the variation score for PC1 and for PC2.
  • You can keep as many principal components as you like, as long as the number is no larger than the number of dimensions (or the number of samples, whichever is smaller) you fit into the model.
    • However, you can’t plot effectively beyond 3 dimensions.
  • Always scale the data before applying PCA.
    • Because PCA relies on distances measured in each dimension’s own units, a dimension recorded in large units can artificially dominate the components if we do not scale.
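The steps above can be sketched by hand with NumPy and then checked against scikit-learn. This is only an illustration under stated assumptions: the toy data is made up, and using the eigenvectors of the covariance matrix is one common way to get the components (scikit-learn itself uses an SVD internally, but the variance ratios agree).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 200 samples, 3 features.
# Features 0 and 1 move together; feature 2 is independent but in much larger units.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    1000 * rng.normal(size=(200, 1)),
])

# Scale first so no feature dominates just because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Step 1: move the center (the mean of each dimension) to the origin.
X_centered = X_scaled - X_scaled.mean(axis=0)

# Step 2: the eigenvectors of the covariance matrix are the component directions.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
pc1, pc2 = eigenvectors[:, 0], eigenvectors[:, 1]

# Step 3: PC2 is perpendicular to PC1, so their dot product is (almost) zero.
pc1_dot_pc2 = pc1 @ pc2

# Step 4: sum of squared distances from the origin to the points projected on PC1.
projected = X_centered @ pc1
ss_pc1 = (projected ** 2).sum()  # divided by (n - 1), this equals eigenvalues[0]

# Step 5: each component's share of the total variation.
variance_ratio = eigenvalues / eigenvalues.sum()

# Compare with scikit-learn on the scaled data, and on the raw data.
sklearn_ratio = PCA().fit(X_scaled).explained_variance_ratio_
raw_ratio = PCA().fit(X).explained_variance_ratio_

print("By hand :", variance_ratio)
print("sklearn :", sklearn_ratio)
print("Unscaled:", raw_ratio)
```

On the unscaled data, the feature in large units takes almost all of PC1 even though it is unrelated to the other two, which is why scaling comes first.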

How do we choose the number of components for PCA?

  • Doing additional analysis to determine which factors contribute to the output. We can then specify the number of components to use.
  • Passing a fraction as n_components in your PCA, such as: pca = PCA(n_components = 0.95), where:
    • The algorithm will determine the minimal number of components that account for at least 95 percent of the variability of the data.
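Here is a short sketch of the percentage option in practice. The breast cancer data set bundled with scikit-learn (30 features) and the pipeline layout are assumptions chosen for illustration; any scaled numeric data would work the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled data set: 569 samples, 30 numeric features.
X = load_breast_cancer().data

# Scale, then let PCA pick the smallest number of components
# that explains at least 95 percent of the variance.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)

# make_pipeline names each step after its lowercased class name.
pca = pipeline.named_steps["pca"]
print("Original dimensions:", X.shape[1])
print("Components kept:", pca.n_components_)
print("Variance explained:", pca.explained_variance_ratio_.sum())
```

After fitting, n_components_ holds the number of components the algorithm actually kept, so you can report it or reuse it as a fixed integer later.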