We’ll be diving deeper into PCA and how you can apply it to your ML work.

Let’s recap what unsupervised learning is.

- The data comes in unlabeled, which means you do not know the outcomes of these data sets.
- Your task as a data analyst is to:
  - Identify patterns and relationships by clustering and categorizing the data.
  - Analyze why and how these patterns and relationships contribute to business outcomes.
  - From the segmentation and cluster analysis, determine action items for your cause.

## Warm Up

Let’s open up the file(s) in the `01-Warm_Up` folder to get started.

## Introduction to PCA

Let’s open up the file(s) in the `02-Ins_PCA` folder to get started.

For starters, we are going to review the video at this link: https://experiments.withgoogle.com/visualizing-high-dimensional-space

This is the scikit-learn class we’ll be using in Python: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

As you saw in the video, it takes a very large number of data points and dimensions to capture the meaning of sentences and paragraphs in a text.

The purpose of PCA is to reduce away the parts of the data that carry little value (noise) while preserving the components that significantly impact our ML model.
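For example, here is a minimal sketch of what that reduction looks like with scikit-learn; the random data and the choice of 2 components are purely illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples described by 10 features (dimensions)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))

# Keep only the 2 directions that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of the variance each component keeps
```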

Here’s a video link to explain the math behind PCA:

The video is too long to cover in class, and we don’t need to calculate the values by hand to make PCA work.

#### How does PCA work in a nutshell?

PCA works by finding the directions (principal components) that contribute most to the variability of the data set, and scoring each one with an explained variance ratio. We do this by:

- Finding the center of the data set by taking the mean of each dimension (feature).
- Finding the best-fitting line through that center.
  - The direction of this best-fitting line is an eigenvector, and the share of the total variance it captures is that component’s explained variance ratio. We start off with PC1.
- Creating the next axis by drawing a line perpendicular to PC1; its direction is the eigenvector of PC2.
  - The sum of squared distances between the projected points and the origin measures the variation captured by each component, which is how we compare PC1 and PC2.
- There is no limit to the number of principal components we can keep, as long as it is no more than the number of dimensions you fit into the model.
  - However, you can’t plot effectively beyond 3 dimensions.
- Always scale the data before applying PCA (see the sketch after this list).
  - Because PCA is built on sums of squared distances, features measured in large units can be artificially inflated when we do not account for their units of measurement.
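Here is a rough sketch of that scaling workflow using scikit-learn’s `StandardScaler`; the two-feature data set is made up purely to show how mismatched units would otherwise dominate:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical features on very different scales: dollars vs. a 0-1 ratio
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50_000, 10_000, size=200),   # e.g. annual income in dollars
    rng.normal(0.5, 0.1, size=200),         # e.g. a ratio between 0 and 1
])

# Standardize each feature to mean 0 and standard deviation 1 before PCA
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
pca.fit(X_scaled)
print(pca.explained_variance_ratio_)   # roughly balanced; unscaled, the dollar column would dominate
```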

#### How do we choose the number of components for PCA?

- Doing additional analysis to determine which factors contribute most to the output, then setting the number of components explicitly.
- Passing a target percentage of explained variance, such as `pca = PCA(n_components=0.95)`, where the algorithm will determine the minimum number of components that account for 95 percent of the variability in the data (see the sketch below).
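As a sketch (the latent-factor data below is invented just to show the effect), passing a float between 0 and 1 tells scikit-learn to keep the smallest number of components that reach that share of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 5 latent factors hidden inside 20 observed features
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(300, 20))

# Keep just enough components to explain 95 percent of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                     # number of components actually kept (likely 5 here)
print(pca.explained_variance_ratio_.sum())   # at least 0.95
```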

## Students Do: Segmenting with PCA

Let’s open up the file(s) in the `03-Stu_Segmenting_with_PCA` folder to get started.

## Students Do: Energize Your Stock Clustering

Let’s open up the file(s) in the `04-Stu-Energize_Your_Stock_Clustering` folder to get started.