A Simple Look at a Simplifier
By the time you begin learning about Principal Component Analysis (PCA), you will have had some practice organizing data and, in particular, handling data in two-dimensional tables. For example, you have likely already worked on a housing price predictor project where, among other things, you needed to find the price per square foot, that is, x dollars divided by y square feet. This value, however, was only one of many factors that went into your price prediction. While this project may not have had many features, it likely still demonstrated how quickly things can become complicated. You may even have already encountered the curse of dimensionality.
What is the Curse of Dimensionality and How Can PCA Help?
The curse of dimensionality can refer to many things, but it generally describes the problems that arise when analyzing and organizing data in high-dimensional spaces: as dimensions are added, errors become more likely and algorithms become harder to design. In other words, the more features you have, the more problems you are likely to need to account for. The problem of reducing the dimensionality of a dataset in a meaningful way shows up all over modern data analysis. This is exactly the problem PCA can help address.
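One concrete symptom can be sketched in a few lines. The example below uses uniformly random points (an assumption made purely for illustration, not the data used later in this post) and measures how different the nearest and farthest neighbors of a point are. As the number of dimensions grows, that difference shrinks, which is one reason distance-based methods struggle in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_dims, n_points=500):
    """Return (max - min) / min of the distances from one random point to the rest."""
    points = rng.random((n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Compare the contrast between nearest and farthest neighbors as dimensions grow.
contrasts = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dimensions: relative contrast = {c:.2f}")
```

The contrast is large in two dimensions and close to zero in a thousand, meaning every point starts to look roughly equally far from every other point.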
Principal component analysis is a standard way to reduce the dimensions of a dataset to something more manageable. The goal of PCA is to identify patterns, specifically correlations between variables, and then distill the variables down to their most important features so that the data is simplified without losing important traits. When reducing the number of variables in a dataset you will naturally be trading some accuracy for simplicity. Smaller datasets are easier to explore and visualize, and they are faster for machine learning algorithms to process. Additionally, algorithms can be less prone to overfitting when the underlying data has first been compressed, since compression tends to reduce noise and other anomalies. This may still seem a bit abstract, so let's take a look at an example of PCA in action.
First, you should determine if PCA is the right method for you. Do you want to reduce the number of variables, but can’t completely remove any from consideration? Do you want to ensure your variables are independent of one another (strictly speaking, uncorrelated)? Are you comfortable losing some interpretability of your independent variables? If you answered “yes” to all three questions, then PCA is for you.
For this example, I will be using a dataset about life expectancy, with data collected by the WHO, which can be found here. I will start by importing the necessary packages, preparing my data, and determining which features I will include in my PCA:
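A minimal sketch of this preparation step might look like the following. It assumes a small synthetic table as a stand-in for the WHO data, so the column names and values are hypothetical. The idea it shows is general: PCA needs complete, numeric input, so I keep the numeric columns and drop rows with missing values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the life expectancy table. The column names and
# values are made up for illustration and are NOT the WHO data.
rng = np.random.default_rng(42)
n_rows = 200
schooling = rng.normal(12, 3, n_rows)
df = pd.DataFrame({
    "country": [f"country_{i}" for i in range(n_rows)],
    "schooling": schooling,
    "life_expectancy": 50 + 2 * schooling + rng.normal(0, 3, n_rows),
    "adult_mortality": 400 - 15 * schooling + rng.normal(0, 30, n_rows),
    "gdp": rng.lognormal(8, 1, n_rows),
})
df.loc[::25, "gdp"] = np.nan  # simulate a few missing values

# Keep only numeric columns, then drop any row with a missing value.
features = df.select_dtypes(include="number").dropna()
print(features.shape)
print(features.columns.tolist())
```

Dropping incomplete rows is only one option; imputing missing values is another, and which is appropriate depends on how much data is missing and why.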
Next, I will need to standardize my data. This step transforms the variables so that each one contributes equally to the analysis, typically by rescaling each to a mean of 0 and a standard deviation of 1. This is critical because PCA is sensitive to the variances of the initial variables. If there are large differences between the ranges of the initial variables, those with larger ranges will dominate those with smaller ranges, which will lead to biased results. Having scaled the data, I can then fit the PCA algorithm to it:
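As a hedged sketch of this step, here is one way to do it with scikit-learn's `StandardScaler` and `PCA`. The data is synthetic (three made-up features on very different scales), and fitting a second PCA on the unscaled data is included only to show why scaling matters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(70, 8, 300),       # e.g. a value measured in years
    rng.normal(5000, 2000, 300),  # e.g. a value measured in dollars
    rng.normal(0.5, 0.1, 300),    # e.g. a ratio
])

# Rescale every feature to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

# With n_components left unset, PCA keeps min(n_samples, n_features) components.
pca = PCA()
scores = pca.fit_transform(X_scaled)

# For contrast: on the unscaled data the large-range feature dominates.
pca_raw = PCA().fit(X)
print("scaled:  ", pca.explained_variance_ratio_.round(3))
print("unscaled:", pca_raw.explained_variance_ratio_.round(3))
```

On the unscaled data, nearly all of the variance is attributed to the first component simply because one feature is measured in much larger units.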
Note that the number of principal components corresponds to the number of features in my dataset. The next step will be to compute my eigenvectors and eigenvalues, the linear algebra concepts needed to determine the principal components of the data. Each eigenvector is a unit-length vector that gives the direction of a principal component in the p-dimensional feature space. Each eigenvalue is the magnitude of the variation along that component. The principal components themselves are new variables constructed as linear combinations, or mixtures, of the initial variables. Using this data I will create a scree plot that shows how much variance is explained by each principal component:
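The computation behind a scree plot can be sketched with NumPy and Matplotlib. This is an illustration on synthetic data (six made-up features driven by two hidden factors), not the plot from the life expectancy dataset, and it works from the covariance matrix of the standardized data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data: 6 features driven by 2 hidden factors plus noise.
rng = np.random.default_rng(1)
hidden = rng.normal(size=(500, 2))
X = hidden @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(500, 6))

# Standardize, then eigen-decompose the covariance matrix.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order

# Sort from largest to smallest eigenvalue; eigenvector columns follow along.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance explained by each component, and the running total.
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

# Scree plot: bars for each component, line for the cumulative share.
components = np.arange(1, len(explained) + 1)
fig, ax = plt.subplots()
ax.bar(components, explained, label="per component")
ax.plot(components, cumulative, marker="o", color="black", label="cumulative")
ax.set_xlabel("Principal component")
ax.set_ylabel("Share of variance explained")
ax.legend()
fig.savefig("scree_plot.png")
```

Reading the cumulative line tells you how many components are needed to reach a chosen share of the variance, such as 90%.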
From this plot, I can see that around 90% of this dataset's variability can be explained by ten principal components. Not only does this already cut my number of variables in half, but it also suggests that my dataset is rather noisy. Nevertheless, I can dive a little deeper by looking at the loadings of each principal component. Loadings are the covariances (or, for standardized data, the correlations) between the original variables and the unit-scaled components. Within each component, the sum of the squared loadings equals the eigenvalue, also known as the component's variance. Equivalently, loadings are the coefficients of the linear combination that describes a variable in terms of the standardized components. For this example, I will look at the loadings for the first principal component:
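One common way to obtain loadings from a fitted scikit-learn `PCA` is to scale each eigenvector by the square root of its eigenvalue. The sketch below assumes synthetic data with hypothetical column names and relationships, chosen only so the signs of the loadings are easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; the names and relationships are illustrative only.
rng = np.random.default_rng(7)
n = 400
schooling = rng.normal(12, 3, n)
features = pd.DataFrame({
    "schooling": schooling,
    "life_expectancy": 50 + 2 * schooling + rng.normal(0, 3, n),
    "thinness_children": 12 - 0.6 * schooling + rng.normal(0, 1.5, n),
    "unrelated_noise": rng.normal(0, 1, n),
})

pca = PCA().fit(StandardScaler().fit_transform(features))

# components_ holds unit-length eigenvectors, one row per component.
# Scaling each by the square root of its eigenvalue gives the loadings.
loadings = pd.DataFrame(
    pca.components_.T * np.sqrt(pca.explained_variance_),
    index=features.columns,
    columns=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings["PC1"].sort_values())
```

Variables whose loadings on a component share a sign move together along it, while opposite signs indicate an inverse relationship. The overall sign of a component is arbitrary, so only the relative signs carry meaning.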
From this, I can tell several things, such as that schooling and life expectancy are highly correlated, and that thinness in children appears to be inversely correlated with life expectancy. Looking at each principal component will likely help me derive even more information, since we know that each PC is uncorrelated with the ones before it. By analyzing the loadings I can begin to derive information from my data, information that can then be used to direct the next steps of this or future studies. This is what principal component analysis is all about: reducing a high-dimensional dataset to its principal components and then, as the name implies, analyzing them. This process can seem scary at first, but I hope that by breaking it down to what it all means, you too can see this as one way to make complex data a bit more manageable.
Video Explanation of PCA from Statsquest
A tutorial on Principal Components Analysis from Lindsay I. Smith at the University of Otago
Here is another great Tutorial on Principal Component Analysis from Jon Shlens at UCSD
Everything you did and didn’t know about PCA, from the blog It's Neuronal.
Principal Component Analysis from Jeremy Kun’s blog
A One-Stop Shop for Principal Component Analysis from Matt Brems.