Principal Component Analysis


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
In [2]:
# Read the csv data as a DataFrame
df = pd.read_csv('drybean.csv')
In [3]:
print(df.shape)
df.head()
(13611, 17)
Out[3]:
Area Perimeter MajorAxisLength MinorAxisLength AspectRation Eccentricity ConvexArea EquivDiameter Extent Solidity roundness Compactness ShapeFactor1 ShapeFactor2 ShapeFactor3 ShapeFactor4 Class
0 28395 610.291 208.178117 173.888747 1.197191 0.549812 28715 190.141097 0.763923 0.988856 0.958027 0.913358 0.007332 0.003147 0.834222 0.998724 SEKER
1 28734 638.018 200.524796 182.734419 1.097356 0.411785 29172 191.272751 0.783968 0.984986 0.887034 0.953861 0.006979 0.003564 0.909851 0.998430 SEKER
2 29380 624.110 212.826130 175.931143 1.209713 0.562727 29690 193.410904 0.778113 0.989559 0.947849 0.908774 0.007244 0.003048 0.825871 0.999066 SEKER
3 30008 645.884 210.557999 182.516516 1.153638 0.498616 30724 195.467062 0.782681 0.976696 0.903936 0.928329 0.007017 0.003215 0.861794 0.994199 SEKER
4 30140 620.134 201.847882 190.279279 1.060798 0.333680 30417 195.896503 0.773098 0.990893 0.984877 0.970516 0.006697 0.003665 0.941900 0.999166 SEKER

Note: For this analysis we separate the numerical feature columns from the Class labels, so the feature matrix contains only numerical data. This makes it easier to see what is happening in our dataset.

In [50]:
# Extract the numerical columns
X = df.iloc[:,:-1]
y = df['Class']

Using NumPy

First, we generate a correlation matrix of the features using .corr().

In [51]:
# Use the `.corr()` method on the feature matrix `X` to get the correlation matrix
X_corr = X.corr()

Visualise feature correlations

In [6]:
# Heatmap of the feature correlation matrix
plt.figure(figsize=(15,12))
# red_blue = sns.diverging_palette(220, 20, as_cmap=True)  # optional diverging palette
sns.heatmap(X_corr, annot=True)
plt.show()

Next, we use np.linalg.eig() to perform eigendecomposition on the correlation matrix. This gives us two outputs: the eigenvalues and the eigenvectors.

In [52]:
# Eigendecomposition of the correlation matrix
eigenvalues, eigenvectors = np.linalg.eig(X_corr)
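
One caveat: np.linalg.eig does not guarantee that the eigenvalues come back in descending order (in fact, the two smallest proportions in the output below are not in descending order). A minimal, optional sketch of an explicit sort before plotting; np.linalg.eigh, which is designed for symmetric matrices such as a correlation matrix, would be another option:

In [ ]:
# Sort the eigenvalues (and matching eigenvectors) in descending order so the
# scree plot always reads from largest to smallest component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]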

After performing PCA, we generally want to know how useful the new features are. One way to visualize this is to create a scree plot, which shows the proportion of information described by each principal component.

In [65]:
# information proportion of each eigenvalue compared to the sum of all eigenvalues
info_prop = eigenvalues / eigenvalues.sum()
info_prop
Out[65]:
array([5.54664386e-01, 2.64309732e-01, 8.00656422e-02, 5.11408029e-02,
       2.73929290e-02, 1.14976093e-02, 6.97650724e-03, 3.25082500e-03,
       5.16266295e-04, 9.08681206e-05, 6.58867938e-05, 1.83739336e-05,
       9.29966038e-06, 6.25641678e-07, 1.11549485e-07, 1.34132086e-07])
In [56]:
## Plot of the principal axes vs the information proportions for each principal axis
plt.plot(np.arange(1,len(info_prop)+1),info_prop, 'bo-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1,len(info_prop)+1))
plt.ylabel('Percent of Information Explained')
Out[56]:
Text(0, 0.5, 'Percent of Information Explained')

From this plot, we see that the first principal component explains about 55% of the variation in the data, the second explains about 26%, and so on.

Another way to view this is to see how many principal axes it takes to reach around 95% (our threshold) of the total amount of information.

In [62]:
# Cumulative sum of info_prop
cum_info_prop = np.cumsum(info_prop)
cum_info_prop
Out[62]:
array([0.55466439, 0.81897412, 0.89903976, 0.95018056, 0.97757349,
       0.9890711 , 0.99604761, 0.99929843, 0.9998147 , 0.99990557,
       0.99997146, 0.99998983, 0.99999913, 0.99999975, 0.99999987,
       1.        ])
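
To turn the 95% threshold into a component count programmatically, a short sketch (the variable names here are just illustrative):

In [ ]:
# Smallest number of principal components whose cumulative information
# proportion reaches the 95% threshold (argmax returns the first True).
threshold = 0.95
n_components_95 = np.argmax(cum_info_prop >= threshold) + 1
print(n_components_95)  # 4 for this dataset
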
In [64]:
# Plot the cumulative proportions array
plt.plot(np.arange(1,len(cum_info_prop)+1), cum_info_prop, 'bo-', linewidth=2)
plt.hlines(y=cum_info_prop[3], xmin=1, xmax=16)
plt.vlines(x=4, ymin=0, ymax=1)
plt.title('Cumulative Information percentages')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1,len(info_prop)+1))
plt.ylabel('Cumulative Proportion of Variance Explained')
Out[64]:
Text(0, 0.5, 'Cumulative Proportion of Variance Explained')

From this plot, we see that four principal axes account for about 95% of the variation in the data.

Using scikit-learn

In [67]:
print(X.shape)
X.head(3)
(13611, 16)
Out[67]:
Area Perimeter MajorAxisLength MinorAxisLength AspectRation Eccentricity ConvexArea EquivDiameter Extent Solidity roundness Compactness ShapeFactor1 ShapeFactor2 ShapeFactor3 ShapeFactor4
0 28395 610.291 208.178117 173.888747 1.197191 0.549812 28715 190.141097 0.763923 0.988856 0.958027 0.913358 0.007332 0.003147 0.834222 0.998724
1 28734 638.018 200.524796 182.734419 1.097356 0.411785 29172 191.272751 0.783968 0.984986 0.887034 0.953861 0.006979 0.003564 0.909851 0.998430
2 29380 624.110 212.826130 175.931143 1.209713 0.562727 29690 193.410904 0.778113 0.989559 0.947849 0.908774 0.007244 0.003048 0.825871 0.999066

Standardize our data

In [69]:
# Standardize each feature to zero mean and unit variance
mean = X.mean(axis=0)
sttd = X.std(axis=0)
X_standardized = (X - mean) / sttd
X_standardized.head(3)
Out[69]:
Area Perimeter MajorAxisLength MinorAxisLength AspectRation Eccentricity ConvexArea EquivDiameter Extent Solidity roundness Compactness ShapeFactor1 ShapeFactor2 ShapeFactor3 ShapeFactor4
0 -0.840718 -1.143277 -1.306550 -0.631130 -1.564995 -2.185640 -0.841420 -1.063302 0.289077 0.367600 1.423815 1.839049 0.680761 2.402084 1.925653 0.838340
1 -0.829157 -1.013887 -1.395860 -0.434429 -1.969712 -3.685904 -0.826071 -1.044178 0.697451 -0.462889 0.231046 2.495358 0.367953 3.100780 2.689603 0.771110
2 -0.807128 -1.078789 -1.252311 -0.585713 -1.514236 -2.045261 -0.808674 -1.008047 0.578174 0.518398 1.252819 1.764778 0.603107 2.235009 1.841288 0.916721
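
For reference, the same standardization could also be done with scikit-learn's StandardScaler. A minimal sketch; note that StandardScaler divides by the population standard deviation (ddof=0) while the pandas .std() call above uses the sample standard deviation (ddof=1), so the two results differ by a tiny constant factor:

In [ ]:
from sklearn.preprocessing import StandardScaler

# Standardize with scikit-learn instead of doing it by hand.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_scaled.head(3)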

Create a PCA object

In [70]:
pca = PCA()

Retrieve eigenvectors

In [71]:
# We can access the eigenvectors through the components_ attribute.
# Note: we only fit here; once we settle on how many principal components to keep, we'll use the fit_transform method instead.
components = pca.fit(X_standardized).components_
In [73]:
# Convert to a DataFrame without transposing: each row is a principal component
components_noTrans = pd.DataFrame(components, columns = X.columns)
components_noTrans.head(3)
Out[73]:
Area Perimeter MajorAxisLength MinorAxisLength AspectRation Eccentricity ConvexArea EquivDiameter Extent Solidity roundness Compactness ShapeFactor1 ShapeFactor2 ShapeFactor3 ShapeFactor4
0 0.282458 0.310891 0.325824 0.236199 0.229298 0.231526 0.283200 0.297484 -0.059808 -0.143016 -0.248165 -0.238378 -0.221319 -0.314625 -0.238983 -0.198009
1 0.245882 0.179303 0.100757 0.343461 -0.330844 -0.319434 0.244630 0.222802 0.220619 0.103322 0.214805 0.328914 -0.332549 0.129419 0.327522 0.100061
2 -0.061447 -0.018853 -0.084692 0.007500 -0.169058 -0.163042 -0.053649 -0.049914 -0.085258 -0.738670 -0.163325 0.149701 -0.032623 0.120077 0.149570 -0.536903
In [74]:
# Convert to a DataFrame, then transpose so each column is a principal component
components = pd.DataFrame(components).transpose()

# Attach the original feature column names as the index for easier viewing
components.index = X.columns
components.head(3)
Out[74]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Area 0.282458 0.245882 -0.061447 -0.031546 -0.091326 0.366390 0.125045 0.071748 0.035067 0.390420 -0.177686 0.054484 -0.046295 -0.655728 0.133190 0.231436
Perimeter 0.310891 0.179303 -0.018853 -0.042468 0.081820 0.010251 0.081530 0.031730 -0.157501 -0.344383 0.199454 -0.750550 -0.317920 -0.081390 0.012658 0.014614
MajorAxisLength 0.325824 0.100757 -0.084692 -0.006793 -0.044216 0.014909 0.118163 -0.200947 -0.352366 -0.101996 0.173640 0.027355 0.685302 0.186251 0.174432 0.346019
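
With the feature names on the index, this table reads as a loadings matrix: the entry in row f, column k is the weight of feature f in the (k+1)-th principal component. As an illustrative follow-up (not in the original notebook), the features contributing most to the first component could be ranked like this:

In [ ]:
# Rank features by the absolute size of their loading on the first principal component.
components[0].abs().sort_values(ascending=False).head()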

Retrieve the proportional size of each eigenvalue, i.e. the explained variance (information) ratios

In [79]:
# use the explained_variance_ratio_ attribute
var_ratio = pca.explained_variance_ratio_
var_ratio
Out[79]:
array([5.54664386e-01, 2.64309732e-01, 8.00656422e-02, 5.11408029e-02,
       2.73929290e-02, 1.14976093e-02, 6.97650724e-03, 3.25082500e-03,
       5.16266295e-04, 9.08681206e-05, 6.58867938e-05, 1.83739336e-05,
       9.29966038e-06, 6.25641678e-07, 1.34132086e-07, 1.11549485e-07])
In [76]:
plt.plot(np.arange(1,len(var_ratio)+1),var_ratio, 'bo-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1,len(var_ratio)+1))
plt.ylabel('Percent of Information Explained')
Out[76]:
Text(0, 0.5, 'Percent of Information Explained')
In [78]:
cummu_var_ratio = np.cumsum(var_ratio)
cummu_var_ratio
Out[78]:
array([0.55466439, 0.81897412, 0.89903976, 0.95018056, 0.97757349,
       0.9890711 , 0.99604761, 0.99929843, 0.9998147 , 0.99990557,
       0.99997146, 0.99998983, 0.99999913, 0.99999975, 0.99999989,
       1.        ])
In [81]:
# Plot the cumulative proportions array
plt.plot(np.arange(1,len(cummu_var_ratio)+1), cummu_var_ratio, 'bo-', linewidth=2)
plt.hlines(y=cummu_var_ratio[3], xmin=1, xmax=16)
plt.vlines(x=4, ymin=0, ymax=1)
plt.title('Cumulative Information percentages')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1,len(cummu_var_ratio)+1))
plt.ylabel('Cumulative Proportion of Variance Explained')
Out[81]:
Text(0, 0.5, 'Cumulative Proportion of Variance Explained')
In [82]:
# Convert to a DataFrame, then transpose so the ratios appear as a single row
var_ratio = pd.DataFrame(var_ratio).transpose()
var_ratio
Out[82]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0.554664 0.26431 0.080066 0.051141 0.027393 0.011498 0.006977 0.003251 0.000516 0.000091 0.000066 0.000018 0.000009 6.256417e-07 1.341321e-07 1.115495e-07

Let’s only keep the first four principal components because they account for 95% of the information in the data!

In [87]:
# only keep 4 PCs
pca = PCA(n_components = 4)
# fit and transform the data using the first 4 PCs
data_pcomp = pca.fit_transform(X_standardized)
# transform into dataframe
data_pcomp = pd.DataFrame(data_pcomp)
# rename columns
data_pcomp.columns = ['PC1', 'PC2', 'PC3', 'PC4']
print(data_pcomp)
            PC1       PC2       PC3       PC4
0     -4.981378  1.824630  0.748993 -0.390797
1     -5.436593  2.932257  2.182294 -0.431944
2     -4.757913  1.826817  0.514019 -0.125849
3     -4.300383  2.003587  3.554316  0.082961
4     -6.349107  4.088055  1.179156 -0.830327
...         ...       ...       ...       ...
13606 -1.125574 -0.441063 -0.875477 -0.719253
13607 -1.604952  0.495979 -0.840527  0.797404
13608 -1.417463  0.141189 -0.387192 -0.486421
13609 -1.114625 -0.212672  0.144083 -0.841872
13610 -0.766409 -0.646490 -0.994085  0.814649

[13611 rows x 4 columns]
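
As a rough sanity check (a sketch, not part of the original run), the standardized data can be reconstructed from just these four components with inverse_transform; the fraction of total variance that survives the round trip should land near the ~95% cumulative ratio computed above:

In [ ]:
# Reconstruct the standardized features from the 4 retained components and
# measure how much of the total variance the reconstruction preserves.
X_approx = pca.inverse_transform(data_pcomp[['PC1', 'PC2', 'PC3', 'PC4']].values)
residual = X_standardized.values - X_approx
recovered = 1 - residual.var(axis=0).sum() / X_standardized.values.var(axis=0).sum()
print(recovered)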
In [88]:
## Plot the first two principal components colored by the bean classes
data_pcomp['bean_classes'] = y

# Note: lmplot creates its own figure, so a separate plt.figure() call beforehand
# would only produce an empty extra figure.
sns.lmplot(x='PC1', y='PC2', data=data_pcomp, hue='bean_classes', fit_reg=False)
Out[88]:
<seaborn.axisgrid.FacetGrid at 0x1c5622dbb50>