PCA¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
# Read the csv data as a DataFrame
df = pd.read_csv('drybean.csv')
print(df.shape)
df.head()
Note: just for this analysis, we drop some rows and columns to reduce the amount of data. This makes it easier to see what is happening in our dataset.¶
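A minimal sketch of what such a reduction could look like (the row count and column names below are placeholders, not necessarily what was used here):
# Hypothetical reduction: subsample rows and drop a couple of columns
df_small = df.sample(n=5000, random_state=1).drop(columns=['Extent', 'Solidity'])
print(df_small.shape)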
# Separate the numerical feature columns from the class labels
X = df.iloc[:,:-1]
y = df['Class']
Using NumPy¶
First, we generate a correlation matrix using .corr().
# Use the `.corr()` method on `X` to get the correlation matrix
X_corr = X.corr()
Visualise the feature correlations
# Heatmap of the feature correlations
plt.figure(figsize=(15, 12))
sns.heatmap(X_corr, annot=True)
plt.show()
Next, we use np.linalg.eig() to perform eigendecomposition on the correlation matrix. This gives us two outputs: the eigenvalues and the eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(X_corr)
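np.linalg.eig returns the eigenvalues in no guaranteed order, so before reading off a scree plot it is safest to sort them (and their eigenvectors) in descending order:
# Sort eigenvalues from largest to smallest, keeping eigenvectors aligned
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]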
After performing PCA, we generally want to know how useful the new features are. One way to visualize this is to create a scree plot, which shows the proportion of information described by each principal component.
# information proportion of each eigenvalue compared to the sum of all eigenvalues
info_prop = eigenvalues / eigenvalues.sum()
info_prop
# Plot the principal axes vs. the information proportion of each axis
plt.plot(np.arange(1, len(info_prop) + 1), info_prop, 'bo-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1, len(info_prop) + 1))
plt.ylabel('Proportion of Information Explained')
From this plot, we see that the first principal component explains about 55% of the variation in the data, the second explains about 26%, and so on.
Another way to view this is to see how many principal axes it takes to reach around 95% (our threshold) of the total amount of information.
# Cumulative sum of info_prop
cum_info_prop = np.cumsum(info_prop)
cum_info_prop
# Plot the cumulative proportions array
plt.plot(np.arange(1, len(cum_info_prop) + 1), cum_info_prop, 'bo-', linewidth=2)
plt.hlines(y=cum_info_prop[3], xmin=1, xmax=len(cum_info_prop))
plt.vlines(x=4, ymin=0, ymax=1)
plt.title('Cumulative Information Proportions')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1, len(cum_info_prop) + 1))
plt.ylabel('Cumulative Proportion of Variance Explained')
From this plot, we see that four principal axes account for about 95% of the variation in the data.
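To finish the NumPy version, we can project the data onto the first four eigenvectors. A minimal sketch, assuming the eigenvalues and eigenvectors were sorted in descending order as above; the data must be standardized first, since we decomposed the correlation matrix:
# Standardize the features, then project them onto the top 4 eigenvectors
X_std_np = (X - X.mean(axis=0)) / X.std(axis=0)
X_proj = X_std_np.to_numpy() @ eigenvectors[:, :4]
print(X_proj.shape)  # (number of samples, 4)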
Using scikit-learn¶
print(X.shape)
X.head(3)
Standardize our data¶
PCA is sensitive to the scale of the features, and sklearn's PCA centers the data but does not rescale it, so we first standardize each column to mean 0 and standard deviation 1.
mean = X.mean(axis=0)
sttd = X.std(axis=0)
X_standardized = (X - mean) / sttd
X_standardized.head(3)
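Equivalently, we could use scikit-learn's StandardScaler; note that it divides by the population standard deviation (ddof=0), so its output differs very slightly from the pandas version above (which defaults to ddof=1):
from sklearn.preprocessing import StandardScaler

# StandardScaler centers each column and divides by its (population) standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)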
PCA object¶
pca = PCA()
Retrieve eigenvectors¶
# we can access the eigenvectors via the components_ attribute
# note: we're only fitting here; once we've settled on the number of PCs to keep, we'll use fit_transform
components = pca.fit(X_standardized).components_
# convert to a dataframe without transposing:
# each row is a principal component, each column an original feature
components_noTrans = pd.DataFrame(components, columns=X.columns)
components_noTrans.head(3)
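Each row of components_ is a unit-length eigenvector and the rows are mutually orthogonal, which we can sanity-check:
# The eigenvectors form an orthonormal set, so components @ components.T
# should be (numerically) the identity matrix
print(np.allclose(components @ components.T, np.eye(components.shape[0])))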
# convert to a dataframe, then transpose so each column is a principal component
components = pd.DataFrame(components).transpose()
# attach the original column names as the index for easier viewing
components.index = X.columns
components.head(3)
Retrieve the proportional size of each eigenvalue, i.e. the variance (information) ratios¶
# using the explained_variance_ratio_ attribute
var_ratio = pca.explained_variance_ratio_
var_ratio
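As a sanity check, these ratios should sum to 1 and, up to floating-point error, match the eigenvalue proportions from the NumPy section, since PCA on standardized data amounts to eigendecomposition of the correlation matrix:
# var_ratio sums to 1 and should agree with info_prop computed earlier
print(var_ratio.sum())
print(np.allclose(np.sort(var_ratio)[::-1], np.sort(info_prop)[::-1]))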
plt.plot(np.arange(1, len(var_ratio) + 1), var_ratio, 'bo-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1, len(var_ratio) + 1))
plt.ylabel('Proportion of Information Explained')
cum_var_ratio = np.cumsum(var_ratio)
cum_var_ratio
# Plot the cumulative proportions array
plt.plot(np.arange(1, len(cum_var_ratio) + 1), cum_var_ratio, 'bo-', linewidth=2)
plt.hlines(y=cum_var_ratio[3], xmin=1, xmax=len(cum_var_ratio))
plt.vlines(x=4, ymin=0, ymax=1)
plt.title('Cumulative Information Proportions')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1, len(cum_var_ratio) + 1))
plt.ylabel('Cumulative Proportion of Variance Explained')
# convert to a dataframe, then transpose so each component's ratio is in its own column
var_ratio = pd.DataFrame(var_ratio).transpose()
var_ratio
Let’s only keep the first four principal components because they account for 95% of the information in the data!¶
# only keep 4 PCs
pca = PCA(n_components = 4)
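As an aside, scikit-learn can pick the number of components for us: when n_components is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction:
# Alternative: keep enough components to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95)
pca_95.fit(X_standardized)
print(pca_95.n_components_)  # likely 4 or 5 here, depending on where 95% falls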
# fit and transform the data using the first 4 PCs
data_pcomp = pca.fit_transform(X_standardized)
# transform into dataframe
data_pcomp = pd.DataFrame(data_pcomp)
# rename columns
data_pcomp.columns = ['PC1', 'PC2', 'PC3', 'PC4']
print(data_pcomp)
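Since we kept about 95% of the information, we can gauge what was lost by mapping the reduced data back to feature space with inverse_transform; the reconstruction error should be small:
# Reconstruct the standardized features from the 4 principal components
X_reconstructed = pca.inverse_transform(data_pcomp[['PC1', 'PC2', 'PC3', 'PC4']].to_numpy())
# mean squared reconstruction error: roughly the ~5% of variance we discarded
mse = ((X_standardized.to_numpy() - X_reconstructed) ** 2).mean()
print(mse)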
# Plot the first two principal components, colored by the bean classes
data_pcomp['bean_classes'] = y
# lmplot is a figure-level function and creates its own figure, so we size it with height/aspect
sns.lmplot(x='PC1', y='PC2', data=data_pcomp, hue='bean_classes', fit_reg=False, height=8, aspect=1.2)
plt.show()