PCA¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
# Read the csv data as a DataFrame
df = pd.read_csv('drybean.csv')
print(df.shape)
df.head()
Note: just for this analysis, we drop some rows and columns to reduce the amount of data. This makes it easier to see what is happening in our dataset.¶
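A minimal sketch of what such a reduction could look like (the row count and column names below are placeholders, not necessarily what was used here):
# Hypothetical reduction: subsample rows and drop a couple of columns
df_small = df.sample(n=5000, random_state=1).drop(columns=['Extent', 'Solidity'])
print(df_small.shape)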
# Separate the numerical feature columns from the class labels
X = df.iloc[:,:-1]
y = df['Class']
Using NumPy¶
First, we generate a correlation matrix using .corr().
# Use the `.corr()` method on `X` to get the correlation matrix
X_corr = X.corr()
Visualise the feature correlations
# Heatmap of the feature correlations
plt.figure(figsize=(15, 12))
sns.heatmap(X_corr, annot=True)
plt.show()
Next, we use np.linalg.eig() to perform eigendecomposition on the correlation matrix. This gives us two outputs: the eigenvalues and the eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(X_corr)
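np.linalg.eig returns the eigenvalues in no guaranteed order, so before reading off a scree plot it is safest to sort them (and their eigenvectors) in descending order:
# Sort eigenvalues from largest to smallest, keeping eigenvectors aligned
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]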
After performing PCA, we generally want to know how useful the new features are. One way to visualize this is to create a scree plot, which shows the proportion of information described by each principal component.
# information proportion of each eigenvalue compared to the sum of all eigenvalues
info_prop = eigenvalues / eigenvalues.sum()
info_prop
# Plot the principal axes vs. the information proportion of each axis
plt.plot(np.arange(1, len(info_prop) + 1), info_prop, 'bo-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1, len(info_prop) + 1))
plt.ylabel('Proportion of Information Explained')
From this plot, we see that the first principal component explains about 55% of the variation in the data, the second explains about 26%, and so on.
Another way to view this is to see how many principal axes it takes to reach around 95% (our threshold) of the total amount of information.
# Cumulative sum of info_prop
cum_info_prop = np.cumsum(info_prop)
cum_info_prop
# Plot the cumulative proportions array
plt.plot(np.arange(1, len(cum_info_prop) + 1), cum_info_prop, 'bo-', linewidth=2)
plt.hlines(y=cum_info_prop[3], xmin=1, xmax=len(cum_info_prop))
plt.vlines(x=4, ymin=0, ymax=1)
plt.title('Cumulative Information Proportions')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1, len(cum_info_prop) + 1))
plt.ylabel('Cumulative Proportion of Variance Explained')
From this plot, we see that four principal axes account for about 95% of the variation in the data.
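To finish the NumPy version, we can project the data onto the first four eigenvectors. A minimal sketch, assuming the eigenvalues and eigenvectors were sorted in descending order as above; the data must be standardized first, since we decomposed the correlation matrix:
# Standardize the features, then project them onto the top 4 eigenvectors
X_std_np = (X - X.mean(axis=0)) / X.std(axis=0)
X_proj = X_std_np.to_numpy() @ eigenvectors[:, :4]
print(X_proj.shape)  # (number of samples, 4)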
Using scikit-learn¶
print(X.shape)
X.head(3)
Standardize our data¶
PCA is sensitive to the scale of the features, and sklearn's PCA centers the data but does not rescale it, so we first standardize each column to mean 0 and standard deviation 1.
mean = X.mean(axis=0)
sttd = X.std(axis=0)
X_standardized = (X - mean) / sttd
X_standardized.head(3)
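Equivalently, we could use scikit-learn's StandardScaler; note that it divides by the population standard deviation (ddof=0), so its output differs very slightly from the pandas version above (which defaults to ddof=1):
from sklearn.preprocessing import StandardScaler

# StandardScaler centers each column and divides by its (population) standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)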
PCA object¶
pca = PCA()
Retrieve eigenvectors¶
# we can access the eigenvectors via the components_ attribute
# note: we're only fitting here; once we've settled on the number of PCs to keep, we'll use fit_transform
components = pca.fit(X_standardized).components_
# convert to a dataframe without transposing:
# each row is a principal component, each column an original feature
components_noTrans = pd.DataFrame(components, columns=X.columns)
components_noTrans.head(3)
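Each row of components_ is a unit-length eigenvector and the rows are mutually orthogonal, which we can sanity-check:
# The eigenvectors form an orthonormal set, so components @ components.T
# should be (numerically) the identity matrix
print(np.allclose(components @ components.T, np.eye(components.shape[0])))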
# convert to a dataframe, then transpose so each column is a principal component
components = pd.DataFrame(components).transpose()
# attach the original column names as the index for easier viewing
components.index = X.columns
components.head(3)
Retrieve the proportional size of each eigenvalue, i.e. the variance (information) ratios¶
# using the explained_variance_ratio_ attribute
var_ratio = pca.explained_variance_ratio_
var_ratio
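As a sanity check, these ratios should sum to 1 and, up to floating-point error, match the eigenvalue proportions from the NumPy section, since PCA on standardized data amounts to eigendecomposition of the correlation matrix:
# var_ratio sums to 1 and should agree with info_prop computed earlier
print(var_ratio.sum())
print(np.allclose(np.sort(var_ratio)[::-1], np.sort(info_prop)[::-1]))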
plt.plot(np.arange(1, len(var_ratio) + 1), var_ratio, 'bo-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1, len(var_ratio) + 1))
plt.ylabel('Proportion of Information Explained')
cum_var_ratio = np.cumsum(var_ratio)
cum_var_ratio
# Plot the cumulative proportions array
plt.plot(np.arange(1, len(cum_var_ratio) + 1), cum_var_ratio, 'bo-', linewidth=2)
plt.hlines(y=cum_var_ratio[3], xmin=1, xmax=len(cum_var_ratio))
plt.vlines(x=4, ymin=0, ymax=1)
plt.title('Cumulative Information Proportions')
plt.xlabel('Principal Axes')
plt.xticks(np.arange(1, len(cum_var_ratio) + 1))
plt.ylabel('Cumulative Proportion of Variance Explained')
# convert to a dataframe, then transpose so each component's ratio is in its own column
var_ratio = pd.DataFrame(var_ratio).transpose()
var_ratio
Let’s only keep the first four principal components because they account for 95% of the information in the data!¶
# only keep 4 PCs
pca = PCA(n_components = 4)
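As an aside, scikit-learn can pick the number of components for us: when n_components is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction:
# Alternative: keep enough components to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95)
pca_95.fit(X_standardized)
print(pca_95.n_components_)  # likely 4 or 5 here, depending on where 95% falls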
# fit and transform the data using the first 4 PCs
data_pcomp = pca.fit_transform(X_standardized)
# transform into dataframe
data_pcomp = pd.DataFrame(data_pcomp)
# rename columns
data_pcomp.columns = ['PC1', 'PC2', 'PC3', 'PC4']
print(data_pcomp)
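Since we kept about 95% of the information, we can gauge what was lost by mapping the reduced data back to feature space with inverse_transform; the reconstruction error should be small:
# Reconstruct the standardized features from the 4 principal components
X_reconstructed = pca.inverse_transform(data_pcomp[['PC1', 'PC2', 'PC3', 'PC4']].to_numpy())
# mean squared reconstruction error: roughly the ~5% of variance we discarded
mse = ((X_standardized.to_numpy() - X_reconstructed) ** 2).mean()
print(mse)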
# Plot the first two principal components, colored by the bean classes
data_pcomp['bean_classes'] = y
# lmplot is a figure-level function and creates its own figure, so we size it with height/aspect
sns.lmplot(x='PC1', y='PC2', data=data_pcomp, hue='bean_classes', fit_reg=False, height=8, aspect=1.2)
plt.show()