K-Means¶
For simplicity, I made my own dataset so I could be precise about what the clustering model should find.
Loading Libraries and Dataset¶
In [1]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
In [2]:
dfx = pd.read_csv('workout.csv')
dfx.head()
Out[2]:
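The raw workout.csv file isn't included in this write-up. Purely as an illustration, here is a minimal sketch of how a comparable dataset could be generated; the column names (distance, duration), the three-group structure, and the nine-rows-per-group sizes are my assumptions, inferred from the discussion and plots below.
In [ ]:
# Hypothetical reconstruction -- not the actual workout.csv data
rng = np.random.default_rng(0)
lazy = rng.normal(loc=[2, 40], scale=[0.5, 5], size=(9, 2))    # short distance, long duration
medium = rng.normal(loc=[6, 45], scale=[0.5, 5], size=(9, 2))  # middling pace
fast = rng.normal(loc=[10, 35], scale=[0.5, 5], size=(9, 2))   # long distance, short duration
synthetic = pd.DataFrame(np.vstack([lazy, medium, fast]),
                         columns=['distance', 'duration'])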
Standardizing the Attributes/Clustering variables¶
In [3]:
scaler = StandardScaler()
df = scaler.fit_transform(dfx)
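As a quick sanity check, StandardScaler is just the z-score applied column-wise, using the population standard deviation (ddof=0). A minimal sketch of the equivalent manual computation:
In [ ]:
# equivalent manual standardization: subtract the column mean,
# divide by the population standard deviation
df_manual = (dfx - dfx.mean()) / dfx.std(ddof=0)
np.allclose(df, df_manual)   # True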
The Number of Clusters¶
Using the elbow method to determine the appropriate value of k.
In [4]:
clusters = []
inertias = []

# try k from 1 to 9
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=0)
    km.fit(df)
    # store the fitted model and its inertia
    inertias.append(km.inertia_)
    clusters.append(km)

# x-axis must run over the actual k values, 1 through 9
plt.plot(range(1, 10), inertias, '-o')
plt.xlabel('number of clusters (k)')
plt.ylabel('inertia')
Out[4]:
The elbow of the curve is around k = 3. For values of k greater than 3, the inertia decreases only gradually.
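Eyeballing the elbow works here, but it can also be approximated programmatically. A minimal sketch of one simple heuristic (not part of the original analysis): take the k where the discrete second difference of the inertia curve, i.e. the sharpest kink, is largest.
In [ ]:
# second differences of the inertia curve; index i corresponds to k = i + 2
diffs = np.diff(inertias, n=2)
elbow_k = int(np.argmax(diffs)) + 2
print("estimated elbow at k =", elbow_k)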
Silhouette score¶
The silhouette score measures the degree of separation between clusters: for each point it compares the distance to its own cluster with the distance to the nearest other cluster, giving a value between -1 and 1 (higher means better-separated clusters).
In [5]:
# silhouette_score needs at least 2 clusters, so skip k = 1
for i in range(1, 9):
    print("---------------------------------------")
    print(clusters[i])
    print("Silhouette score:", silhouette_score(df, clusters[i].predict(df)))
As we can see from both the silhouette scores and the elbow method, the optimal number of clusters is 3.
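The same choice can be automated by taking the k with the highest silhouette score; a minimal sketch reusing the fitted models from above (k = 1 is skipped because the silhouette score is undefined for a single cluster):
In [ ]:
# clusters[1] was fitted with k = 2, so shift the argmax index by 2
scores = [silhouette_score(df, c.predict(df)) for c in clusters[1:]]
best_k = int(np.argmax(scores)) + 2
print("best k by silhouette:", best_k)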
In [6]:
# KMeans object with 3 clusters
km = KMeans(n_clusters=3, random_state=0)
In [7]:
# fit and predict labels
y_km = km.fit_predict(df)
y_km
Out[7]:
As expected, the observations from index 0 through index 8 all land in the same cluster; the dataset was constructed that way.
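A quick check of that claim; the exact integer label varies with initialization, so the sketch only verifies that a single label appears:
In [ ]:
# the first nine rows should all carry one cluster label
print(np.unique(y_km[:9]))   # expect a single value, e.g. array([0])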
Visualization¶
Centroid of each cluster¶
In [8]:
centers = km.cluster_centers_
centers
Out[8]:
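Note that these centroids are in standardized units. To read them on the original kilometers/minutes scale, the scaling can be undone with the scaler's inverse_transform:
In [ ]:
# map the standardized centroids back to the original feature units
centers_original = scaler.inverse_transform(centers)
centers_original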
Assigning each feature to a variable¶
In [12]:
# Make a scatter plot of distance vs. duration, using the cluster labels to set the colors
distance = df[:,0]
duration = df[:,1]
In [10]:
plt.scatter(distance, duration, c=y_km, alpha=0.5)
# mark each centroid by indexing into the 2-D centers array
plt.scatter(centers[0][0], centers[0][1], marker='*', color='r', s=150)
plt.scatter(centers[1][0], centers[1][1], marker='*', color='r', s=150)
plt.scatter(centers[2][0], centers[2][1], marker='*', color='r', s=150)
plt.xlabel('Distance (kilometers)')
plt.ylabel('Duration (minutes)')
Out[10]:
Note: our dataset has been standardized at this point, so the axes are in standardized units rather than raw kilometers and minutes.
We can now label each cluster like this:
- 1 (team_blue) = lazy
- 2 (team_yellow) = not so fast
- 3 (team_purple) = fast
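A minimal sketch of attaching those names to the data. One caveat: which integer KMeans assigns to which group changes with initialization, so the mapping below is an assumption that should be checked against the plot or the centroids first.
In [ ]:
# hypothetical label-to-name mapping -- verify against the centroids before trusting it
label_names = {0: 'lazy', 1: 'not so fast', 2: 'fast'}
dfx['team'] = pd.Series(y_km).map(label_names)
dfx.head()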
Just to recap, these are the pros and cons of using K-Means:
Pros:
- Easy to implement
- Only one main parameter to tune (k), and the direct impact of adjusting it is easy to see
Cons:
- Heavily affected by outliers
- Sensitive to random initialization (a mitigation sketch follows below)
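For the initialization issue, a common mitigation is k-means++ seeding combined with several restarts; scikit-learn keeps the run with the lowest inertia. A minimal sketch:
In [ ]:
# k-means++ seeding plus 10 restarts; the best run (lowest inertia) is kept
km_stable = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
km_stable.fit(df)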
fin