Garments Production (Work in Progress)

Garments Production

This dataset captures key attributes of the garment manufacturing process and the productivity of its employees. The data was collected manually and validated by industry experts.

Attribute Information:

  1. date : Date in MM-DD-YYYY format
  2. day : Day of the week
  3. quarter : A portion of the month; each month is divided into four quarters (with a fifth "Quarter5" covering the leftover days)
  4. department : The department associated with the instance
  5. team : The team number associated with the instance
  6. no_of_workers : Number of workers in the team
  7. no_of_style_change : Number of changes in the style of a particular product
  8. targeted_productivity : Targeted productivity set by the authority for each team for each day
  9. smv : Standard Minute Value, the time allocated for a task
  10. wip : Work in progress; the number of unfinished items for products
  11. over_time : The amount of overtime for each team, in minutes
  12. incentive : The amount of financial incentive (in BDT) that enables or motivates a particular course of action
  13. idle_time : The amount of time production was interrupted, for various reasons
  14. idle_men : The number of workers who were idle due to a production interruption
  15. actual_productivity : The actual productivity delivered by the workers, scaled from 0 to 1

Load Libraries and Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr
In [2]:
df = pd.read_csv('garments_worker_productivity.csv')
print(df.shape)
df.head()
(1197, 15)
Out[2]:
date quarter department day team targeted_productivity smv wip over_time incentive idle_time idle_men no_of_style_change no_of_workers actual_productivity
0 1/1/2015 Quarter1 sweing Thursday 8 0.80 26.16 1108.0 7080 98 0.0 0 0 59.0 0.940725
1 1/1/2015 Quarter1 finishing Thursday 1 0.75 3.94 NaN 960 0 0.0 0 0 8.0 0.886500
2 1/1/2015 Quarter1 sweing Thursday 11 0.80 11.41 968.0 3660 50 0.0 0 0 30.5 0.800570
3 1/1/2015 Quarter1 sweing Thursday 12 0.80 11.41 968.0 3660 50 0.0 0 0 30.5 0.800570
4 1/1/2015 Quarter1 sweing Thursday 6 0.80 25.90 1170.0 1920 50 0.0 0 0 56.0 0.800382
In [133]:
len(df[df['over_time']==0])
Out[133]:
31

EDA

Inspect, clean, and validate the data, and surface the underlying patterns and relationships within it.

Dataset general information

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   1197 non-null   object 
 1   quarter                1197 non-null   object 
 2   department             1197 non-null   object 
 3   day                    1197 non-null   object 
 4   team                   1197 non-null   int64  
 5   targeted_productivity  1197 non-null   float64
 6   smv                    1197 non-null   float64
 7   wip                    691 non-null    float64
 8   over_time              1197 non-null   int64  
 9   incentive              1197 non-null   int64  
 10  idle_time              1197 non-null   float64
 11  idle_men               1197 non-null   int64  
 12  no_of_style_change     1197 non-null   int64  
 13  no_of_workers          1197 non-null   float64
 14  actual_productivity    1197 non-null   float64
dtypes: float64(6), int64(5), object(4)
memory usage: 140.4+ KB

Fix the typos in the department column ('sweing', and 'finishing ' with a trailing space) and convert it to a categorical dtype

In [4]:
df.department.value_counts()
Out[4]:
sweing        691
finishing     257
finishing     249
Name: department, dtype: int64
In [5]:
df['department'] = df['department'].replace(['sweing'], 'sewing')
df['department'] = df['department'].replace(['finishing '], 'finishing')
df['department'] = df.department.astype('category')

df.department.value_counts()
Out[5]:
sewing       691
finishing    506
Name: department, dtype: int64

Duplicates

In [6]:
# flag rows that are exact duplicates of another row
dups = df.duplicated()
# count duplicates
dups.value_counts()
Out[6]:
False    1197
dtype: int64

There are no duplicates.

Clean the Data (Null Values)

Check for the frequency of nulls in all the columns

In [7]:
df.isna().sum()
Out[7]:
date                       0
quarter                    0
department                 0
day                        0
team                       0
targeted_productivity      0
smv                        0
wip                      506
over_time                  0
incentive                  0
idle_time                  0
idle_men                   0
no_of_style_change         0
no_of_workers              0
actual_productivity        0
dtype: int64

There is a large proportion of null values in wip (42%). The first step is to investigate how this data was collected and structured.

We need to understand what type of missing data we are dealing with (a compact check is sketched right after the list below):

  • Structurally missing data
  • Missing Completely at Random (MCAR)
  • Missing at Random (MAR)
  • Missing Not at Random (MNAR)
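
One compact way to run the same check that the next few cells do step by step is to cross-tabulate wip missingness against department; a minimal sketch:

# If all the missing wip rows fall in one department, the data is likely
# structurally missing rather than missing at random.
pd.crosstab(df['department'], df['wip'].isna())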

Check the shape of the finishing and sewing departments separately.

In [8]:
finishing = df[df['department']=='finishing']
sewing = df[df['department']=='sewing']

print(finishing.shape)
sewing.shape
(506, 15)
Out[8]:
(691, 15)

Check the total number of nulls in wip.

In [9]:
finishing.wip.isna().sum()
Out[9]:
506

So all of the NaN entries come from the finishing department. The next step would be to ask and verify whether there really is no pending work (wip) in the finishing department. Since there is no one to ask for this project, I will treat this missing data as structurally missing: there simply is no wip in the finishing department.

Now we will impute these NaN values. There are many imputation techniques (e.g. time-series LOCF and NOCB, pairwise or listwise deletion); in this scenario we will simply convert the NaNs to zero.
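
For reference, a minimal sketch of what time-series LOCF/NOCB imputation would look like in pandas (not applied here, since we fill with zero instead; it assumes the rows are already in time order):

# Last Observation Carried Forward (LOCF): fill each NaN with the previous value.
wip_locf = df['wip'].ffill()
# Next Observation Carried Backward (NOCB): fill each NaN with the next value.
wip_nocb = df['wip'].bfill()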

In [10]:
# Convert wip NaN values to zero
df['wip'] = df['wip'].fillna(0)

Visualize the null datapoints

We converted the nulls to 0 before plotting because matplotlib does not plot NaN values; without the conversion, those points would simply not appear in the plot.

In [11]:
# Code preparation for plotting.
# Map department to 0 and 1 as integers
df['dept_mapped'] = df.department.map({
                                'finishing':0,
                                'sewing':1})
# convert to int dtype
df.dept_mapped  = df.dept_mapped.astype('int64')
In [12]:
# plt.figure(figsize=(1, 1)) 
g = sns.lmplot(x='dept_mapped', y='wip', 
               data=df, 
               x_jitter=0.15, 
               y_jitter=0.15, 
               fit_reg=False, 
               scatter_kws={'alpha':0.2})

g.set(xticks=range(2))
g.set_xticklabels(['finishing', 'sewing'])
plt.show()

Convert date to datetime dtype

In [13]:
# date format is MM-DD-YYYY
df['date'] = pd.to_datetime(df.date, format="%m/%d/%Y")

Augment the data with additional columns (month and work_week)

In [14]:
# add month
df['month'] = df['date'].dt.month
In [15]:
# Adding `work_week` columns. It's easier to analyze production data per work week.
# create a list of our conditions
conditions = [
    
    #January
    (df['month'] == 1) & (df['quarter'] == 'Quarter1'), #ww1
    (df['month'] == 1) & (df['quarter'] == 'Quarter2'), #ww2
    (df['month'] == 1) & (df['quarter'] == 'Quarter3'), #ww3
    (df['month'] == 1) & (df['quarter'] == 'Quarter4'), #ww4
    (df['month'] == 1) & (df['quarter'] == 'Quarter5'), #ww4
    
    #February
    (df['month'] == 2) & (df['quarter'] == 'Quarter1'), #ww5
    (df['month'] == 2) & (df['quarter'] == 'Quarter2'), #ww6
    (df['month'] == 2) & (df['quarter'] == 'Quarter3'), #ww7
    (df['month'] == 2) & (df['quarter'] == 'Quarter4'), #ww8
    (df['month'] == 2) & (df['quarter'] == 'Quarter5'), #ww8
    
    #March
    (df['month'] == 3) & (df['quarter'] == 'Quarter1'), #ww9
    (df['month'] == 3) & (df['quarter'] == 'Quarter2'), #ww_10
    (df['month'] == 3) & (df['quarter'] == 'Quarter3'), #ww_11
    (df['month'] == 3) & (df['quarter'] == 'Quarter4'), #ww_12
    (df['month'] == 3) & (df['quarter'] == 'Quarter5'), #ww_12
    ]

values = ['ww1','ww2','ww3','ww4','ww4',
          'ww5','ww6','ww7','ww8','ww8',
          'ww9','ww_10','ww_11','ww_12','ww_12']

df['work_week'] = np.select(conditions, values)
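
The same grouping can be derived more compactly if preferred; a minimal sketch that folds Quarter5 into the fourth week of its month, mirroring the conditions above (it produces week numbers rather than the 'ww1'/'ww_10' labels used here):

# Quarter number 1-5, with Quarter5 clipped into week 4 of its month.
q_num = df['quarter'].str[-1].astype(int).clip(upper=4)
# Work-week number 1-12 across January-March (1 corresponds to ww1, 10 to ww_10, ...).
work_week_num = (df['month'] - 1) * 4 + q_num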

Quarter column to categorical dtype

In [16]:
quarter_order = ['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'Quarter5' ]
df.quarter = pd.Categorical(df.quarter, quarter_order, ordered=True)
# verify quarter order
df.quarter.unique()
Out[16]:
['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'Quarter5']
Categories (5, object): ['Quarter1' < 'Quarter2' < 'Quarter3' < 'Quarter4' < 'Quarter5']

Day column to categorical dtype

In [17]:
df.day.unique()
Out[17]:
array(['Thursday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday'],
      dtype=object)
In [18]:
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Saturday', 'Sunday' ]
df.day = pd.Categorical(df.day, day_order, ordered=True)
# verify day order
df.day.unique()
Out[18]:
['Thursday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday']
Categories (6, object): ['Monday' < 'Tuesday' < 'Wednesday' < 'Thursday' < 'Saturday' < 'Sunday']

Inspect columns with a large proportion of zero values

Let's investigate these zero values.

In [19]:
# how many zeros are there in each column?
# (df == 0).sum(axis=0) # uncomment to show result

print('Proportion of zeros: ')
(df == 0).sum(axis=0)  / len(df) * 100
Proportion of zeros: 
Out[19]:
date                      0.000000
quarter                   0.000000
department                0.000000
day                       0.000000
team                      0.000000
targeted_productivity     0.000000
smv                       0.000000
wip                      42.272348
over_time                 2.589808
incentive                50.459482
idle_time                98.496241
idle_men                 98.496241
no_of_style_change       87.719298
no_of_workers             0.000000
actual_productivity       0.000000
dept_mapped              42.272348
month                     0.000000
work_week                 0.000000
dtype: float64

We have a very large proportion of zero values in wip, incentive, idle_time, idle_men, and no_of_style_change.

  • wip: the 42% of zeros come from the finishing department; the remaining 58% are work in progress in the sewing department.
  • incentive: only about half of the production records received an incentive.
  • idle_time: this large proportion of zeros is a good indication that production is running smoothly.
  • idle_men: same as idle_time; 98% are zero, so most teams/workers are working without interruption.
  • no_of_style_change: 87.72% are zero. Only about 12% of records involve a change of product style; changing style may result in idle_time if not executed properly.

Feature Distribution

After doing some cleaning and data wrangling, we can now plot the histogram of the dataset for inspection.

In [20]:
# we will not include date in this visualization
df.drop(columns=['date']).hist(figsize=(10, 7))
plt.tight_layout()
plt.show()

Skewness

Skewness is a measure of the asymmetry of a distribution. Rules of thumb:

  • between -0.5 and 0.5, fairly symmetrical
  • between -1 and -0.5 or between 0.5 and 1, moderately skewed
  • less than -1 or greater than 1, highly skewed
In [21]:
# select numeric columns only
df_numeric = df.select_dtypes(include=np.number).columns.tolist()
for i in df_numeric:
    print(i + ' ==> ' + str(round(df[i].skew(),2)))
team ==> 0.01
targeted_productivity ==> -2.14
smv ==> 0.41
wip ==> 10.85
over_time ==> 0.67
incentive ==> 15.79
idle_time ==> 20.55
idle_men ==> 9.86
no_of_style_change ==> 2.94
no_of_workers ==> -0.11
actual_productivity ==> -0.81
dept_mapped ==> -0.31
month ==> 0.49

Kurtosis

Kurtosis measures the heaviness of a distribution's tails and hence how prone it is to outliers.

  • A tall, thin distribution is called leptokurtic (kurtosis > 3). Values in a leptokurtic distribution are concentrated near the mean or at the extremes.
  • A flat distribution whose values are moderately spread out is called platykurtic (kurtosis < 3).
  • A distribution in between is called mesokurtic (kurtosis = 3); it looks closest to a normal distribution.

Note

  1. High kurtosis indicates that the data has heavy outliers.
  2. Low kurtosis indicates a lack of outliers.
  3. pandas' .kurt() reports excess kurtosis (kurtosis minus 3), so the values below are measured relative to 0 rather than 3.
In [22]:
df.select_dtypes(include=np.number).kurt()
Out[22]:
team                      -1.223906
targeted_productivity      5.613701
smv                       -0.795346
wip                      141.332609
over_time                  0.424364
incentive                299.032462
idle_time                442.638160
idle_men                 102.962869
no_of_style_change         8.181490
no_of_workers             -1.788108
actual_productivity        0.333227
dept_mapped               -1.905055
month                     -1.057430
dtype: float64

Statistical Analysis

1. team

Fairly symmetrical and flat. We will group the dataset by team, then take the average productivity.

  • What is the best team so far?

Note:

  • Productivity is scaled from 0 to 1 (0-100%).
  • The mean is preferable to the median when the distribution is fairly symmetrical.
In [23]:
team = df.groupby('team').mean() # groupby then get the mean
team['diff_in_productivity'] = team.actual_productivity - team.targeted_productivity

# limit to 2 decimal places
team = team.round(2)

team[['targeted_productivity', 'actual_productivity', 'diff_in_productivity']]
Out[23]:
targeted_productivity actual_productivity diff_in_productivity
team
1 0.75 0.82 0.07
2 0.74 0.77 0.03
3 0.74 0.80 0.06
4 0.72 0.77 0.05
5 0.67 0.70 0.02
6 0.73 0.69 -0.05
7 0.71 0.67 -0.05
8 0.71 0.67 -0.03
9 0.76 0.73 -0.02
10 0.74 0.72 -0.02
11 0.70 0.68 -0.02
12 0.77 0.78 0.00

Top 3 Most Productive Teams

In [24]:
team.actual_productivity.sort_values(ascending=False).head(3)
Out[24]:
team
1     0.82
3     0.80
12    0.78
Name: actual_productivity, dtype: float64

Visualization

In [25]:
#
x = team.index
y = team.actual_productivity

plt.figure(figsize=(8,5))
ax1 = plt.subplot()
sns.barplot(x=x, y=y, palette = 'deep')
plt.title('Productivity by Team')

for i in ax1.containers:
    ax1.bar_label(i,)

plt.tight_layout()    
plt.show()

Team productivity by department

In [26]:
team_dept = df.groupby(['team','department']).mean() 
team_dept['diff_productivity'] = team_dept.actual_productivity - team_dept.targeted_productivity
team_dept = team_dept.round(2)

# Show the first 3 teams (2 rows per team: finishing and sewing)
team_dept[['targeted_productivity', 'actual_productivity', 'diff_productivity']].head(6)
Out[26]:
targeted_productivity actual_productivity diff_productivity
team department
1 finishing 0.75 0.83 0.08
sewing 0.74 0.82 0.07
2 finishing 0.75 0.78 0.03
sewing 0.73 0.76 0.03
3 finishing 0.74 0.85 0.11
sewing 0.74 0.78 0.03

Visualization

In [27]:
# code preparation for visualization
team_act_prod = []
for i in team_dept.actual_productivity:
    team_act_prod.append(i)

sewing = team_act_prod[1::2]
finishing= team_act_prod[0::2]

# team_dept has a (team, department) MultiIndex
# example of accessing a single entry:
# team 2, finishing dept would be:
# team_dept.actual_productivity[2]['finishing']
In [28]:
#
fig, ax = plt.subplots(figsize=(12,6))

x = np.arange(1,13)
width = 0.35  # the width of the bars

rects1 = ax.bar(x - width/2, finishing, width, label= 'finishing')
rects2 = ax.bar(x + width/2, sewing, width, label='sewing')
ax.set_xticks(x)
ax.set_xlabel('Teams')
ax.set_ylabel('Actual Productivity')

def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.title('Actual Productivity by Team Department')
autolabel(rects1)
autolabel(rects2)
plt.tight_layout()
plt.legend(loc='upper center')
plt.show()

Hypothesis

Two-sample t-test

Compare team 1 and team 3, since their productivity scores are pretty close.

  • Null: there is no significant difference between team 1 and team 3 with respect to their productivity.
  • Alt: there is a significant difference between team 1 and team 3.
In [121]:
from scipy.stats import ttest_ind
In [126]:
team1 = df[df.team == 1]
team3 = df[df.team == 3]
In [127]:
plt.hist(team1.actual_productivity, alpha=0.5, label='team1')
plt.hist(team3.actual_productivity,alpha=0.5, label='team3')
plt.legend()
plt.show()
In [129]:
# Check whether the standard deviations are roughly equal
# (a ratio between 0.9 and 1.1 is a common rule of thumb)
ratio = np.std(team1.actual_productivity) / np.std(team3.actual_productivity)
ratio
Out[129]:
1.0846997401445262
In [131]:
#run the t-test here:
tstat, pval =ttest_ind(team1.actual_productivity, team3.actual_productivity)
pval
Out[131]:
0.41425379288460873

With p = 0.41 (> 0.05) we fail to reject the null hypothesis: there is no significant difference in productivity between team 1 and team 3.
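
As an optional cross-check (a sketch, not part of the original analysis), Welch's t-test drops the equal-variance assumption entirely:

# Welch's t-test: like ttest_ind above, but without assuming equal variances.
tstat_w, pval_w = ttest_ind(team1.actual_productivity,
                            team3.actual_productivity,
                            equal_var=False)
print(pval_w)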

2. targeted_productivity

Heavily left skewed. This is good: targeted productivity should mostly be set to high values.

  • Targeted vs. actual productivity

Note

  • We group the dataset by work_week and take the mean. Alternatively, we could use the median for any work_week whose distribution is heavily skewed; for simplicity we use the mean. I already checked the overall distribution (ww1-ww10) and it is fairly bell shaped.
In [29]:
print('Average Productivity per work_week')
target_actual = df.groupby('work_week').mean()
target_actual[['targeted_productivity', 'actual_productivity']]
Average Productivity per work_week
Out[29]:
targeted_productivity actual_productivity
work_week
ww1 0.774603 0.779225
ww2 0.746617 0.749937
ww3 0.715625 0.717431
ww4 0.704678 0.767010
ww5 0.737345 0.770341
ww6 0.723874 0.741566
ww7 0.727551 0.690275
ww8 0.729752 0.669766
ww9 0.718595 0.705212
ww_10 0.720879 0.737224

Visualization

In [30]:
# setting variables
# target productivity average per workweek
tp_avg = df.groupby('work_week')['targeted_productivity'].mean()
# actual productivity average per workweek
ap_avg = df.groupby('work_week')['actual_productivity'].mean()
In [31]:
plt.figure(figsize=(10,5))

plt.plot(tp_avg.index, tp_avg, marker='o')
plt.plot(ap_avg.index, ap_avg, marker='o')

plt.xlabel('work weeks')
plt.ylabel('productivity score')
plt.legend(['targeted productivity', 'actual productivity'])

plt.show()

ww4 and ww5 look very good; after ww5, however, actual productivity fell well below target.

In [32]:
# Checking the reason behind the crash
idle = df.groupby('work_week').sum()
idle['idle_time']
Out[32]:
work_week
ww1        0.0
ww2        0.0
ww3        0.0
ww4        0.0
ww5      810.0
ww6        0.0
ww7       38.0
ww8        8.0
ww9       18.0
ww_10      0.0
Name: idle_time, dtype: float64

There was a major downtime (810 idle minutes) during ww5, which helps explain the drop in actual productivity that followed.

3. smv

The distribution is fairly symmetrical.

  • Check whether other variables are correlated with smv (e.g. is smv correlated with over_time?).
In [33]:
def correlation_checker(col):
    plt.figure(figsize=(6, 6)) 
    corr_matrix = df.corr()
 
    corr_target = corr_matrix[[col]].drop(labels=[col])
    corr_target = corr_target.sort_values(by=[col], ascending=False)

    sns.heatmap(corr_target, annot=True, fmt='.3', cmap='RdBu_r')
    plt.show()
In [34]:
correlation_checker('smv')

Three features have a high correlation with smv. I am most interested in no_of_workers and over_time, so let's scatter-plot these features against smv.

In [35]:
df_pairplot = df[['smv', 'no_of_workers', 'over_time' ]]
In [36]:
sns.pairplot(df_pairplot)
plt.show()

As smv (the allocated time for a task) increases, the number of workers and the overtime also increase. Increasing no_of_workers and over_time as smv grows helps ensure that production can finish the task on time.
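
To put numbers on these relationships, a quick sketch using pearsonr (already imported above):

# Pearson correlation of smv with no_of_workers and with over_time.
corr_workers, p_workers = pearsonr(df['smv'], df['no_of_workers'])
corr_ot, p_ot = pearsonr(df['smv'], df['over_time'])
print(f'smv vs no_of_workers: r = {corr_workers:.3f} (p = {p_workers:.3g})')
print(f'smv vs over_time:     r = {corr_ot:.3f} (p = {p_ot:.3g})')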

4. wip

Heavily right skewed; as seen in the graph above, most of the datapoints are at zero and only a few have a value. Kurtosis of about 141 indicates a very tall distribution (the mode accounts for a very high proportion of the data).

wip represents tasks that are not yet finished, so we can view it as delay. Let's check which work_week has the most wip.

In [37]:
# sum all wip per work_week
df_wip = df.groupby('work_week').sum()
df_wip['wip']
Out[37]:
work_week
ww1       67385.0
ww2       66903.0
ww3       79075.0
ww4      114889.0
ww5      205579.0
ww6       61737.0
ww7       53760.0
ww8       65084.0
ww9       60416.0
ww_10     47784.0
Name: wip, dtype: float64
In [38]:
plt.figure(figsize=(8,5))
ax1 = plt.subplot()
sns.barplot(x= df_wip.index ,
            y= df_wip['wip'],
            palette = 'deep')

plt.title('Work in Progress')
for i in ax1.containers:
    ax1.bar_label(i,)

plt.tight_layout()    
plt.show()

There is a large amount of work in progress in ww5. This is expected, since we already saw major idle_time (downtime) in ww5. It would be worth finding out what really happened in ww5.

5. over_time

Moderately right skewed. This is also a good sign; we want to reduce overtime while maintaining high productivity.

  • How much over_time does production log per work_week, and with how many workers?
In [39]:
# set variables
ww = df.groupby('work_week').sum() # we use groupby then sum
ww['cummulative'] = ww.over_time.cumsum()
ww[['no_of_workers', 'over_time', 'cummulative']]
Out[39]:
no_of_workers over_time cummulative
work_week
ww1 4304.5 642450 642450
ww2 4303.5 777630 1420080
ww3 4216.0 722160 2142240
ww4 5720.5 895050 3037290
ww5 4182.0 479160 3516450
ww6 4155.0 370620 3887070
ww7 3436.5 306000 4193070
ww8 4210.0 471980 4665050
ww9 4057.0 491520 5156570
ww_10 2843.0 310680 5467250
  • There are 642,450 overtime minutes in work week 1, spread over roughly 4,305 workers: about 149 minutes, or about 2.5 OT hours, per worker (642450 / 4305 ≈ 149 minutes, divided by 60 ≈ 2.5 hours). The same figure is computed for every work week in the sketch below.
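
The per-worker figure can be computed for every work week at once; a quick sketch using the grouped sums in ww:

# Average overtime hours per worker for each work week:
# total OT minutes / total workers / 60.
ot_hours_per_worker = ww['over_time'] / ww['no_of_workers'] / 60
print(ot_hours_per_worker.round(2))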

Visualization

In [40]:
#
plt.figure(figsize=(14, 6)) 

# set variables
x = ww.index
y = ww.over_time

# plot1
# sum of overtime per work_week
ax1 = plt.subplot(1,2,1)
sns.barplot(x = x, y = y, data=df)
ax1.set_xticklabels(ax1.get_xticklabels(),rotation = 80)
plt.title('Overtime per Work Week ')

for i in ax1.containers:
    ax1.bar_label(i,)


#plot2
# cumulative sum of work_week
ax2 = plt.subplot(1,2,2)
sns.lineplot(x=x, y=ww.over_time.cumsum(), marker='o', )
plt.title('Cumulative Overtime')

plt.tight_layout()
plt.show()

6. incentive

Heavily right skewed; most observations received no incentive. The very high kurtosis (about 299) indicates that a single value (zero) accounts for a large proportion of the observations.

  • Do these high incentive values correlate with actual_productivity?
  • Take note: the outliers deserve an investigation of their own.
In [41]:
from scipy.stats import pearsonr
corr, p = pearsonr(df.incentive, df.actual_productivity)
corr, p
Out[41]:
(0.07653762727386497, 0.008069571804677655)
In [42]:
plt.figure(figsize=(10, 5)) 

ax1 = plt.subplot(1,2,1)
ax1 =  sns.regplot(x=df.actual_productivity, y=df.incentive, scatter_kws={'alpha':0.3})

# Zoom in 
ax2 = plt.subplot(1,2,2)
plt.title('Zoom in')
plt.scatter(df.actual_productivity, df.incentive, alpha=0.3)
plt.ylim(0,200)
plt.xlabel('actual_productivity')
plt.ylabel('incentive')

plt.show()

The Pearson score is nearly zero, which suggests no correlation. The 50% of zero values in the incentive column is distorting the analysis.

Most of the non-zero incentives fall between 25 and 150, so how did we end up with values around 3,500? That requires further investigation that we cannot carry out with this dataset alone.

Without these zero values and outliers, I think we can extract real meaning from this data.

Note:

  • A large proportion of identical values (no variation) gives us no insight.
  • 50% of the incentive values are zero (a large proportion of zeros does not add much to the analysis).
  • There are 10 outliers (1.69% of the non-zero incentives).
  • These outliers all happened on a single day (see below).

Pairwise deletion/replacement

Delete or replace values only in the column we are currently working on. Since plotting functions skip NaN values, we can now plot again without these zeros.

In [43]:
import warnings
warnings.filterwarnings('ignore')


# replace zero value in incentive w/ np.nan
df.incentive = df['incentive'].replace(0,np.nan)

# drop np.nan values
df_x = df.dropna(subset=['incentive'], axis=0)

# impute outliers with mean value
df_x['incentive'] = df_x['incentive'].where(df_x['incentive'] < 500, 
                                            df_x.incentive.mean())
In [44]:
corr, p = pearsonr(df_x.incentive, df_x.actual_productivity)
corr
Out[44]:
0.7786551857462339
In [45]:
plt.figure(figsize=(10, 5)) 

ax1 = plt.subplot(1,2,1)
ax1 =  sns.regplot(x=df_x.actual_productivity, y=df_x.incentive, scatter_kws={'alpha':0.3})

# Zoom in 
ax2 = plt.subplot(1,2,2)
plt.title('Zoom in')
plt.scatter(df_x.actual_productivity, df_x.incentive, alpha=0.3)
# plt.xlim(0,200)
plt.xlabel('actual_productivity')
plt.ylabel('incentive')

plt.show()

Without the zero values and outliers, we get a meaningful insight.

The correlation is now about 0.78, suggesting that incentive and actual_productivity are directly related.

Follow-up question:

When did these extremely high incentives happen?

In [46]:
# check details for this extreme incentives
df[df['incentive']>2500]
Out[46]:
date quarter department day team targeted_productivity smv wip over_time incentive idle_time idle_men no_of_style_change no_of_workers actual_productivity dept_mapped month work_week
1130 2015-03-09 Quarter2 finishing Monday 5 0.60 3.94 0.0 0 2880.0 0.0 0 0 12.0 0.864343 0 3 ww_10
1133 2015-03-09 Quarter2 finishing Monday 9 0.75 2.90 0.0 0 3600.0 0.0 0 0 15.0 0.841000 0 3 ww_10

These extreme incentives happened in ww_10, specifically in the finishing department.

In [47]:
ww10 = df[
          (df.work_week == 'ww_10') & 
          (df.day == 'Monday') &
          (df.department == 'finishing')
         ] 

ww10.sort_values(by='team')
Out[47]:
date quarter department day team targeted_productivity smv wip over_time incentive idle_time idle_men no_of_style_change no_of_workers actual_productivity dept_mapped month work_week
1139 2015-03-09 Quarter2 finishing Monday 1 0.75 3.94 0.0 0 960.0 0.0 0 0 8.0 0.794567 0 3 ww_10
1143 2015-03-09 Quarter2 finishing Monday 2 0.70 3.90 0.0 0 1200.0 0.0 0 0 10.0 0.682500 0 3 ww_10
1137 2015-03-09 Quarter2 finishing Monday 3 0.80 4.60 0.0 0 1440.0 0.0 0 0 12.0 0.795417 0 3 ww_10
1138 2015-03-09 Quarter2 finishing Monday 4 0.75 3.94 0.0 0 960.0 0.0 0 0 8.0 0.795388 0 3 ww_10
1130 2015-03-09 Quarter2 finishing Monday 5 0.60 3.94 0.0 0 2880.0 0.0 0 0 12.0 0.864343 0 3 ww_10
1149 2015-03-09 Quarter2 finishing Monday 8 0.65 3.90 0.0 0 960.0 0.0 0 0 8.0 0.264062 0 3 ww_10
1133 2015-03-09 Quarter2 finishing Monday 9 0.75 2.90 0.0 0 3600.0 0.0 0 0 15.0 0.841000 0 3 ww_10
1148 2015-03-09 Quarter2 finishing Monday 10 0.70 2.90 0.0 0 960.0 0.0 0 0 8.0 0.477292 0 3 ww_10
1128 2015-03-09 Quarter2 finishing Monday 11 0.80 2.90 0.0 0 960.0 0.0 0 0 8.0 0.960625 0 3 ww_10
1129 2015-03-09 Quarter2 finishing Monday 12 0.80 4.60 0.0 0 1080.0 0.0 0 0 9.0 0.902963 0 3 ww_10

Check whether the count of these extreme incentives equals the number of teams that received an incentive on 2015-03-09 (ww_10).

In [48]:
len(df[df['incentive']>500])
Out[48]:
10

Our suspicion is correct: all ten incentives above 500 were paid on 2015-03-09. Was there a special event that day that led management to hand out such large incentives?
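
One way to verify that all ten of these large incentives fall on the same date (a quick sketch):

# Count the dates on which incentives above 500 were paid.
df[df['incentive'] > 500]['date'].value_counts()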

7. no_of_workers

The distribution looks bimodal; are there two groups or kinds of workers?

In [49]:
# import sklearn libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import warnings

Check which features have a strong correlation with no_of_workers.

In [50]:
correlation_checker('no_of_workers')
In [51]:
# subset of our dataset
df_cluster = df[['no_of_workers', 'smv']]

# Normalise our data set
scaler = StandardScaler()
df_cluster = scaler.fit_transform(df_cluster)

Finding the best k value

A good model is one with low inertia and a low number of clusters (K)

In [52]:
clusters = []
inertias = []

# Check what the best 'k' value is
# by trying k = 1 to 9
for k in range(1,10):
  km = KMeans(n_clusters=k, random_state=0)
  km.fit(df_cluster)
  
  # append clusters and inertias
  inertias.append(km.inertia_)
  clusters.append(km)
  
plt.plot(range(1, len(inertias) + 1), inertias, '-o')
plt.xlabel('number of clusters (k)')
plt.ylabel('inertia')
plt.show()

Check the silhouette score

In [53]:
for i in range(1,9,1): 
    print("---------------------------------------")
    print(clusters[i])
    print("Silhouette score:",silhouette_score(df_cluster, clusters[i].predict(df_cluster)))
---------------------------------------
KMeans(n_clusters=2, random_state=0)
Silhouette score: 0.7567741168682008
---------------------------------------
KMeans(n_clusters=3, random_state=0)
Silhouette score: 0.7015783380024647
---------------------------------------
KMeans(n_clusters=4, random_state=0)
Silhouette score: 0.6821538274382501
---------------------------------------
KMeans(n_clusters=5, random_state=0)
Silhouette score: 0.701341422245828
---------------------------------------
KMeans(n_clusters=6, random_state=0)
Silhouette score: 0.7306969976590627
---------------------------------------
KMeans(n_clusters=7, random_state=0)
Silhouette score: 0.6648649287636839
---------------------------------------
KMeans(random_state=0)
Silhouette score: 0.6699003925114578
---------------------------------------
KMeans(n_clusters=9, random_state=0)
Silhouette score: 0.6444164647470878

2 clusters seem pretty good. Let’s apply clustering using k=2.

Kmeans with 2 clusters

In [54]:
km = KMeans(n_clusters=2, random_state=0)

# fit and predict labels
y_km = km.fit_predict(df_cluster)
y_km
Out[54]:
array([0, 1, 1, ..., 1, 1, 1])

Assigning variables

In [55]:
# Note:
# first column is the no_of_workers
# 2nd column is the smv
df_cluster[0:5]
Out[55]:
array([[ 1.0992288 ,  1.01455214],
       [-1.19926822, -1.01677766],
       [-0.18522542, -0.33387786],
       [-0.18522542, -0.33387786],
       [ 0.96402309,  0.99078321]])
In [56]:
no_of_workers = df_cluster[:,0]
smv = df_cluster[:,1]

# first 5 records
print(no_of_workers[0:5])
print(smv[0:5])
[ 1.0992288  -1.19926822 -0.18522542 -0.18522542  0.96402309]
[ 1.01455214 -1.01677766 -0.33387786 -0.33387786  0.99078321]

Setting up centroids

Center point of each cluster.

In [57]:
centers = km.cluster_centers_
centers
Out[57]:
array([[ 0.8619967 ,  0.8127223 ],
       [-1.03123275, -0.9722843 ]])

Visualized

In [58]:
plt.rcParams['axes.facecolor'] = 'gray'

plt.figure(figsize=(8,5))
plt.scatter(no_of_workers, smv, c=y_km, alpha=0.5)

plt.scatter(centers[0][0], centers[0][1], marker='*', color ='purple', s=300, label ='purple centroid (0)', edgecolor = 'black')
plt.scatter(centers[1][0], centers[1][1], marker='*', color ='yellow', s=300, label ='yellow centroid (1)', edgecolor = 'black')

plt.xlabel('number of workers')
plt.ylabel('smv(allocated time for a task)')
plt.legend()
plt.title('Scaled clustering')
plt.show()

Scatter plot again, but this time reverting to the unscaled values

In [59]:
# turn our predicted value first into dataframe
group_worker = pd.DataFrame(y_km)

# concatenate, assign it to a new variable
df_new = pd.concat([df, group_worker], axis=1, join='inner')

# rename column
df_new = df_new.rename(columns={0: 'group_worker'})
In [75]:
plt.rcParams['axes.facecolor'] = 'gray'

sns.scatterplot(x= 'no_of_workers', y= 'smv', data=df_new, hue='group_worker', palette=['purple','yellow'])
plt.show()
plt.clf()

We have two groups to choose from depending on the smv value (a small prediction sketch follows this list):

  • yellow = fewer workers needed
  • purple = more workers needed
  1. smv 0 to 10: yellow group, needs up to ~30 workers
  2. smv 11 to 50: purple group, needs roughly 30 to 60 workers
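
The fitted scaler and KMeans model can also be used to place a new task into one of the two groups; a minimal sketch (the smv and worker numbers below are made up for illustration):

# Ask the fitted model which group a hypothetical task falls into.
# The input must be scaled with the same scaler used for training;
# the column order matches df_cluster: [no_of_workers, smv].
candidate = pd.DataFrame({'no_of_workers': [45], 'smv': [22.5]})
km.predict(scaler.transform(candidate))   # 0 = purple group, 1 = yellow group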

8. no_of_style_change

Does no_of_style_change produce idle_time?

Frequency of unique value in no_of_style_change

In [61]:
df.no_of_style_change.value_counts()
Out[61]:
0    1050
1     114
2      33
Name: no_of_style_change, dtype: int64

Proportion

In [62]:
import plotly.express as px
fig = px.pie(df, 'no_of_style_change')

fig.update_layout(title="Proportion of no_of_style_change")
fig.update_traces(textposition='inside',
                  textinfo='percent+label', showlegend=True)

fig.update_traces(pull= 0.02)
fig.show('png')

Correlation between no_of_style_change and idle_time.

In [63]:
from scipy.stats import pearsonr
corr, p = pearsonr(df.no_of_style_change, df.idle_time)
corr, p
Out[63]:
(-0.011597866663013127, 0.6885278961213044)

There is no meaningful correlation between no_of_style_change and idle_time (r ≈ -0.01), and the p-value is very high (0.69), so the relationship is not statistically significant.

In [77]:
# visualize
plt.rcParams['axes.facecolor'] = 'white'
# sns.set(rc={'figure.figsize':(8,8)})

ax1 = sns.lmplot(x='no_of_style_change', y='idle_time', 
               data=df, 
               x_jitter=0.15, 
               y_jitter=0.15, 
               fit_reg=False, 
               scatter_kws={'alpha':0.5})

ax1.set(xticks=range(3))
ax1.set_xticklabels(df.no_of_style_change.unique())
ax1.set(ylim=(0,10))

# 2nd plot
ax2 = sns.lmplot(x='no_of_style_change', y='idle_time', 
               data=df, 
               x_jitter=0.15, 
               y_jitter=0.15, 
               fit_reg=False, 
               scatter_kws={'alpha':0.5})

ax2.set(xticks=range(3))
ax2.set_xticklabels(df.no_of_style_change.unique())
ax2.set(ylim=(0,.25))
plt.title('ZOOM IN')

plt.show()

How many records are there for each no_of_style_change value?

In [78]:
change_style_idle = df[df['idle_time']>=0]

# count the number of records for each no_of_style_change value
change_style_idle.groupby('no_of_style_change').count()[['idle_time']]
Out[78]:
idle_time
no_of_style_change
0 1050
1 114
2 33

There are 114 records with one style change and 33 records with two style changes.

  • Changing style appears to cause idle time ranging from roughly 15 seconds up to 8 minutes (a summary sketch follows).
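
A compact way to back that range up is to summarise idle_time within each no_of_style_change group; a sketch:

# Distribution of idle_time (in minutes) for each number of style changes,
# keeping only rows where some idle time was actually recorded.
df[df['idle_time'] > 0].groupby('no_of_style_change')['idle_time'].describe()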

10. Productivity per month

In [79]:
jan = df[df['month']==1]
feb = df[df['month']==2]
mar = df[df['month']==3]

plt.hist(jan.actual_productivity, alpha=.5, label='January')
plt.hist(feb.actual_productivity, alpha=.5, label='February')
plt.hist(mar.actual_productivity, alpha=.5, label='March')

plt.xlabel('Actual Productivity')
plt.ylabel('Count')
plt.legend()
plt.show()

Productivity was very good in January.

Note: there are records of productivity higher than 1 (we assumed productivity is scaled from 0 to 1).

In [80]:
feb_higher_prod = feb[feb.actual_productivity >1][['actual_productivity']]
print(len(feb_higher_prod))
feb_higher_prod.sort_values(by='actual_productivity', ascending=False).head(3)
28
Out[80]:
actual_productivity
766 1.120437
767 1.108125
730 1.100484

There are 28 February entries with a productivity score higher than our intended maximum of 1. Is this a typo or a system error? We will leave this question open for now.
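
For context, the same check can be run on the whole dataset rather than February alone (a quick sketch):

# Total number of records where actual_productivity exceeds 1.
(df['actual_productivity'] > 1).sum()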

Hypothesis Test

ANOVA and Tukey's HSD

Test the average productivity per month.

  • Null: all months have the same average productivity.
  • Alt: the months do not all have the same average productivity.
In [98]:
df['month_name'] = df.month.map({
    1:'January',
    2:'February',
    3:'March'
})
In [100]:
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
In [101]:
fstat, pval = f_oneway(jan.actual_productivity,
                       feb.actual_productivity,
                       mar.actual_productivity)
pval
Out[101]:
0.0011729941363264461
In [102]:
tukey_results =  pairwise_tukeyhsd(df.actual_productivity, df.month_name, 0.05)
print(tukey_results)
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
=======================================================
 group1   group2 meandiff p-adj   lower   upper  reject
-------------------------------------------------------
February January   0.0375 0.0022  0.0114  0.0636   True
February   March    0.001 0.9974  -0.033   0.035  False
 January   March  -0.0365 0.0261 -0.0695 -0.0035   True
-------------------------------------------------------
In [103]:
sns.boxplot(x='month_name', y='actual_productivity', data=df )
plt.show()
  • January and February show a significant difference in productivity.
  • February and March show no significant difference.
  • January and March show a significant difference.

EDA prior to Machine Learning

Work in progress…

Outliers

In [81]:
#define functions
def showoutliers(df, column_name = ""):
        iqr = df[column_name].quantile(.75) - df[column_name].quantile(.25)
        
        # lower whisker
        lowerbound = (df[column_name].quantile(.25)) - iqr * 1.5 
        # upper whisker
        upperbound = (df[column_name].quantile(.75)) + iqr * 1.5
        
# datapoints beyond the lower whisker
        lowerbound_outliers = df[df[column_name] < lowerbound]
        
# datapoints beyond the upper whisker
        higherbound_outliers = df[df[column_name] > upperbound]
        
        # outliers
        outliers = pd.concat([lowerbound_outliers,higherbound_outliers])
        return outliers
def countoutliers(df, column_name = ""):
        iqr = df[column_name].quantile(.75) - df[column_name].quantile(.25)
        lowerbound = (df[column_name].quantile(.25)) - iqr * 1.5
        upperbound = (df[column_name].quantile(.75)) + iqr * 1.5
        lowerbound_outliers = df[df[column_name] < lowerbound]
        higherbound_outliers = df[df[column_name] > upperbound]
        outliers = pd.concat([lowerbound_outliers,higherbound_outliers])
        count = len(outliers)
        return {column_name : count}
def Replace_Outliers(df_name, value, column_name = ""):
    iqr = df_name[column_name].quantile(.75) - df_name[column_name].quantile(.25)
    
    lowerbound = (df_name[column_name].quantile(.25)) - iqr * 1.5
    upperbound = (df_name[column_name].quantile(.75)) + iqr * 1.5
            
    df_name[column_name] = np.where(df_name[column_name] > upperbound, value, df_name[column_name])
    df_name[column_name] = np.where(df_name[column_name] < lowerbound, value, df_name[column_name])

Create a dataset with only numeric values

In [82]:
# selected features only
df_n = df[[ 'targeted_productivity', 'smv', 'wip', 'over_time', 'incentive', 'no_of_workers', 'actual_productivity']]
# df_n

Count outliers

In [83]:
column_list = df_n.columns
column_list = np.array(column_list)
for i in df_n:
    print (countoutliers(df_n, i))
{'targeted_productivity': 79}
{'smv': 0}
{'wip': 9}
{'over_time': 1}
{'incentive': 34}
{'no_of_workers': 0}
{'actual_productivity': 54}

Proportion outliers

In [84]:
for i in column_list:
    col = i
    perc = countoutliers(df_n, i)[i] / len(df_n)
    print (col + ': ' + str('{:.2f}'.format(perc*100)) + '%') 
targeted_productivity: 6.60%
smv: 0.00%
wip: 0.75%
over_time: 0.08%
incentive: 2.84%
no_of_workers: 0.00%
actual_productivity: 4.51%
In [85]:
df_n.plot(kind='box', 
          subplots=True, 
          sharey=False, 
          figsize=(20, 7))
# increase spacing between subplots
plt.subplots_adjust(wspace=0.5) 

plt.tight_layout()
plt.show()

One column caught my attention: incentive. It is the most heavily skewed and has a very high kurtosis. Let's focus on it and do some imputation.

Replace outliers with the median (50th percentile) of the distribution

In [86]:
df_new.incentive.describe()
Out[86]:
count    593.000000
mean      50.065767
std       17.237237
min       21.000000
25%       38.000000
50%       50.000000
75%       60.000000
max      100.000000
Name: incentive, dtype: float64
In [87]:
Replace_Outliers(df_new, 50, 'incentive')

# visualize
plt.figure(figsize=(4,3))
sns.boxplot(x='incentive', data=df_new)
plt.show()

To be continued…
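
The modelling step itself is still to come. As a placeholder, here is a minimal sketch of how the libraries imported at the top (train_test_split, StandardScaler, LinearRegression, and the regression metrics) could be wired together, assuming we predict actual_productivity from the numeric features explored above; the feature list is an assumption, not a final choice:

# Baseline regression sketch (illustrative only).
features = ['targeted_productivity', 'smv', 'wip', 'over_time',
            'incentive', 'no_of_workers']
X = df[features].fillna(0)   # incentive still holds NaNs from the pairwise step
y = df['actual_productivity']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale the features, then fit a plain linear regression.
scaler_lr = StandardScaler()
X_train_s = scaler_lr.fit_transform(X_train)
X_test_s = scaler_lr.transform(X_test)

model = LinearRegression()
model.fit(X_train_s, y_train)
pred = model.predict(X_test_s)

print('R2 :', r2_score(y_test, pred))
print('MAE:', mean_absolute_error(y_test, pred))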

