Work in Progress…¶
Garments Production¶
Attribute Information:
- date : Date in MM-DD-YYYY
- day : Day of the Week
- quarter : A portion of the month. A month was divided into four quarters
- department : Associated department with the instance
- team : Associated team number with the instance
- no_of_workers : Number of workers in each team
- no_of_style_change : Number of changes in the style of a particular product
- targeted_productivity : Targeted productivity set by the Authority for each team for each day.
- smv : Standard Minute Value, it is the allocated time for a task
- wip : Work in progress. Includes the number of unfinished items for products
- overtime : Represents the amount of overtime by each team in minutes
- incentive : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.
- idletime : The amount of time when the production was interrupted due to several reasons
- idlemen : The number of workers who were idle due to production interruption
- actual_productivity : The actual % of productivity that was delivered by the workers. It ranges from 0-1.
Load Libraries and Dataset¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr
df = pd.read_csv('garments_worker_productivity.csv')
print(df.shape)
df.head()
EDA¶
Inspect, Clean, and Validate. Show underlying patterns and relationships within datasets.
Firstly we need to show all the variable distribution for reference.¶
# we will not include date in this visualization
df.drop(columns=['date']).hist(figsize=(10, 7))
plt.tight_layout()
plt.show()
Check for skewness¶
Skewness is a measure of asymmetry of a distribution.
Important Notes:
- between -0.5 and 0.5, fairly symmetrical
- between -1 and -0.5 or between 0.5 and 1, moderately skewed
- less than -1 or greater than 1, highly skewed
# selected columns only
df_int = ['team', 'targeted_productivity', 'smv', 'wip', 'over_time', 'incentive', 'no_of_workers', 'actual_productivity']
for i in df_int:
print(i + ' ==> ' + str(round(df[i].skew(),2)))
Kurtosis()¶
kurtosis determines the heaviness of the distribution tails. Determine the volume of the outlier.
- If the distribution is tall and thin it is called a
leptokurtic
distribution(Kurtosis > 3). Values in a leptokurtic distribution are near the mean or at the extremes. - A flat distribution where the values are moderately spread out is called
platykurtic
(Kurtosis <3) distribution. - A distribution whose shape is in between a leptokurtic distribution and a platykurtic distribution is called a
mesokurtic
(Kurtosis=3) distribution. A mesokurtic distribution looks more close to a normal distribution.
‘Note’
- High kurtosis in a df set is an indicator that df has heavy outliers.
- Low kurtosis in a df set is an indicator that df has lack of outliers.
- If kurtosis value + means pointy and — means flat.
df.kurt()
team
s Activity seems fairly distributed and flattarget productivity
Heavily left skewed which is a great. Target productivy should be in increasing value.smv
Distribution is faily symmetricalwip
Heavily right skewed, as seen in above graph outliers are affecting the distribution!. Kurtosis at 110 indicating that the distribution is very tall. Also distribution has heavy outliers.over_time
Moderately skewedincentive
Heavily right skewed. There must some extremely high incentives!. Kurtosis at 299 we should do an investigation with the outliers.no_of_worker
Looks like a bimodal distribution, we can form a clusters/groups for this kind of distribution.Actual_productiviy
Moderately left skewed. This is a good sign.
Dataset general information¶
df.info()
task:
Work in Progress…
Rename department column wrong typo’s and convert to categorical dtypes¶
df.department.value_counts()
df['department'] = df['department'].replace(['sweing'], 'sewing')
df['department'] = df['department'].replace(['finishing '], 'finishing')
df['department'] = df.department.astype('category')
df.department.value_counts()
Duplicates¶
# variable that store duplicates exactly the same as another row
dups = df.duplicated()
# count duplicates
dups.value_counts()
Theres no duplicates.
Clean the Data (Null Values)¶
Check for the frequency of nulls in all the columns
df.isna().sum()
Theres a large proportion of null values in wip
(42%). The first approach is to investigate how this data was collected or structured.
We need to undestand what type of missing data we are dealing:
- Structurally missing data
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)
Check the shape of finishing and sewing department seperately.
finishing = df[df['department']=='finishing']
sewing = df[df['department']=='sewing']
print(finishing.shape)
sewing.shape
Check the total number of nulls in wip.
finishing.wip.isna().sum()
Nova! all of the NaN entries are comming from finishing
department. What we need to so now is to ask and verify if there is really no pending works(wip) in the finishing department, just to be sure. Since there is no one to ask for this project I will conclude that this missing data is labeled as Structurally Missing
thus there is no wip in the finishing department.
The graph shows that there is no records of
Now were gonna impute these NaN values. We have a lot of techniques in imputing Nan’s (ex. time series LOCF and NOCB, pairwise, listwise etc.) in this scenario we will convert these NaN in to zero.
# Converting wip NaN values to zero
df.wip.fillna(0, inplace=True)
Visualize the null datapoints¶
We convert first all nulls into 0 before plotting thats because matplot library does’nt read nulls so if we plot our variables without converting, null will not show in the plot.
# Code preparation for plotting.
# Mapping departent to 0 and 1 as integer
df['dept_mapped'] = df.department.map({
'finishing':0,
'sewing':1})
# convert to int dtype
df.dept_mapped = df.dept_mapped.astype('int64')
# plt.figure(figsize=(1, 1))
g = sns.lmplot(x='dept_mapped', y='wip',
data=df,
x_jitter=0.15,
y_jitter=0.15,
fit_reg=False,
scatter_kws={'alpha':0.2})
g.set(xticks=range(2))
g.set_xticklabels(['finishing', 'sewing'])
plt.show()
Convert date to datetime dtype¶
# date format is MM-DD-YYYY
df['date'] = pd.to_datetime(df.date, format="%m/%d/%Y")
- Augment data with additional columns
# add month
df['month'] = df['date'].dt.month
# Adding `work_week` columns. It's easier to analyze production data per work week.
# create a list of our conditions
conditions = [
#January
(df['month'] == 1) & (df['quarter'] == 'Quarter1'), #ww1
(df['month'] == 1) & (df['quarter'] == 'Quarter2'), #ww2
(df['month'] == 1) & (df['quarter'] == 'Quarter3'), #ww3
(df['month'] == 1) & (df['quarter'] == 'Quarter4'), #ww4
(df['month'] == 1) & (df['quarter'] == 'Quarter5'), #ww4
#February
(df['month'] == 2) & (df['quarter'] == 'Quarter1'), #ww5
(df['month'] == 2) & (df['quarter'] == 'Quarter2'), #ww6
(df['month'] == 2) & (df['quarter'] == 'Quarter3'), #ww7
(df['month'] == 2) & (df['quarter'] == 'Quarter4'), #ww8
(df['month'] == 2) & (df['quarter'] == 'Quarter5'), #ww8
#March
(df['month'] == 3) & (df['quarter'] == 'Quarter1'), #ww9
(df['month'] == 3) & (df['quarter'] == 'Quarter2'), #ww_10
(df['month'] == 3) & (df['quarter'] == 'Quarter3'), #ww_11
(df['month'] == 3) & (df['quarter'] == 'Quarter4'), #ww_12
(df['month'] == 3) & (df['quarter'] == 'Quarter5'), #ww_12
]
values = ['ww1','ww2','ww3','ww4','ww4',
'ww5','ww6','ww7','ww8','ww8',
'ww9','ww_10','ww_11','ww_12','ww_12']
df['work_week'] = np.select(conditions, values)
Quater column to categorical dtypes¶
quarter_order = ['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'Quarter5' ]
df.quarter = pd.Categorical(df.quarter, quarter_order, ordered=True)
# verify quarter order
df.quarter.unique()
Day column to categorical dtypes¶
df.day.unique()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Saturday', 'Sunday' ]
df.day = pd.Categorical(df.day, day_order, ordered=True)
# verify day order
df.day.unique()
Inspect zero value (60% and up)¶
Let’s investigate this zero values.
# how many zeros are there in each columns?
# (df == 0).sum(axis=0) # uncomment to show result
print('Proportion of zeros: ')
(df == 0).sum(axis=0) / len(df) * 100
We have a very large proportion of zero value in wip
, incentive
, idle_time
, idle_men
, and no_of_style_change
.
wip
This 42% of zeros are came from finishing department alternately the remaining 58% are work in progress by the sewing department.incentive
50% or half of the total production days got incentiveidle_time
This large proportion is a good indication that the the production is running smoothly.idle_men
Same as idle_time, most of the teams/workers are working deligently. 98% are zero value.no_of_style_change
87.72% are zero values. Seems fair there is only 12% of change of product style that caused idle_time. Changing style might result to idle_time if not properly executed.
1. Does no_of_style_change produce idle_time?¶
Frequency of unique value in no_of_style_change
df.no_of_style_change.value_counts()
Proportion
import plotly.express as px
fig = px.pie(df, 'no_of_style_change')
fig.update_layout(title="Proportion of no_of_style_change")
fig.update_traces(textposition='inside',
textinfo='percent+label', showlegend=True)
fig.update_traces(pull= 0.02)
fig.show('png')
Correlation between no_of_style_change and idle_time.
from scipy.stats import pearsonr
corr, p = pearsonr(df.no_of_style_change, df.idle_time)
corr, p
There’s no significant correlation between no_of_style_change and idle_time. Also the pval is at very high value.
# visualize
# plt.figure(figsize=(1, 1))
g = sns.lmplot(x='no_of_style_change', y='idle_time',
data=df,
x_jitter=0.15,
y_jitter=0.15,
fit_reg=False,
scatter_kws={'alpha':0.5})
g.set(xticks=range(3))
g.set_xticklabels(df.no_of_style_change.unique())
g.set(ylim=(0,10))
plt.show()
- There is 147(12.28%) total number change in production
- Changing style can cause idle time up to 8 minutes
2. Targeted productivity vs Actual productivity¶
Note
Prioritizing the actual productivity, we group our dataset by work_week then we get the mean instead of median(we use median for heavily skew distribution). Checking the shape of its overall distribution(ww1-ww10) we can see a moderately skew shape(skewnew at -.81 kurtusis at .33). To be more precise we can check the distribution per work_week but I think it’s not necessary here.
target_actual = df.groupby('work_week').mean()
target_actual[['targeted_productivity', 'actual_productivity']]
Visualization
# setting variables
# target productivity average per workweek
tp_avg = df.groupby('work_week')['targeted_productivity'].mean()
# actual productivity average per workweek
ap_avg = df.groupby('work_week')['actual_productivity'].mean()
plt.figure(figsize=(10,5))
plt.plot(tp_avg.index, tp_avg, marker='o')
plt.plot(ap_avg.index, ap_avg, marker='o')
plt.xlabel('work weeks')
plt.ylabel('productivity score')
plt.legend(['targeted productivity', 'actual productivity'])
plt.show()
We have a very good ww4 and ww5 however after ww5 our productivity crashed to a very low level.
# Checking the reason behind the crash
idle = df.groupby('work_week').sum()
idle['idle_time']
We have a major downtime during ww5 that caused the fall of actual productivity.
4. How much over_time does production have per work_week? Also the number of workers?¶
# set variables
ww = df.groupby('work_week').sum() # we use groupby then sum
ww['cummulative'] = ww.over_time.cumsum()
ww[['over_time', 'cummulative', 'no_of_workers']]
- There are 642450 overtime minutes from work week1 with 4305 workers. An average of 2.48 OT hours per worker in ww1. (642450 / 4305 we will get 149 minutes then we divide by 60 to get 2.48 hours)
Visualization¶
Over time per work week and it’s cummulative value.
#
plt.figure(figsize=(14, 6))
# set variables
x = ww.index
y = ww.over_time
# plot1
# sum of overtime per work_week
ax1 = plt.subplot(1,2,1)
sns.barplot(x = x, y = y, data=df)
ax1.set_xticklabels(ax1.get_xticklabels(),rotation = 80)
plt.title('Overtime per Work Week ')
for i in ax1.containers:
ax1.bar_label(i,)
#plot2
# cumulative sum of work_week
ax2 = plt.subplot(1,2,2)
sns.lineplot(x=x, y=ww.over_time.cumsum(), marker='o', )
plt.title('Cumulative Overtime')
plt.tight_layout()
plt.show()
5. What the best team so far?¶
note:
Productivity is scaled from 0-1(0-100%)
- We will group by team then find the mean(avg) of it’s productivity.
Note
we get the mean instead of median. I’m prioritizing the actual productivity. Checkingthe shape of its distribution we can see a moderately skew(skewnew at -.81 kurtusis at .33). Unless it’s heavily skew then we can use the median.
team = df.groupby('team').mean() # groupby then get the mean
team['diff_in_productivity'] = team.actual_productivity - team.targeted_productivity
# limit to 2 decimal places
team = team.round(2)
team[['targeted_productivity', 'actual_productivity', 'diff_in_productivity']]
Top 3 Productive Team
team.actual_productivity.sort_values(ascending=False).head(3)
Visualization¶
#
x = team.index
y = team.actual_productivity
plt.figure(figsize=(8,5))
ax1 = plt.subplot()
sns.barplot(x=x, y=y, palette = 'deep')
plt.title('Productivity by Team')
for i in ax1.containers:
ax1.bar_label(i,)
plt.tight_layout()
plt.show()
Team productivity by department
team_dept = df.groupby(['team','department']).mean()
team_dept['diff_productivity'] = team_dept.actual_productivity - team_dept.targeted_productivity
team_dept = team_dept.round(2)
# Show first 3 teams
team_dept[['targeted_productivity', 'actual_productivity', 'diff_productivity']].head(6)
Visualization
# code preparation for visualization
team_act_prod = []
for i in team_dept.actual_productivity:
team_act_prod.append(i)
sewing = team_act_prod[1::2]
finishing= team_act_prod[0::2]
# team_dept is a 2d list
# example code for accessing 2d indeces
# team 2 finishing dept would be:
# team_dept.actual_productivity[2][0]
#
fig, ax = plt.subplots(figsize=(12,6))
x = np.arange(1,13)
width = 0.35 # the width of the bars
rects1 = ax.bar(x - width/2, finishing, width, label= 'finishing')
rects2 = ax.bar(x + width/2, sewing, width, label='sewing')
ax.set_xticks(x)
ax.set_xlabel('Teams')
ax.set_ylabel('Actual Productivity')
def autolabel(rects):
"""Attach a text label above each bar in *rects*, displaying its height."""
for rect in rects:
height = rect.get_height()
ax.annotate('{}'.format(height),
xy=(rect.get_x() + rect.get_width() / 2, height),
xytext=(0, 3), # 3 points vertical offset
textcoords="offset points",
ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)
plt.tight_layout()
plt.legend(loc='upper center')
plt.show()
Statistical Analysis¶
Descriptive¶
Univiriate Analysis¶
1.
What is the average and median overtime per work_week? .
ot_avg_ww = df.groupby('work_week').mean()['over_time']
ot_mdn_ww = df.groupby('work_week').median()['over_time']
ot_avg_ww, ot_mdn_ww
2.
Highest and lowest incentive? (except zero for lowest)
# highest incentive
inc_max = np.max(df.incentive)
inc_min = df.incentive[df.incentive !=0].min()
inc_max, inc_min
df.month.unique()
sns.boxplot(x='month', y='actual_productivity', data=df)
plt.show()
jan = df[df['month']==1]
feb = df[df['month']==2]
mar = df[df['month']==3]
plt.hist(jan.actual_productivity, alpha=.5, label='January')
plt.hist(feb.actual_productivity, alpha=.5, label='February')
plt.hist(mar.actual_productivity, alpha=.5, label='March')
plt.xlabel('Actual Productivity')
plt.ylabel('Count')
plt.legend()
plt.show()
feb.actual_productivity.max()
print(len(feb.actual_productivity[feb.actual_productivity >1]))
feb.actual_productivity[feb.actual_productivity >1].head(3)
# This are all the actual productivity higher than 1
# show first 5 out of 28
feb[feb.actual_productivity >1].head(5)
jan.actual_productivity.max()
Why theres actual productivy higher than 1(100)?
Biviriate¶
- What team has more overtime on average?
Outliers¶
#define functions
def showoutliers(df, column_name = ""):
iqr = df[column_name].quantile(.75) - df[column_name].quantile(.25)
# lower whisker
lowerbound = (df[column_name].quantile(.25)) - iqr * 1.5
# upper whisker
upperbound = (df[column_name].quantile(.75)) + iqr * 1.5
# dfpoints beyond lower whisker
lowerbound_outliers = df[df[column_name] < lowerbound]
# adtapoint beyond upper whisker
higherbound_outliers = df[df[column_name] > upperbound]
# outliers
outliers = pd.concat([lowerbound_outliers,higherbound_outliers])
return outliers
def countoutliers(df, column_name = ""):
iqr = df[column_name].quantile(.75) - df[column_name].quantile(.25)
lowerbound = (df[column_name].quantile(.25)) - iqr * 1.5
upperbound = (df[column_name].quantile(.75)) + iqr * 1.5
lowerbound_outliers = df[df[column_name] < lowerbound]
higherbound_outliers = df[df[column_name] > upperbound]
outliers = pd.concat([lowerbound_outliers,higherbound_outliers])
count = len(outliers)
return {column_name : count}
def Replace_Outliers(df_name, value, column_name = ""):
iqr = df_name[column_name].quantile(.75) - df_name[column_name].quantile(.25)
lowerbound = (df_name[column_name].quantile(.25)) - iqr * 1.5
upperbound = (df_name[column_name].quantile(.75)) + iqr * 1.5
df_name[column_name] = np.where(df_name[column_name] > upperbound, value, df_name[column_name])
df_name[column_name] = np.where(df_name[column_name] < lowerbound, value, df_name[column_name])
# create a dfset with only numeric values
df_n = df[['team', 'targeted_productivity', 'smv', 'wip', 'over_time', 'incentive', 'no_of_workers', 'actual_productivity']]
# df_n
# Count outliers
column_list = df_n.columns
column_list = np.array(column_list)
for i in column_list:
print (countoutliers(df_n, i))
# Proportion outliers
for i in column_list:
col = i
perc = countoutliers(df_n, i)[i] / len(df_n)
print (col + ': ' + str('{:.2f}'.format(perc*100)) + '%')
df_n.plot(kind='box',
subplots=True,
sharey=False,
figsize=(20, 7))
# increase spacing between subplots
plt.subplots_adjust(wspace=0.5)
There’s two columns that cought my attention. The ‘WIP’ and ‘Incentive’. These two has the most skewed and has high kurtosis values. Let’s focus on these two variables and do some imputations.
WIP¶
Replace WIP outliers(2.84%) with 75 percentile of the distribution.
df.wip.describe()
Replace_Outliers(df, 1218, 'wip')
# Shewness
print(df.wip.skew())
# Kurtusis
df.wip.kurt()
Incentive¶
df.incentive.describe()
sns.histplot(x='incentive', df=df)
plt.xlim(0,250)
df.incentive.value_counts()
corr, p = pearsonr(df.actual_productivity, df.incentive)
print(corr, p)
sns.regplot(x = df.actual_productivity, y = df.incentive,
scatter_kws={'alpha':0.4})