Garments Production (Work in Progress)


This dataset captures key attributes of the garment manufacturing process and the productivity of its employees. It was collected manually and validated by industry experts.


Attribute Information:

  1. date : Date in MM-DD-YYYY format
  2. day : Day of the week
  3. quarter : A portion of the month; each month is divided into four quarters
  4. department : Department associated with the instance
  5. team : Team number associated with the instance
  6. no_of_workers : Number of workers in each team
  7. no_of_style_change : Number of changes in the style of a particular product
  8. targeted_productivity : Targeted productivity set by the authority for each team for each day
  9. smv : Standard Minute Value, the time allocated for a task
  10. wip : Work in progress; the number of unfinished items for products
  11. over_time : Amount of overtime by each team, in minutes
  12. incentive : Amount of financial incentive (in BDT) that enables or motivates a particular course of action
  13. idle_time : Amount of time production was interrupted for various reasons
  14. idle_men : Number of workers who were idle due to a production interruption
  15. actual_productivity : Actual productivity delivered by the workers, scaled from 0-1

Load Libraries and Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr
In [2]:
df = pd.read_csv('garments_worker_productivity.csv')
print(df.shape)
df.head()
(1197, 15)
Out[2]:
date quarter department day team targeted_productivity smv wip over_time incentive idle_time idle_men no_of_style_change no_of_workers actual_productivity
0 1/1/2015 Quarter1 sweing Thursday 8 0.80 26.16 1108.0 7080 98 0.0 0 0 59.0 0.940725
1 1/1/2015 Quarter1 finishing Thursday 1 0.75 3.94 NaN 960 0 0.0 0 0 8.0 0.886500
2 1/1/2015 Quarter1 sweing Thursday 11 0.80 11.41 968.0 3660 50 0.0 0 0 30.5 0.800570
3 1/1/2015 Quarter1 sweing Thursday 12 0.80 11.41 968.0 3660 50 0.0 0 0 30.5 0.800570
4 1/1/2015 Quarter1 sweing Thursday 6 0.80 25.90 1170.0 1920 50 0.0 0 0 56.0 0.800382
In [133]:
len(df[df['over_time']==0])
Out[133]:
31

EDA

Inspect, clean, and validate the data, then surface underlying patterns and relationships.

Dataset general information

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   1197 non-null   object 
 1   quarter                1197 non-null   object 
 2   department             1197 non-null   object 
 3   day                    1197 non-null   object 
 4   team                   1197 non-null   int64  
 5   targeted_productivity  1197 non-null   float64
 6   smv                    1197 non-null   float64
 7   wip                    691 non-null    float64
 8   over_time              1197 non-null   int64  
 9   incentive              1197 non-null   int64  
 10  idle_time              1197 non-null   float64
 11  idle_men               1197 non-null   int64  
 12  no_of_style_change     1197 non-null   int64  
 13  no_of_workers          1197 non-null   float64
 14  actual_productivity    1197 non-null   float64
dtypes: float64(6), int64(5), object(4)
memory usage: 140.4+ KB

Fix typos in the department column and convert it to a categorical dtype

In [4]:
df.department.value_counts()
Out[4]:
sweing        691
finishing     257
finishing     249
Name: department, dtype: int64
In [5]:
df['department'] = df['department'].replace(['sweing'], 'sewing')
df['department'] = df['department'].replace(['finishing '], 'finishing')
df['department'] = df.department.astype('category')

df.department.value_counts()
Out[5]:
sewing       691
finishing    506
Name: department, dtype: int64
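An equivalent, slightly more defensive cleanup strips stray whitespace before fixing spellings; the trailing space in 'finishing ' is otherwise easy to miss. A sketch on a hypothetical toy Series:

```python
import pandas as pd

# Toy stand-in for the raw department column, containing the
# misspelling 'sweing' and a trailing space in 'finishing '.
raw = pd.Series(['sweing', 'finishing ', 'finishing'], name='department')

# Strip whitespace first, then fix the known misspelling in one pass.
clean = raw.str.strip().replace({'sweing': 'sewing'}).astype('category')
print(clean.value_counts())
```

This collapses 'finishing ' and 'finishing' without having to spot the trailing space by eye.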

Duplicates

In [6]:
# Boolean Series: True for rows that are exact duplicates of an earlier row
dups = df.duplicated()
# count duplicates
dups.value_counts()
Out[6]:
False    1197
dtype: int64

There are no duplicates.

Clean the Data (Null Values)

Check for the frequency of nulls in all the columns

In [7]:
df.isna().sum()
Out[7]:
date                       0
quarter                    0
department                 0
day                        0
team                       0
targeted_productivity      0
smv                        0
wip                      506
over_time                  0
incentive                  0
idle_time                  0
idle_men                   0
no_of_style_change         0
no_of_workers              0
actual_productivity        0
dtype: int64

A large proportion (42%) of the wip values are null. The first step is to investigate how this data was collected and structured.

We need to understand what type of missing data we are dealing with:

  • Structurally missing data
  • Missing Completely at Random (MCAR)
  • Missing at Random (MAR)
  • Missing Not at Random (MNAR)
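A quick way to tell these cases apart is to cross-tabulate the nullness indicator against candidate explanatory columns. A sketch on a hypothetical mini-frame (in the notebook itself, run the same crosstab on df):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the situation: wip is null exactly when
# department is 'finishing'. (Hypothetical data for illustration.)
demo = pd.DataFrame({
    'department': ['sewing', 'sewing', 'finishing', 'finishing'],
    'wip': [1108.0, 968.0, np.nan, np.nan],
})

# A perfect split like this one points toward structurally missing
# data rather than MCAR: the nulls are fully explained by department.
print(pd.crosstab(demo['department'], demo['wip'].isna()))
```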

Check the shapes of the finishing and sewing departments separately.

In [8]:
finishing = df[df['department']=='finishing']
sewing = df[df['department']=='sewing']

print(finishing.shape)
sewing.shape
(506, 15)
Out[8]:
(691, 15)

Check the total number of nulls in wip.

In [9]:
finishing.wip.isna().sum()
Out[9]:
506

All of the NaN entries come from the finishing department. Ideally we would now verify with the data owners that there really is no pending work (wip) in the finishing department, just to be sure. Since there is no one to ask for this project, I will conclude that this missing data is Structurally Missing: there is no wip in the finishing department.

Now we will impute these NaN values. There are many imputation techniques (e.g., time-series LOCF and NOCB, pairwise or listwise deletion); in this scenario we will convert the NaNs to zero.

In [10]:
# Convert wip NaN values to zero
# (assign back rather than using chained inplace fillna, which can fail
# to modify df and triggers warnings in recent pandas versions)
df['wip'] = df['wip'].fillna(0)
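For reference, the time-series techniques mentioned above (LOCF and NOCB) map directly onto pandas' ffill and bfill. They are not appropriate for wip, whose nulls are structural, but a sketch on toy data shows the mechanics:

```python
import pandas as pd
import numpy as np

# LOCF = last observation carried forward; NOCB = next observation
# carried backward. Toy series with an interior gap:
s = pd.Series([10.0, np.nan, np.nan, 40.0])

locf = s.ffill()   # carries 10.0 forward into the gap
nocb = s.bfill()   # carries 40.0 backward into the gap
print(locf.tolist())  # [10.0, 10.0, 10.0, 40.0]
print(nocb.tolist())  # [10.0, 40.0, 40.0, 40.0]
```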

Visualize the null datapoints

We convert all nulls to 0 before plotting because matplotlib does not render null values; if we plotted without converting, the nulls would simply not appear in the plot.

In [11]:
# Code preparation for plotting.
# Map department to 0 and 1 as integers
df['dept_mapped'] = df.department.map({
                                'finishing':0,
                                'sewing':1})
# convert to int dtype
df.dept_mapped  = df.dept_mapped.astype('int64')
In [12]:
# plt.figure(figsize=(1, 1)) 
g = sns.lmplot(x='dept_mapped', y='wip', 
               data=df, 
               x_jitter=0.15, 
               y_jitter=0.15, 
               fit_reg=False, 
               scatter_kws={'alpha':0.2})

g.set(xticks=range(2))
g.set_xticklabels(['finishing', 'sewing'])
plt.show()

Convert date to datetime dtype

In [13]:
# date format is MM-DD-YYYY
df['date'] = pd.to_datetime(df.date, format="%m/%d/%Y")

Augment the data with additional columns (month and work_week)

In [14]:
# add month
df['month'] = df['date'].dt.month
In [15]:
# Add a `work_week` column; it's easier to analyze production data per work week.
# create a list of our conditions
conditions = [
    
    #January
    (df['month'] == 1) & (df['quarter'] == 'Quarter1'), #ww1
    (df['month'] == 1) & (df['quarter'] == 'Quarter2'), #ww2
    (df['month'] == 1) & (df['quarter'] == 'Quarter3'), #ww3
    (df['month'] == 1) & (df['quarter'] == 'Quarter4'), #ww4
    (df['month'] == 1) & (df['quarter'] == 'Quarter5'), #ww4
    
    #February
    (df['month'] == 2) & (df['quarter'] == 'Quarter1'), #ww5
    (df['month'] == 2) & (df['quarter'] == 'Quarter2'), #ww6
    (df['month'] == 2) & (df['quarter'] == 'Quarter3'), #ww7
    (df['month'] == 2) & (df['quarter'] == 'Quarter4'), #ww8
    (df['month'] == 2) & (df['quarter'] == 'Quarter5'), #ww8
    
    #March
    (df['month'] == 3) & (df['quarter'] == 'Quarter1'), #ww9
    (df['month'] == 3) & (df['quarter'] == 'Quarter2'), #ww_10
    (df['month'] == 3) & (df['quarter'] == 'Quarter3'), #ww_11
    (df['month'] == 3) & (df['quarter'] == 'Quarter4'), #ww_12
    (df['month'] == 3) & (df['quarter'] == 'Quarter5'), #ww_12
    ]

values = ['ww1','ww2','ww3','ww4','ww4',
          'ww5','ww6','ww7','ww8','ww8',
          'ww9','ww_10','ww_11','ww_12','ww_12']

df['work_week'] = np.select(conditions, values)
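The fifteen conditions above can also be derived arithmetically. A more compact alternative sketch (assumes quarter strings 'Quarter1'..'Quarter5', with Quarter5 folded into the month's 4th week, as in the conditions above):

```python
import pandas as pd

# Hypothetical mini-frame with the same month/quarter structure.
demo = pd.DataFrame({
    'month': [1, 1, 2, 3, 3],
    'quarter': ['Quarter1', 'Quarter5', 'Quarter2', 'Quarter2', 'Quarter4'],
})

# Week-within-month from the quarter digit; Quarter5 folds into week 4.
q = demo['quarter'].str[-1].astype(int).clip(upper=4)
ww = (demo['month'] - 1) * 4 + q

# Mirror the notebook's labels: 'ww1'..'ww9', then 'ww_10'..'ww_12'
# (the underscore keeps plain string sorting in numeric order).
demo['work_week'] = ww.map(lambda n: f'ww{n}' if n < 10 else f'ww_{n}')
print(demo['work_week'].tolist())
```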

Convert the quarter column to a categorical dtype

In [16]:
quarter_order = ['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'Quarter5' ]
df.quarter = pd.Categorical(df.quarter, quarter_order, ordered=True)
# verify quarter order
df.quarter.unique()
Out[16]:
['Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'Quarter5']
Categories (5, object): ['Quarter1' < 'Quarter2' < 'Quarter3' < 'Quarter4' < 'Quarter5']

Convert the day column to a categorical dtype

In [17]:
df.day.unique()
Out[17]:
array(['Thursday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday'],
      dtype=object)
In [18]:
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Saturday', 'Sunday' ]
df.day = pd.Categorical(df.day, day_order, ordered=True)
# verify day order
df.day.unique()
Out[18]:
['Thursday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday']
Categories (6, object): ['Monday' < 'Tuesday' < 'Wednesday' < 'Thursday' < 'Saturday' < 'Sunday']

Inspect zero values (60% and up)

Let's investigate these zero values.

In [19]:
# how many zeros are there in each column?
# (df == 0).sum(axis=0) # uncomment to show result

print('Proportion of zeros: ')
(df == 0).sum(axis=0)  / len(df) * 100
Proportion of zeros: 
Out[19]:
date                      0.000000
quarter                   0.000000
department                0.000000
day                       0.000000
team                      0.000000
targeted_productivity     0.000000
smv                       0.000000
wip                      42.272348
over_time                 2.589808
incentive                50.459482
idle_time                98.496241
idle_men                 98.496241
no_of_style_change       87.719298
no_of_workers             0.000000
actual_productivity       0.000000
dept_mapped              42.272348
month                     0.000000
work_week                 0.000000
dtype: float64

We have very large proportions of zero values in wip, incentive, idle_time, idle_men, and no_of_style_change.

  • wip The 42% zeros come from the finishing department; the remaining 58% of rows are work in progress in the sewing department.
  • incentive Half (50%) of the records received no incentive.
  • idle_time The large proportion of zeros is a good indication that production runs smoothly.
  • idle_men Same as idle_time: 98% zeros means the teams/workers were rarely idle.
  • no_of_style_change 87.72% zeros; only about 12% of records involve a style change. Changing styles can cause idle_time if not properly executed.

Feature Distribution

After doing some cleaning and data wrangling, we can now plot the histogram of the dataset for inspection.

In [20]:
# we will not include date in this visualization
df.drop(columns=['date']).hist(figsize=(10, 7))
plt.tight_layout()
plt.show()

Skewness

Skewness is a measure of the asymmetry of a distribution. Rules of thumb:

  • between -0.5 and 0.5, fairly symmetrical
  • between -1 and -0.5 or between 0.5 and 1, moderately skewed
  • less than -1 or greater than 1, highly skewed
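A quick illustration of these bands on simulated data (hypothetical samples with a fixed seed; an exponential distribution has theoretical skewness 2):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
symmetric = pd.Series(rng.normal(size=100_000))      # bell-shaped sample
right_skewed = pd.Series(rng.exponential(size=100_000))  # long right tail

print(round(symmetric.skew(), 2))     # near 0: fairly symmetrical
print(round(right_skewed.skew(), 2))  # well above 1: highly skewed
```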
In [21]:
# select numeric columns only
df_numeric = df.select_dtypes(include=np.number).columns.tolist()
for i in df_numeric:
    print(i + ' ==> ' + str(round(df[i].skew(),2)))
team ==> 0.01
targeted_productivity ==> -2.14
smv ==> 0.41
wip ==> 10.85
over_time ==> 0.67
incentive ==> 15.79
idle_time ==> 20.55
idle_men ==> 9.86
no_of_style_change ==> 2.94
no_of_workers ==> -0.11
actual_productivity ==> -0.81
dept_mapped ==> -0.31
month ==> 0.49

Kurtosis

Kurtosis measures the heaviness of a distribution's tails, which indicates the prevalence of outliers.

  • A tall, thin distribution is called leptokurtic (kurtosis > 3). Values in a leptokurtic distribution cluster near the mean or at the extremes.
  • A flat distribution whose values are moderately spread out is called platykurtic (kurtosis < 3).
  • A distribution between the two is called mesokurtic (kurtosis = 3); it looks close to a normal distribution.

Note

  1. High kurtosis indicates that the data has heavy outliers.
  2. Low kurtosis indicates that the data lacks outliers.
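One caveat: pandas' .kurt() reports excess kurtosis (Fisher's definition), under which a normal distribution scores about 0 rather than 3, so the values below should be compared against 0. A quick check on simulated data (hypothetical sample, fixed seed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=200_000))

# Excess kurtosis of a normal sample is close to 0, not 3.
print(round(normal_sample.kurt(), 2))
```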
In [22]:
df.select_dtypes(include=np.number).kurt()
Out[22]:
team                      -1.223906
targeted_productivity      5.613701
smv                       -0.795346
wip                      141.332609
over_time                  0.424364
incentive                299.032462
idle_time                442.638160
idle_men                 102.962869
no_of_style_change         8.181490
no_of_workers             -1.788108
actual_productivity        0.333227
dept_mapped               -1.905055
month                     -1.057430
dtype: float64

Statistical Analysis

1. team

Fairly distributed and flat. We will group the dataset by team and then take the average productivity.

  • What's the best team so far?

note:

  • Productivity is scaled from 0-1 (0-100%)
  • The mean is preferable to the median when the distribution is fairly symmetrical.
In [23]:
team = df.groupby('team').mean(numeric_only=True) # group by team, then take the mean of numeric columns
team['diff_in_productivity'] = team.actual_productivity - team.targeted_productivity

# limit to 2 decimal places
team = team.round(2)

team[['targeted_productivity', 'actual_productivity', 'diff_in_productivity']]
Out[23]:
targeted_productivity actual_productivity diff_in_productivity
team
1 0.75 0.82 0.07
2 0.74 0.77 0.03
3 0.74 0.80 0.06
4 0.72 0.77 0.05
5 0.67 0.70 0.02
6 0.73 0.69 -0.05
7 0.71 0.67 -0.05
8 0.71 0.67 -0.03
9 0.76 0.73 -0.02
10 0.74 0.72 -0.02
11 0.70 0.68 -0.02
12 0.77 0.78 0.00

Top 3 Productive Team

In [24]:
team.actual_productivity.sort_values(ascending=False).head(3)
Out[24]:
team
1     0.82
3     0.80
12    0.78
Name: actual_productivity, dtype: float64

Visualization

In [25]:
#
x = team.index
y = team.actual_productivity

plt.figure(figsize=(8,5))
ax1 = plt.subplot()
sns.barplot(x=x, y=y, palette = 'deep')
plt.title('Productivity by Team')

for i in ax1.containers:
    ax1.bar_label(i,)

plt.tight_layout()    
plt.show()

Team productivity by department

In [26]:
team_dept = df.groupby(['team','department']).mean(numeric_only=True)
team_dept['diff_productivity'] = team_dept.actual_productivity - team_dept.targeted_productivity
team_dept = team_dept.round(2)

# Show first 3 teams
team_dept[['targeted_productivity', 'actual_productivity', 'diff_productivity']].head(6)
Out[26]:
targeted_productivity actual_productivity diff_productivity
team department
1 finishing 0.75 0.83 0.08
sewing 0.74 0.82 0.07
2 finishing 0.75 0.78 0.03
sewing 0.73 0.76 0.03
3 finishing 0.74 0.85 0.11
sewing 0.74 0.78 0.03

Visualization

In [27]:
# code preparation for visualization
# actual productivity per (team, department), flattened to a list
team_act_prod = list(team_dept.actual_productivity)

# rows alternate finishing/sewing, so slice them apart
finishing = team_act_prod[0::2]
sewing = team_act_prod[1::2]

# team_dept has a (team, department) MultiIndex;
# e.g., team 2's finishing value would be:
# team_dept.actual_productivity.loc[(2, 'finishing')]
In [28]:
#
fig, ax = plt.subplots(figsize=(12,6))

x = np.arange(1,13)
width = 0.35  # the width of the bars

rects1 = ax.bar(x - width/2, finishing, width, label= 'finishing')
rects2 = ax.bar(x + width/2, sewing, width, label='sewing')
ax.set_xticks(x)
ax.set_xlabel('Teams')
ax.set_ylabel('Actual Productivity')

def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.title('Actual Productivity by Team Department')
autolabel(rects1)
autolabel(rects2)
plt.tight_layout()
plt.legend(loc='upper center')
plt.show()

Hypothesis

Two-sample t-test

Compare team 1 and team 3, since their productivity scores are very close.

  • null: There is no significant difference between team1 and team3 with respect to productivity.
  • alt: There is a significant difference between team1 and team3.
In [121]:
from scipy.stats import ttest_ind
In [126]:
team1 = df[df.team == 1]
team3 = df[df.team == 3]
In [127]:
plt.hist(team1.actual_productivity, alpha=0.5, label='team1')
plt.hist(team3.actual_productivity,alpha=0.5, label='team3')
plt.legend()
plt.show()
In [129]:
# Check whether the standard deviations are roughly equal;
# a ratio between 0.9 and 1.1 should suffice
ratio = np.std(team1.actual_productivity) / np.std(team3.actual_productivity)
ratio
Out[129]:
1.0846997401445262
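The std ratio is an informal check; Levene's test gives a formal test of equal variances. A sketch on hypothetical simulated samples (in the notebook, pass team1.actual_productivity and team3.actual_productivity instead):

```python
import numpy as np
from scipy.stats import levene

# Simulated stand-ins for the two teams' productivity scores
# (hypothetical means/scales chosen to resemble the data).
rng = np.random.default_rng(42)
a = rng.normal(loc=0.82, scale=0.10, size=100)
b = rng.normal(loc=0.80, scale=0.11, size=100)

# Null hypothesis: both groups have equal variances. A large p-value
# means no evidence against the equal-variance assumption of the
# standard two-sample t-test.
stat, p = levene(a, b)
print(round(p, 3))
```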
In [131]:
#run the t-test here:
tstat, pval =ttest_ind(team1.actual_productivity, team3.actual_productivity)
pval
Out[131]:
0.41425379288460873

With p ≈ 0.41 > 0.05, we fail to reject the null: there is no significant difference in productivity between team1 and team3.

2. targeted_productivity

Heavily left skewed. This is good: targeted productivity should mostly take high values.

  • Predicted vs. Actual Productivity

Note

  • We group the dataset by work_week and then take the mean. Alternatively, we could use the median if a particular work_week's distribution were heavily skewed. For simplicity we use the mean; I already checked the overall distributions (ww1-ww10) and they are fairly bell-shaped.
In [29]:
print('Average Productivity per work_week')
target_actual = df.groupby('work_week').mean(numeric_only=True)
target_actual[['targeted_productivity', 'actual_productivity']]
Average Productivity per work_week
Out[29]:
targeted_productivity actual_productivity
work_week
ww1 0.774603 0.779225
ww2 0.746617 0.749937
ww3 0.715625 0.717431
ww4 0.704678 0.767010
ww5 0.737345 0.770341
ww6 0.723874 0.741566
ww7 0.727551 0.690275
ww8 0.729752 0.669766
ww9 0.718595 0.705212
ww_10 0.720879 0.737224

Visualization

In [30]:
# setting variables
# target productivity average per workweek
tp_avg = df.groupby('work_week')['targeted_productivity'].mean()
# actual productivity average per workweek
ap_avg = df.groupby('work_week')['actual_productivity'].mean()
In [31]:
plt.figure(figsize=(10,5))

plt.plot(tp_avg.index, tp_avg, marker='o')
plt.plot(ap_avg.index, ap_avg, marker='o')

plt.xlabel('work weeks')
plt.ylabel('productivity score')
plt.legend(['targeted productivity', 'actual productivity'])

plt.show()

ww4 and ww5 were very good; after ww5, however, productivity crashed to a very low level.

In [32]:
# Checking the reason behind the crash
idle = df.groupby('work_week').sum(numeric_only=True)
idle['idle_time']
Out[32]:
work_week
ww1        0.0
ww2        0.0
ww3        0.0
ww4        0.0
ww5      810.0
ww6        0.0
ww7       38.0
ww8        8.0
ww9       18.0
ww_10      0.0
Name: idle_time, dtype: float64

A major downtime during ww5 contributed to the fall in actual productivity.

3. smv

Distribution is fairly symmetrical.

  • Check whether other variables correlate with smv (e.g., does smv correlate with over_time?)
In [33]:
def correlation_checker(col):
    plt.figure(figsize=(6, 6)) 
    corr_matrix = df.corr(numeric_only=True)  # numeric columns only
 
    corr_target = corr_matrix[[col]].drop(labels=[col])
    corr_target = corr_target.sort_values(by=[col], ascending=False)

    sns.heatmap(corr_target, annot=True, fmt='.3', cmap='RdBu_r')
    plt.show()
In [34]:
correlation_checker('smv')

Three features are highly correlated with smv. I'm most interested in no_of_workers and over_time, so let's scatter-plot these features against smv.

In [35]:
df_pairplot = df[['smv', 'no_of_workers', 'over_time' ]]
In [36]:
sns.pairplot(df_pairplot)
plt.show()