IMDB¶
IMDb is the world’s most popular and authoritative source for movie, TV, and celebrity content. Find ratings and reviews for the newest movies and TV shows.
Data Source: IMDB Movies Dataset
Objectives¶
- Perform feature engineering: clean, wrangle, and tidy the data, then save the new dataset to a .csv file. The new file will be used in machine learning models, KNN and Decision Tree.
Data Dictionary:
- Poster_Link – link of the poster that IMDb uses
- Series_Title – name of the movie
- Released_Year – year in which the movie was released
- Certificate – certificate earned by the movie
- Runtime – total runtime of the movie
- Genre – genre of the movie
- IMDB_Rating – rating of the movie on the IMDB site
- Overview – mini story / summary
- Meta_score – score earned by the movie
- Director – name of the director
- Star1, Star2, Star3, Star4 – names of the stars
- No_of_Votes – total number of votes
- Gross – money earned by the movie
Feature Engineering¶
Content:
- Import Packages and Load Data
- Converting to Appropriate Data Types
- Check for Null Values (NaN and zero values; impute if necessary)
- Outliers (impute if necessary)
- Conclusion
1. Import Packages and Load Data¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
Show initial dataset¶
# load the dataset
df = pd.read_csv('imdb_top_1000.csv')
print(df.shape)
df.head(2)
# Inspecting columns
df.columns
Drop unnecessary columns¶
df = df.drop(columns=['Poster_Link', 'Overview'])
2. Converting to Appropriate Data Types¶
df.info()
We will convert the Released_Year, Runtime, and Gross columns to integer data types.
Released_Year to Integer¶
# show the unique values in 'Released_Year'
df['Released_Year'].unique()
Only one row has a non-numeric value, ‘PG’, so let’s look at it. We need to convert this ‘PG’ to an integer.
# Show detailed information
df[df['Released_Year'].isin(['PG'])]
After researching the movie Apollo 13, we found that its release date was November 15, 1995. We will replace ‘PG’ with the year 1995.
# replace 'PG' with 1995 and convert to integer
df['Released_Year'] = df['Released_Year'].replace(['PG'] , 1995).astype(int)
df.iloc[[966]]
Runtime to Integer¶
df.Runtime.unique()[0:10]
Before converting to integer, we need to remove the trailing string ‘min’ from each observation.
# remove 'min' keyword and convert to int.
df['Runtime'] = df['Runtime'].str.rstrip('min').astype('int')
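Note that str.rstrip('min') strips any trailing characters drawn from the set {'m', 'i', 'n'} rather than the literal word ‘min’ (the leftover space is tolerated by the integer cast). That is safe for this column, but a more defensive sketch would extract the digits with a regex; the raw_runtime sample below is purely illustrative:
# sketch: pull out the digits directly instead of stripping characters
raw_runtime = pd.Series(['142 min', '175 min'])  # illustrative sample
runtime_int = raw_runtime.str.extract(r'(\d+)', expand=False).astype(int)
print(runtime_int.tolist())  # [142, 175]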
Gross to Integer¶
# remove commas; cast to float for now because the column still contains NaN
df['Gross'] = df['Gross'].str.replace(',', '').astype('float')
Final check of the data types
df.info()
3. Check for Null Values¶
df.isna().sum()
We have null values in the Certificate, Meta_score, and Gross columns. Let’s check them and replace the nulls with an aggregate value.
Certificate null values¶
Note: Certificates are audience restrictions for movies:
- U: unrestricted, suitable for anyone
- A: adults only, indicating films high in violence or mature content that should not be marketed to teenagers
# unique values
df.Certificate.unique()
Since we don’t know the certificates of these movies and we don’t want to replace them with the most frequent value, we will just tag them as Unrated. Maybe they are null because these movies have not been rated yet; some further investigation is needed for this scenario.
# Replace nan with 'Unrated'
df.Certificate = df.Certificate.fillna('Unrated')
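As a quick check (a small sketch, not essential to the pipeline), we can confirm that no NaN remains and that ‘Unrated’ now appears as a category:
# confirm the fill worked: no NaN should remain in Certificate
print(df.Certificate.isna().sum())             # expect 0
print(df.Certificate.value_counts().head(10))  # 'Unrated' should now appear among the categories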
Meta_score null values¶
df.Meta_score.unique()
There are 157 NaN values in Meta_score. We can replace them with the median or the mean; for now, let’s use the mean value.
df.Meta_score = round(df.Meta_score.fillna((df.Meta_score.mean())))
df.Meta_score.unique()
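For comparison, the median imputation mentioned above would have been a one-line change; a sketch (not applied here, since the nulls are already filled with the mean):
# sketch: median as an alternative imputation value (not applied)
meta_median = df.Meta_score.median()
print(f'Median Meta_score: {meta_median}')
# df.Meta_score = round(df.Meta_score.fillna(meta_median))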
Gross null values¶
There are 169 null values in Gross. From general knowledge we know that gross is an important feature, and with more than 5% of rows affected we can’t safely delete them.
We have two methods we can apply to replace these nulls:
- Method 1: replace with the mean of the total Gross.
- Method 2: replace with an aggregated value; in this case, the mean Gross of the movie’s IMDB_Rating category.
Both are coded below, but since we are preparing this dataset for KNN classification, it’s better to use method 2.
# replace NaN with zero as a placeholder
df['Gross'].fillna(0, inplace = True)
# change type to int
df['Gross'] = df['Gross'].astype(int)
# Method 1
# # replace the zero placeholders with the overall Gross mean (not applied)
# df['Gross'] = df['Gross'].replace([0], np.mean(df.Gross))
This subset-and-replace code is still a bit roundabout; bear with it for now so we can proceed with our analysis. A tidier replacement is sketched just after it. 😉
# Method 2
# For each distinct IMDB_Rating, take the subset of movies with that
# rating and replace zero Gross values with the subset's mean Gross
subsets = []
for rating in sorted(df['IMDB_Rating'].unique()):
    s = df[df['IMDB_Rating'] == rating].copy()
    s['Gross'] = s['Gross'].replace([0], np.mean(s.Gross))
    subsets.append(s)
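For reference, the planned tidier rewrite could be a two-line groupby/transform. This is only a sketch (df_alt is a hypothetical name), and the zero placeholders are included in each group mean, matching the loop above:
# sketch: per-rating mean imputation in one pass; preserves row order
rating_mean = df.groupby('IMDB_Rating')['Gross'].transform('mean')
df_alt = df.assign(Gross=df['Gross'].mask(df['Gross'].eq(0), rating_mean))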
Saving our new dataset to a new variable df_new¶
# concatenate all subsets
df_new = pd.concat(subsets, ignore_index=True, axis=0)
df_new.shape
Show information of our new dataframe
df_new.info()
Check for duplicates
# count duplicates
print(f'Number of duplicates: {df_new.duplicated().sum()}')
Our new dataset is almost ready; one final step is to check for outliers.
4. Outliers¶
There are many techniques for identifying outliers; the hardest part is deciding what to do with them. Unfortunately, there is no straightforward “best” solution for dealing with outliers (“no free lunch”): it depends on the severity of the outliers and the goals of the analysis.
Remember, sometimes leaving outliers in the data is acceptable, and other times they can negatively impact analysis and modeling, in which case they should be dealt with through feature engineering.
# define functions
def showoutliers(df, column_name=""):
    iqr = df[column_name].quantile(.75) - df[column_name].quantile(.25)
    # lower whisker
    lowerbound = df[column_name].quantile(.25) - iqr * 1.5
    # upper whisker
    upperbound = df[column_name].quantile(.75) + iqr * 1.5
    # datapoints beyond the lower whisker
    lowerbound_outliers = df[df[column_name] < lowerbound]
    # datapoints beyond the upper whisker
    higherbound_outliers = df[df[column_name] > upperbound]
    # outliers
    outliers = pd.concat([lowerbound_outliers, higherbound_outliers])
    return outliers

def countoutliers(df, column_name=""):
    iqr = df[column_name].quantile(.75) - df[column_name].quantile(.25)
    lowerbound = df[column_name].quantile(.25) - iqr * 1.5
    upperbound = df[column_name].quantile(.75) + iqr * 1.5
    lowerbound_outliers = df[df[column_name] < lowerbound]
    higherbound_outliers = df[df[column_name] > upperbound]
    outliers = pd.concat([lowerbound_outliers, higherbound_outliers])
    return {column_name: len(outliers)}

def Replace_Outliers(df_name, value, column_name=""):
    iqr = df_name[column_name].quantile(.75) - df_name[column_name].quantile(.25)
    lowerbound = df_name[column_name].quantile(.25) - iqr * 1.5
    upperbound = df_name[column_name].quantile(.75) + iqr * 1.5
    df_name[column_name] = np.where(df_name[column_name] > upperbound, value, df_name[column_name])
    df_name[column_name] = np.where(df_name[column_name] < lowerbound, value, df_name[column_name])
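As a quick sanity check of these helpers (a sketch, not part of the original flow), we can inspect the rows flagged for a single column:
# sketch: peek at the Runtime values flagged by the IQR fences
runtime_outliers = showoutliers(df_new, 'Runtime')
print(f'{len(runtime_outliers)} Runtime outliers')
print(runtime_outliers[['Series_Title', 'Runtime']].head())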
# create a dataset with only numeric values
df_n = df_new.select_dtypes(include=np.number)
Number of outliers per column¶
column_list = np.array(df_n.columns)
for i in column_list:
    print(countoutliers(df_n, i))
Outliers percentage¶
We will likely replace outliers that make up a very low proportion of our dataset.
for i in column_list:
    perc = countoutliers(df_n, i)[i] / len(df_n)
    print(f'{i}: {perc * 100:.2f}%')
Visualize outliers in boxplot¶
df_n.plot(kind='box',
          subplots=True,
          sharey=False,
          figsize=(20, 7))
# increase spacing between subplots
plt.subplots_adjust(wspace=0.5)
Summary Statistics¶
df_n.describe()
Now that we have all the necessary details about the outliers, and since we are preparing this dataset for our machine learning model, we will handle them as follows:
- Released_Year: replace with the 25th percentile
- Runtime: retain
- IMDB_Rating: retain
- Meta_score: retain
- No_of_Votes: retain
- Gross: retain
Released_Year Outliers¶
# replace Released_Year outliers with the 25th percentile
Replace_Outliers(df_new,
                 df_new['Released_Year'].quantile(0.25),
                 'Released_Year')
# uncomment to replace outliers in other columns
# Replace_Outliers(df_new, 119, 'Runtime')
# Replace_Outliers(df_new, 7.9, 'IMDB_Rating')
# Replace_Outliers(df_new, 78, 'Meta_score')
# Replace_Outliers(df_new, df_new.No_of_Votes.mean(), 'No_of_Votes')
# Replace_Outliers(df_new, df_new.Gross.mean(), 'Gross')
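We can verify the replacement by re-counting with the helper defined earlier; note that the fences are recomputed on the modified column, so the count may not drop exactly to zero:
# re-count Released_Year outliers after the replacement
print(countoutliers(df_new, 'Released_Year'))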
Correlations Before and After Replacing Outliers Value¶
# correlations before replacing outliers (df_n was taken before the replacement)
plt.figure(figsize=(12, 8))
corr_matrix = df_n.corr(method='pearson')
sns.heatmap(corr_matrix, annot=True)
# correlations after replacing outliers (numeric columns of df_new)
plt.figure(figsize=(12, 8))
corr_matrix = df_new.select_dtypes(include=np.number).corr(method='pearson')
sns.heatmap(corr_matrix, annot=True)
Replacing all of the outliers had a negative impact on the correlations, so we decided to leave most of our data as is.
Saving our new df_new dataset to a .csv file¶
# df_new.to_csv("imdb_top_1000_clean.csv", index=False)  # index=False avoids an extra unnamed column on re-read
# show cleaned dataset
df_clean = pd.read_csv('imdb_top_1000_clean.csv')
df_clean.head(3)
5. Conclusion¶
The focus of this analysis was to prepare our dataset for a machine learning model. We have done the following:
- Converted columns with wrong data types to appropriate types.
- Imputed null values.
- Imputed outliers.
Now we have a clean dataset saved as imdb_top_1000_clean.csv. We can now start our analysis (univariate, bivariate, multivariate), perform feature selection, and build a machine learning model.
-fin