Decision Tree¶
Admission to master’s degree programs (likely or unlikely?)
Data Source: Admission to Master’s Degree
About the Project:
A classification model (decision tree) that helps students shortlist universities based on their profiles. The predicted output gives them a fair idea of whether admission is likely or not.
About the Dataset:
The data contains features commonly used in determining admission to master’s degree programs, such as GRE scores, GPA, and letters of recommendation. The complete list of features is summarized below:
- GRE Score (out of 340)
- TOEFL Score (out of 120)
- University Rating (out of 5)
- SOP/Statement of Purpose (out of 5)
- LOR/Letter of Recommendation (out of 5)
- Undergraduate GPA (out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1)
Content:
- Import Libraries and Load Dataset
- Preliminary Decision Tree
- Evaluation
- Decision Tree Pruning
- Pruned Decision Tree
- Random Forest Implementation
- Conclusion
1. Import Libraries and Load Dataset¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, confusion_matrix
# note: plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn import tree
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("Admission_Predict.csv")
print(df.shape)
df.head(2)
# Clean column names
df.columns = df.columns.str.strip().str.replace(' ','_').str.lower()
df.head(1)
2. Preliminary Decision Tree¶
NOTE: scikit-learn’s decision tree does not support categorical features directly; they must be numerically encoded.
As a first step, we will create a binary class (1 = admission likely, 0 = admission unlikely) from the chance of admit: a value of 0.79 or higher is considered likely. The remaining data columns will be used as predictors.
# Define X and y
X = df.loc[:,'gre_score':'research']
y = df['chance_of_admit']>=.79
# Data split
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.3)
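As a quick optional check (not part of the original notebook), we can look at how balanced the new binary target is; this is what the root-node Gini impurity computed later reflects.
# Optional check: class balance of the binary target
print(y.value_counts(normalize=True))        # overall proportion of likely vs. unlikely
print(y_train.value_counts(normalize=True))  # training split, which drives the root-node impurity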
Note:
Decision tree splits are simple threshold comparisons on individual features, so normalization or feature scaling would not change the resulting tree.
Decision Tree Classifier Object¶
# parameters are all on default
dt = DecisionTreeClassifier(random_state=0)
# Fit training data
dt.fit(x_train, y_train)
Visualization¶
plt.figure(figsize=(15, 7.5))
tree.plot_tree(dt,
               feature_names=x_train.columns,
               max_depth=None,
               class_names=['unlikely admit', 'likely admit'],
               label='root',
               filled=True,
               rounded=True)
plt.show()
Questions
- How does the decision tree choose its root node?
- How was the split cgpa <= 8.845 determined?
The decision tree iterates through each feature and evaluates candidate splits by their Gini impurity. The split with the lowest weighted Gini impurity (equivalently, the highest information gain) becomes the root node.
cgpa is a continuous variable, which adds an extra complication: the split can occur at ANY observed value of cgpa. As a feature-engineering alternative, cgpa could be binned (e.g. into ‘pass’/‘fail’), which would simplify the search.
To verify, we will use the functions gini and info_gain defined below. Running gini(y_train) gives the same Gini impurity value printed at the root node of the tree, 0.461.
# Gini function
def gini(data):
    """Calculate the Gini impurity of a set of labels."""
    data = pd.Series(data)
    return 1 - sum(data.value_counts(normalize=True)**2)

gi = gini(y_train)
# gini impurity at root node
gi
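We can also read the root split directly from the fitted tree’s internal arrays as a cross-check (a small optional sketch; tree_.feature, tree_.threshold, and tree_.impurity are standard scikit-learn attributes, and index 0 is the root node):
# Cross-check: feature, threshold, and impurity stored at the root node of the fitted tree
root_feature = x_train.columns[dt.tree_.feature[0]]
root_threshold = dt.tree_.threshold[0]
root_impurity = dt.tree_.impurity[0]
print(root_feature, root_threshold, root_impurity)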
Next, we are going to verify how the split on cgpa was determined, i.e. where the 8.845 value came from. We will apply the info_gain function over ALL observed values of cgpa to compute the information gain for a split at each value.
The results are stored in a table and sorted, and voila, the top split value is cgpa <= 8.84 (the tree itself displays 8.845, the midpoint between this value and the next observed one). The same search is done for every other feature (and, for continuous ones, every value) to find the best split overall.
# Info_gain function
def info_gain(left, right, current_impurity):
    """Information gain from splitting the data into two branches.

    left, right: labels in the left branch and right branch, respectively
    current_impurity: impurity of the data before splitting into left and right branches
    """
    # weight for the gini score of the left branch
    w = float(len(left)) / (len(left) + len(right))
    return current_impurity - w * gini(left) - (1 - w) * gini(right)
# Information gain list
info_gain_list = []
for i in x_train.cgpa.unique():
    left = y_train[x_train.cgpa <= i]
    right = y_train[x_train.cgpa > i]
    # information gain for a split at this value ('info_gain' calls 'gini' internally)
    info_gain_list.append([i, info_gain(left, right, gi)])
# show info_gain_list
# print(info_gain_list)
# convert to dataframe for better viewing
ig_table = pd.DataFrame(info_gain_list, columns=['split_value', 'info_gain']).sort_values('info_gain',ascending=False)
ig_table.head(3)
To summarize, the Gini impurity at the root node is 0.461, and the best split is on cgpa at 8.84, the value with the highest information gain.
Visualizing split value vs. information gain¶
# Visualise information gain
plt.plot(ig_table['split_value'], ig_table['info_gain'],'o')
plt.plot(ig_table['split_value'].iloc[0], ig_table['info_gain'].iloc[0],'r*')
plt.xlabel('cgpa split value')
plt.ylabel('info gain')
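The same exhaustive search can be extended to every predictor to confirm that cgpa gives the best split overall. Below is a small optional sketch (not part of the original notebook) reusing the gini and info_gain helpers defined above:
# Best observed split per feature, searched over every unique value
best_splits = []
for col in x_train.columns:
    for val in x_train[col].unique():
        left = y_train[x_train[col] <= val]
        right = y_train[x_train[col] > val]
        best_splits.append([col, val, info_gain(left, right, gi)])

best_df = pd.DataFrame(best_splits, columns=['feature', 'split_value', 'info_gain'])
# keep the best split for each feature; the top row should match the root split of the tree
best_df.sort_values('info_gain', ascending=False).drop_duplicates('feature')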
3. Evaluation¶
Model Accuracy
y_pred = dt.predict(x_test)
# print(dt.score(x_test, y_test)) # .score is the same as accuracy_score
print(accuracy_score(y_test, y_pred))
# Confusion Matrix
model_matrix = confusion_matrix(y_test, y_pred)
print(model_matrix)
# Visualize
fig, ax = plt.subplots(figsize=(8,5))
# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
# code using plot_confusion_matrix (removed in scikit-learn 1.2; see the ConfusionMatrixDisplay alternative below)
# uncomment to show on older scikit-learn versions
# plot_confusion_matrix(dt, x_test, y_test, display_labels =['Unlikely Admit', 'Likely Admit'], cmap='Purples')
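On scikit-learn 1.0 and later, the same plot can be produced with ConfusionMatrixDisplay (a minimal sketch using the dt, x_test, and y_test defined above):
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix plotted directly from the fitted estimator
ConfusionMatrixDisplay.from_estimator(dt, x_test, y_test,
                                      display_labels=['Unlikely Admit', 'Likely Admit'],
                                      cmap='Purples')
plt.show()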
4. Decision Tree Pruning¶
Best CCP Alpha¶
pruning = dt.cost_complexity_pruning_path(x_train, y_train)
# pruning
ccp_alphas = pruning.ccp_alphas
impurities = pruning.impurities
# ccp_alphas
dt_list = []
for i in ccp_alphas:
    # random_state = 1 for checking
    dt = DecisionTreeClassifier(random_state=1, ccp_alpha=i)
    dt.fit(x_train, y_train)
    dt_list.append(dt)
# dt_list
train_score = [dt.score(x_train, y_train) for dt in dt_list]
test_score = [dt.score(x_test, y_test) for dt in dt_list]
plt.figure(figsize=(15, 5))
ax1 = plt.subplot(1,2,1)
plt.plot(ccp_alphas, train_score, marker='o', label='Train Score', drawstyle='steps-post')
plt.plot(ccp_alphas, test_score, marker='o', label='Test Score', drawstyle='steps-post')
ax1.set_title('CCP Alphas vs. Accuracy for Training and Testing Dataset')
ax1.set_xlabel('CCP Alphas')
ax1.set_ylabel('Accuracy')
ax1.legend()
#### ZOOM IN
ax2 = plt.subplot(1,2,2)
plt.plot(ccp_alphas, train_score, marker='o', label='Train Score', drawstyle='steps-post')
plt.plot(ccp_alphas, test_score, marker='o', label='Test Score', drawstyle='steps-post')
ax2.legend()
ax2.set_title('ZOOM IN')
plt.xlim(.0,.015)
The graph shows that ccp_alpha 0.013 might be the best one to pick. But how does this alpha perform on different train/test splits?
Cross Validation using Best CCP Alpha¶
Testing our newly found alpha, 0.013.
# use random_state=0, the same state as our original tree
# we're just applying our new ccp_alpha here
dt = DecisionTreeClassifier(random_state=0, ccp_alpha=0.013)
# 5 fold cross validation
scores = cross_val_score(dt, x_train, y_train,cv=5)
scores
df_cv = pd.DataFrame(data={'tree': range(5), 'accuracy': scores})
df_cv
df_cv.plot('tree', 'accuracy', linestyle='--', marker='o' )
The graph shows that with different training/validation folds and the same alpha (0.013), the accuracy fluctuates from about 0.86 up to 0.93.
Check again the different alphas with cross validation¶
Using 5-fold cross validation.
alpha_cv = []
for i in ccp_alphas:
    # random_state = 1, for checking
    dt = DecisionTreeClassifier(random_state=1, ccp_alpha=i)
    scores = cross_val_score(dt, x_train, y_train, cv=5)
    alpha_cv.append([i, np.mean(scores), np.std(scores)])
df_cv2 = pd.DataFrame(alpha_cv, columns=['alpha', 'accuracy_mean', 'std'])
# remove the last row (the largest alpha prunes the tree down to a single node and is an outlier)
df_cv2 = df_cv2[:-1]
df_cv2.head(3)
df_cv2.plot(x = 'alpha',
y = 'accuracy_mean',
yerr = 'std',
linestyle = '--',
marker = 'o' )
df_cv2.sort_values('accuracy_mean',ascending=False).head(5)
Based on the cross-validation results, instead of alpha 0.013 we can use the alpha value at index 4, which might give better overall performance. It can also be selected programmatically, as sketched below.
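A small optional sketch for picking the best alpha from the df_cv2 table built above, rather than reading it off by hand:
# Pick the alpha with the highest mean cross-validated accuracy
best_row = df_cv2.sort_values('accuracy_mean', ascending=False).iloc[0]
best_alpha = best_row['alpha']
print(best_alpha, best_row['accuracy_mean'])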
5. Pruned Decision Tree¶
Decision Tree with alpha 0.003487
# note: I played with different max_depth values here to get the best accuracy score,
# but we could also loop over candidate depths (see the sketch after this cell).
dt = DecisionTreeClassifier(random_state=0, ccp_alpha = 0.003487, max_depth=3)
dt.fit(x_train, y_train)
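As mentioned in the comment above, the depth can also be searched with a small loop instead of manual trial and error. A sketch under assumptions (the candidate depths are arbitrary, and it scores on the test set the same way the manual tuning does):
# Try several depths at the chosen ccp_alpha and compare test accuracy
for depth in [2, 3, 4, 5, None]:
    dt_depth = DecisionTreeClassifier(random_state=0, ccp_alpha=0.003487, max_depth=depth)
    dt_depth.fit(x_train, y_train)
    print(depth, dt_depth.score(x_test, y_test))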
plt.figure(figsize=(15, 7.5))
tree.plot_tree(dt, feature_names = x_train.columns,
max_depth=None, class_names = ['unlikely admit', 'likely admit'],
label='root', filled=True, rounded = True)
plt.show()
Decision Tree Interpretation
- Left branch = condition is True
- Right branch = condition is False
Unlikely admit
- Applicants with a CGPA less than or equal to 8.845 are unlikely to be admitted.
- Even with a higher CGPA, applicants with a university rating of 2.0 or lower are unlikely to be admitted.
Likely Admit
- Applicants with a CGPA higher than 8.845 also need a Letter of Recommendation (LOR) score above 3.25, provided the university rating is greater than 2.0.
- Applicants with a CGPA above 9.065 are the exception and are likely to be admitted even with a lower LOR score.
- Applicants with a CGPA between 8.846 and 9.065 need a high LOR score (above 3.25).
y_pred = dt.predict(x_test)
# print(dt.score(x_test, y_test)) # .score is the same as accuracy_score
print(accuracy_score(y_test, y_pred))
# Confusion Matrix
model_matrix = confusion_matrix(y_test, y_pred)
print(model_matrix)
# Visualize
fig, ax = plt.subplots(figsize=(8,5))
# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
6. Random Forest Implementation¶
Instead of using one tree, we can use 10, 100, or more trees. Each tree casts a vote on whether each point is True or False, and the majority becomes the final output (scikit-learn actually averages the per-tree predicted probabilities, which acts like a weighted vote; a quick check of this appears after the code below).
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, max_depth=3)
x_train2, x_test2, y_train2, y_test2 = train_test_split(X,y, random_state=0, test_size=0.3)
rf.fit(x_train2, y_train2)
y_pred2 = rf.predict(x_test2)
model_matrix2 = confusion_matrix(y_test2, y_pred2)
print(model_matrix2)
rf.score(x_test2, y_test2)
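As noted above, scikit-learn’s random forest combines trees by averaging their predicted class probabilities rather than counting hard votes. A small sketch to confirm this with the rf fitted above (estimators_ and predict_proba are standard attributes; .to_numpy() avoids feature-name warnings when calling the individual trees):
# Average the per-tree class probabilities and compare with the forest's own predict_proba
per_tree_proba = np.mean([est.predict_proba(x_test2.to_numpy()) for est in rf.estimators_], axis=0)
forest_proba = rf.predict_proba(x_test2)
print(np.allclose(per_tree_proba, forest_proba))  # expected: True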
Visualize random forest (all trees)
# for i in range(len(rf.estimators_)):
# plt.figure(figsize = (15,15))
# tree.plot_tree(rf.estimators_[i] , filled =True)
# plt.show()
7. Conclusion¶
===== WIP =======
- fin