Decision Tree/Random Forest

Decision Tree

Admission to masters’ degree programs (likely or unlikely?)

About the Project:

A Decision Tree classification model that helps students shortlist universities based on their profiles. The predicted output gives them a fair idea of whether they are likely to be admitted.

About the Dataset:

The data contains features commonly used in determining admission to masters’ degree programs, such as GRE, GPA, and letters of recommendation. The complete list of features is summarized below:

  • GRE Scores ( out of 340 )
  • TOEFL Scores ( out of 120 )
  • University Rating ( out of 5 )
  • SOP/Statement of Purpose ( out of 5 )
  • LOR/Letter of Recommendation ( out of 5 )
  • Undergraduate GPA ( out of 10 )
  • Research Experience ( either 0 or 1 )
  • Chance of Admit ( ranging from 0 to 1 )

Content:

  1. Import Libraries and Load Dataset
  2. Preliminary Decision Tree
  3. Evaluation
  4. Decision Tree Pruning
  5. Pruned Decision Tree
  6. Random Forest Implementation
  7. Conclusion

1. Import Libraries and Load Dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn import tree
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
In [2]:
df = pd.read_csv("Admission_Predict.csv")
print(df.shape)
df.head(2)
(400, 9)
Out[2]:
   Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  Research  Chance of Admit
0           1        337          118                  4  4.5  4.5  9.65         1             0.92
1           2        324          107                  4  4.0  4.5  8.87         1             0.76
In [3]:
# Clean column names: strip whitespace, replace spaces with underscores, lowercase
df.columns = df.columns.str.strip().str.replace(' ','_').str.lower()
df.head(1)
Out[3]:
   serial_no.  gre_score  toefl_score  university_rating  sop  lor  cgpa  research  chance_of_admit
0           1        337          118                  4  4.5  4.5  9.65         1             0.92

2. Preliminary Decision Tree

NOTE: scikit-learn's decision trees do not support categorical features directly; categorical data must be encoded as numbers first.

As a first step, we create a binary class (1 = admission likely, 0 = admission unlikely) from the chance of admit: a chance of 0.79 or higher counts as likely. The columns from gre_score through research are used as predictors.
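A minimal sketch of this labelling step, using three made-up values rather than rows from the dataset. It shows that the comparison is inclusive, so a chance of exactly 0.79 lands in the likely class.

```python
import pandas as pd

# Hypothetical chance-of-admit values, chosen only to show the boundary case
chance_of_admit = pd.Series([0.92, 0.76, 0.79])

# True = admission likely, False = admission unlikely
likely = chance_of_admit >= 0.79
labels = likely.tolist()
print(labels)
```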

In [4]:
# Define X and y
X = df.loc[:,'gre_score':'research']
y = df['chance_of_admit']>=.79
In [5]:
# Data split
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.3)

Note:

A decision tree only compares each feature against a threshold as it branches, so the result depends on the ordering of the values rather than their scale. Normalization would not help here.
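One way to check this claim is to fit the same tree on raw and on rescaled features and compare the results. The sketch below uses synthetic stand-in data with a made-up labelling rule, not the admissions dataset, and the choice of MinMaxScaler is only an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: integer GRE-like scores and CGPA-like scores in steps of 0.01
rng = np.random.default_rng(0)
gre = rng.integers(290, 341, size=200)
cgpa = rng.integers(700, 1000, size=200) / 100
X = np.column_stack([gre, cgpa])
y = (cgpa + 0.02 * (gre - 315) >= 8.8).astype(int)  # made-up labelling rule

# Rescale every feature to [0, 1]; the ordering of values within a feature is unchanged
X_scaled = MinMaxScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Both trees split on the same features in the same order ...
same_structure = bool(np.array_equal(tree_raw.tree_.feature, tree_scaled.tree_.feature))
# ... and assign every training row to the same class
same_predictions = bool(np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled)))
print(same_structure, same_predictions)
```

Only the threshold values differ between the two trees, because they are expressed in the units of the rescaled features.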

Decision Tree Classifier Object

In [6]:
# all parameters are left at their defaults, except random_state for reproducibility
dt = DecisionTreeClassifier(random_state=0)
In [7]:
# Fit training data
dt.fit(x_train, y_train)
Out[7]:
DecisionTreeClassifier(random_state=0)

Visualization

In [8]:
plt.figure(figsize=(15, 7.5))

tree.plot_tree(dt, 
               feature_names = list(x_train.columns),
               max_depth=None,
               class_names = ['unlikely admit', 'likely admit'],
               label='root',
               filled=True,
               rounded = True)
plt.show()

Questions

  1. How does a decision tree choose its root node?
  2. How was the split cgpa <= 8.845 determined?
  • The decision tree iterates through every feature, and every candidate split value of that feature, and computes the Gini impurity of the resulting branches. The split with the lowest weighted impurity (equivalently, the highest information gain) becomes the root node.

  • cgpa is a continuous variable, which adds an extra complication, because the split can occur at ANY value of cgpa. We could also apply feature engineering, such as encoding cgpa as ‘pass’ or ‘fail’, to make this easier.

  • To verify, we will use the functions gini and info_gain defined below. Running gini(y_train) gives the same Gini impurity as printed at the root node of the tree, 0.461.

In [9]:
# Gini function
def gini(data):
    """Calculate the Gini Impurity Score
    """
    data = pd.Series(data)
    return 1 - sum(data.value_counts(normalize=True)**2)
In [10]:
gi = gini(y_train)
# gini impurity at root node
gi
Out[10]:
0.46119897959183676

Next, we are going to verify how the split on cgpa was determined, i.e. where the 8.845 value came from. We will apply the info_gain function over ALL values of cgpa to determine the information gain of a split at each value.

The results are stored in a table and sorted, and voilà, the top split value is 8.84. scikit-learn places the threshold halfway between 8.84 and the next observed value, 8.85, which is why the tree shows cgpa <= 8.845. The same is done for every other feature (and, for continuous ones, every value) to find the top split overall.

In [11]:
# Info_gain function
def info_gain(left, right, current_impurity):
    """Information Gain associated with creating a node/split data.
    Input: left, right are data in left branch, right branch, respectively
    current_impurity is the data impurity before splitting into left, right branches
    """
    # weight for gini score of the left branch
    w = float(len(left)) / (len(left) + len(right))
    return current_impurity - w * gini(left) - (1 - w) * gini(right)
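Written out, the two functions above compute the following quantities. The symbols p_k, n_L, and n_R are notation introduced here for illustration; they are not names from the notebook.

```latex
% Gini impurity of a node whose classes occur with proportions p_k
G = 1 - \sum_{k} p_k^{2}

% Information gain of splitting a parent node into a left branch with n_L samples
% and a right branch with n_R samples
\Delta G = G_{\text{parent}} - w\,G_{\text{left}} - (1 - w)\,G_{\text{right}},
\qquad w = \frac{n_L}{n_L + n_R}
```

For two classes the first formula reduces to G = 2p(1 - p), so the root impurity of 0.461 corresponds to a class split of roughly 36% to 64%.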
In [12]:
# Information gain list
info_gain_list = []
for i in x_train.cgpa.unique():
    left = y_train[x_train.cgpa <= i]
    right = y_train[x_train.cgpa > i]
    
    # here we call the functions 'info_gain' and 'gini'
    info_gain_list.append([i, info_gain(left, right, gi)])

# show info_gain_list
# print(info_gain_list)

# convert to dataframe for better viewing
ig_table = pd.DataFrame(info_gain_list, columns=['split_value', 'info_gain']).sort_values('info_gain',ascending=False)
ig_table.head(3)
Out[12]:
split_value info_gain
119 8.84 0.316617
80 8.83 0.310835
111 8.85 0.310549

To summarize, the root node has a Gini impurity of 0.461, and among all candidate cgpa values a split at 8.84 gives the highest information gain (0.317), so that is the split value chosen.
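The small gap between the best observed value (8.84) and the threshold shown in the tree (8.845) can be reproduced on toy data. The sketch below uses six hypothetical rows, restates gini and info_gain so that it runs on its own, and compares the manual search with a depth-1 scikit-learn tree.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def gini(labels):
    """Gini impurity of a collection of class labels."""
    proportions = pd.Series(labels).value_counts(normalize=True)
    return 1 - (proportions ** 2).sum()


def info_gain(left, right, current_impurity):
    """Impurity reduction from splitting a node into left and right branches."""
    w = len(left) / (len(left) + len(right))
    return current_impurity - w * gini(left) - (1 - w) * gini(right)


# Toy data (hypothetical, not rows from the dataset)
cgpa = np.array([8.00, 8.50, 8.84, 8.85, 9.00, 9.50])
admit = np.array([0, 0, 0, 1, 1, 1])

root_impurity = gini(admit)

# Score the split "cgpa <= value" for every observed value except the largest,
# which would leave the right branch empty
gains = {v: info_gain(admit[cgpa <= v], admit[cgpa > v], root_impurity)
         for v in np.unique(cgpa)[:-1]}
best_value = max(gains, key=gains.get)

# A depth-1 tree fitted on the same data
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(cgpa.reshape(-1, 1), admit)
sklearn_threshold = stump.tree_.threshold[0]

print(best_value)         # best observed split value
print(sklearn_threshold)  # threshold reported by scikit-learn
```

The manual search can only return values that occur in the data, while the fitted tree reports the midpoint between the two neighbouring values on either side of the split.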

Visualizing split value vs. information gain

In [13]:
# Visualise information gain
plt.plot(ig_table['split_value'], ig_table['info_gain'],'o')
plt.plot(ig_table['split_value'].iloc[0], ig_table['info_gain'].iloc[0],'r*')
plt.xlabel('cgpa split value')
plt.ylabel('info gain')
Out[13]:
Text(0, 0.5, 'info gain')

3. Evaluation

Model Accuracy

In [14]:
y_pred = dt.predict(x_test)
# print(dt.score(x_test, y_test)) # .score is the same as accuracy_score
print(accuracy_score(y_test, y_pred))
0.8
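Accuracy alone hides how the errors are distributed between the two classes. As a minimal sketch, the counts from the confusion matrix reported for this model in this section ([[69, 12], [12, 27]]) can be turned into precision and recall for the likely class by hand.

```python
import numpy as np

# Confusion matrix for the unpruned tree: rows are true classes, columns are predictions
cm = np.array([[69, 12],
               [12, 27]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted "likely", how many were correct
recall = tp / (tp + fn)      # of the actual "likely", how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Precision and recall for the likely class are both 27/39, about 0.69, which is noticeably lower than the 0.80 accuracy.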
In [15]:
# Confusion Matrix
model_matrix = confusion_matrix(y_test, y_pred)
print(model_matrix)

# Visualize
fig, ax = plt.subplots(figsize=(8,5))

# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')

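# alternative using ConfusionMatrixDisplay (plot_confusion_matrix was removed in scikit-learn 1.2)
# uncomment to show
# from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay.from_estimator(dt, x_test, y_test, display_labels=['Unlikely Admit', 'Likely Admit'], cmap='Purples')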
[[69 12]
 [12 27]]
Out[15]:
<AxesSubplot:>

4. Decision Tree Pruning

Best CCP_alpha

In [16]:
pruning = dt.cost_complexity_pruning_path(x_train, y_train)
# pruning

ccp_alphas = pruning.ccp_alphas
impurities = pruning.impurities
# ccp_alphas
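To see what the pruning path represents, the sketch below refits one tree per alpha and counts the surviving nodes. It uses synthetic stand-in data from make_classification, not the admissions dataset, and the clipping step is a defensive assumption against tiny negative alphas from floating-point rounding.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the admissions dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)

# Guard against tiny negative alphas caused by floating-point rounding
alphas = np.clip(path.ccp_alphas, 0, None)

# Refit one tree per alpha and record how many nodes survive pruning
node_counts = []
for alpha in alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    node_counts.append(pruned.tree_.node_count)

print(node_counts[0], node_counts[-1])
```

Larger alphas remove more of the tree, and the last alpha on the path prunes it down to the root node alone.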
In [17]:
dt_list = []
for i in ccp_alphas:
    # random_state = 1 for checking
    dt= DecisionTreeClassifier(random_state=1, ccp_alpha= i)
    dt.fit(x_train, y_train)
    dt_list.append(dt)

# dt_list
In [18]:
train_score = [dt.score(x_train, y_train) for dt in dt_list]
test_score  = [dt.score(x_test, y_test) for dt in dt_list]
In [19]:
plt.figure(figsize=(15, 5)) 

ax1 = plt.subplot(1,2,1)
plt.plot(ccp_alphas, train_score, marker='o', label='Train Score', drawstyle='steps-post')
plt.plot(ccp_alphas, test_score, marker='o', label='Test Score', drawstyle='steps-post')

ax1.set_title('CCP Alphas vs. Accuracy for Training and Testing Dataset')
ax1.set_xlabel('CCP Alphas')
ax1.set_ylabel('Accuracy')
ax1.legend()

#### ZOOM IN 
ax2 = plt.subplot(1,2,2)
plt.plot(ccp_alphas, train_score, marker='o', label='Train Score', drawstyle='steps-post')
plt.plot(ccp_alphas, test_score, marker='o', label='Test Score', drawstyle='steps-post')
ax2.legend()
ax2.set_title('ZOOM IN')
plt.xlim(.0,.015)
Out[19]:
(0.0, 0.015)

The graph shows that a CCP_alpha of about 0.013 might be the best one to pick. But how does this alpha perform on different train/test splits?

Cross Validation using Best CCP Alpha

Testing our newly found alpha, 0.013.

In [20]:
# use random_state=0, the same state as our original tree
# we are just applying our new ccp_alpha here
dt= DecisionTreeClassifier(random_state=0, ccp_alpha= .013)
In [21]:
# 5 fold cross validation
scores = cross_val_score(dt, x_train, y_train,cv=5)
scores
Out[21]:
array([0.91071429, 0.92857143, 0.89285714, 0.91071429, 0.85714286])
In [22]:
df_cv = pd.DataFrame(data={'tree': range(5), 'accuracy': scores})
df_cv
Out[22]:
tree accuracy
0 0 0.910714
1 1 0.928571
2 2 0.892857
3 3 0.910714
4 4 0.857143
In [23]:
df_cv.plot('tree', 'accuracy', linestyle='--', marker='o' )
Out[23]:
<AxesSubplot:xlabel='tree'>

The graph shows that, with the same alpha (0.013) but different training and validation folds, the accuracy fluctuates between 0.86 and 0.93.

Check again the different alphas with cross validation

Using 5-fold cross-validation

In [24]:
alpha_cv = []

for i in ccp_alphas:
    # random_state = 1, for checking
    dt= DecisionTreeClassifier(random_state=1, ccp_alpha= i)
    scores = cross_val_score(dt, x_train, y_train,cv=5)
    alpha_cv.append([i, np.mean(scores), np.std(scores)])
In [25]:
df_cv2 = pd.DataFrame(alpha_cv, columns=['alpha', 'accuracy_mean', 'std'])
# remove the last row: the largest alpha prunes the tree down to a single node
df_cv2  = df_cv2[:-1]
df_cv2.head(3)
Out[25]:
alpha accuracy_mean std
0 0.000000 0.907143 0.013363
1 0.002350 0.921429 0.021429
2 0.002857 0.921429 0.021429
In [26]:
df_cv2.plot(x = 'alpha',
            y = 'accuracy_mean',
            yerr = 'std',
            linestyle = '--',
            marker = 'o' )
Out[26]:
<AxesSubplot:xlabel='alpha'>
In [27]:
df_cv2.sort_values('accuracy_mean',ascending=False).head(5)
Out[27]:
alpha accuracy_mean std
3 0.003459 0.928571 0.015972
4 0.003487 0.928571 0.015972
5 0.004898 0.928571 0.022588
6 0.005952 0.921429 0.018211
1 0.002350 0.921429 0.021429

Based on cross-validation, instead of alpha 0.013 we can use the alpha at index 4 (0.003487), which ties for the highest mean accuracy (0.929) and might perform better overall.
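The selection above can also be done programmatically. The sketch below is one possible approach on synthetic stand-in data, not the admissions dataset. The tie-breaking rule (prefer the largest tied alpha, i.e. the most heavily pruned tree) is an assumption made here, not something the notebook prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the admissions dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
# Drop the last alpha (it prunes the tree to a single node) and clip rounding noise
alphas = np.clip(path.ccp_alphas[:-1], 0, None)

# Mean 5-fold accuracy for each candidate alpha
mean_scores = np.array([
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in alphas
])

# Among alphas tied for the best mean accuracy, take the largest one
tied = np.flatnonzero(np.isclose(mean_scores, mean_scores.max()))
best_alpha = alphas[tied].max()
print(best_alpha, mean_scores.max())
```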

5. Pruned Decision Tree

Decision Tree with alpha 0.003487

In [28]:
# note: I tried different max_depth values here to get the best accuracy score,
# but a for loop over candidate depths would also work.
dt = DecisionTreeClassifier(random_state=0, ccp_alpha = 0.003487, max_depth=3)
dt.fit(x_train, y_train)
Out[28]:
DecisionTreeClassifier(ccp_alpha=0.003487, max_depth=3, random_state=0)
In [29]:
plt.figure(figsize=(15, 7.5))
tree.plot_tree(dt, feature_names = x_train.columns,  
               max_depth=None, class_names = ['unlikely admit', 'likely admit'],
               label='root', filled=True, rounded = True)
plt.show()

Decision Tree Interpretation

  • Left side = True
  • Right side = False

Unlikely admit

  1. Applicants with a CGPA less than or equal to 8.845 are unlikely to be admitted.
  2. Applicants with a university rating less than or equal to 2.0 are unlikely to be admitted.

Likely Admit

  1. Applicants with a CGPA higher than 8.845 need a Letter of Recommendation (LOR) score higher than 3.25, given that the LOR is from a highly rated university (university rating must be greater than 2.0).
  2. Applicants with a CGPA above 9.065 are exempt from the LOR requirement.
  3. Applicants with a CGPA between 8.846 and 9.065 need a high LOR score (higher than 3.25).
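Rules like these can also be read off as text instead of from the plot. The sketch below is a minimal example using scikit-learn's export_text on a hypothetical six-row table, not the fitted tree from this notebook.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical, not rows from the dataset)
toy = pd.DataFrame({
    "cgpa": [8.0, 8.5, 8.8, 8.9, 9.1, 9.4],
    "lor":  [3.0, 4.5, 3.5, 4.0, 3.0, 4.5],
})
admit = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy, admit)

# One line per node; indentation shows the depth of each rule
rules = export_text(clf, feature_names=list(toy.columns))
print(rules)
```

Each indented line is a condition, and each line starting with "class" is the prediction for rows that satisfy all the conditions above it.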
In [30]:
y_pred = dt.predict(x_test)
# print(dt.score(x_test, y_test)) # .score is the same as accuracy_score
print(accuracy_score(y_test, y_pred))
0.85
In [31]:
# Confusion Matrix
model_matrix = confusion_matrix(y_test, y_pred)
print(model_matrix)

# Visualize
fig, ax = plt.subplots(figsize=(8,5))

# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
[[74  7]
 [11 28]]
Out[31]:
<AxesSubplot:>

6. Random Forest Implementation

Instead of using a single tree, we can use 10, 100, or more trees. Each tree votes on whether each point is true or false, and the majority vote becomes the final output.
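The voting idea can be checked directly by collecting each tree's prediction and counting. The sketch below uses synthetic stand-in data, not the admissions dataset. One caveat: scikit-learn's RandomForestClassifier averages the trees' predicted probabilities rather than counting hard votes, so the two only coincide exactly when every tree is fully grown, which is why max_depth is left unset here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the admissions dataset)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An odd number of fully grown trees, so a two-class vote cannot tie
rf = RandomForestClassifier(n_estimators=11, random_state=0).fit(X_train, y_train)

# Each row holds one tree's 0/1 prediction for every test point
votes = np.stack([t.predict(X_test) for t in rf.estimators_])

# Class 1 wins when more than half of the trees vote for it
majority = (votes.mean(axis=0) > 0.5).astype(int)

agrees = bool(np.array_equal(majority, rf.predict(X_test)))
print(agrees)
```

With shallow trees such as max_depth=3, the leaves are usually impure, so the averaged probabilities can occasionally disagree with a plain vote count.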

In [124]:
from sklearn.ensemble import RandomForestClassifier
In [125]:
rf = RandomForestClassifier(n_estimators=10, max_depth=3) 
In [126]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(X,y, random_state=0, test_size=0.3)
rf.fit(x_train2, y_train2)
Out[126]:
RandomForestClassifier(max_depth=3, n_estimators=10)
In [127]:
y_pred2 = rf.predict(x_test2)
model_matrix2 = confusion_matrix(y_test2, y_pred2)
print(model_matrix2)
[[75  6]
 [ 7 32]]
In [128]:
rf.score(x_test2, y_test2)
Out[128]:
0.8916666666666667

Visualize random forest (all trees)

In [129]:
# for i in range(len(rf.estimators_)):
#         plt.figure(figsize = (15,15))
#         tree.plot_tree(rf.estimators_[i] , filled =True)
#         plt.show()

7. Conclusion

===== WIP =======

  • fin
