Classification¶
Survival prediction of patients with heart failure.
Data Source:
Heart Failure
In this project, we will use a dataset from Kaggle to predict the survival of patients with heart failure from serum creatinine, ejection fraction, and other factors such as age, anaemia, diabetes, and so on.
About The Dataset:
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.
Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.
People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.
Data Dictionary:
- Feature details
- Boolean features
- Sex – 0 = Female, 1 = Male
- Diabetes – 0 = No, 1 = Yes
- Anaemia – 0 = No, 1 = Yes
- High_blood_pressure – 0 = No, 1 = Yes
- Smoking – 0 = No, 1 = Yes
- DEATH_EVENT – 0 = No, 1 = Yes
- Other information
    - mcg/L: micrograms per liter.
    - mL: milliliters.
    - mEq/L: milliequivalents per liter.
- The time feature seems to be highly correlated with the death event, but there is no concrete information on how this metric was measured for each patient, which makes it hard to use in the analysis (we take a quick look at this correlation after loading the data below).
Load Libraries and Data Inspection¶
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score, classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, Dropout
from tensorflow.keras.constraints import MaxNorm
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV
# from collections import Counter
from tensorflow.keras.utils import to_categorical
from sklearn.compose import ColumnTransformer
df = pd.read_csv('heart_failure.csv')
print(df.shape)
df.head()
In this dataset, we will use the column death_event as our class label.
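As a quick, optional sanity check on the Data Dictionary note about the time feature, here is a minimal sketch of its correlation with the label. pd.factorize is used only in case death_event is stored as strings rather than 0/1 in this CSV (an assumption); the sign of the correlation depends on the code order, so the magnitude is what matters.
# Optional sanity check: correlation between `time` and the label.
# pd.factorize turns the label into integer codes whether it is stored as strings or 0/1.
label_codes = pd.factorize(df['death_event'])[0]
print(df['time'].corr(pd.Series(label_codes, index=df.index)))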
Class Distribution (death_event)¶
Only 32% of our samples did not survive heart failure.
print(df['death_event'].value_counts())
# proportion
df['death_event'].value_counts() / len(df)
We have a very small and slightly imbalanced dataset, so we can't confidently use accuracy as our scoring method.
Note:
For an imbalanced dataset we should use recall, precision, and F1 for scoring.
Recall:
The number of correct positive predictions out of the total actual positives. Use it when we need to reduce false negatives, as in cancer detection: we don't want our model to predict a person as negative while the truth is positive.
Precision:
The number of actual positives out of the total positive predictions. Use it when we need to reduce false positives, as in spam detection: we don't want an important mail to be flagged as spam.
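To make these definitions concrete, here is a tiny sketch with made-up counts (not from this dataset):
# Toy example with made-up counts: 8 true positives, 2 false negatives, 4 false positives
TP, FN, FP = 8, 2, 4
recall = TP / (TP + FN)        # 0.80 - share of actual positives we caught
precision = TP / (TP + FP)     # ~0.67 - share of positive predictions that were correct
f1 = 2 * precision * recall / (precision + recall)
print(recall, precision, f1)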
General Information¶
df.info()
Anaemia, diabetes, high_blood_pressure, sex, and smoking are of 'str' datatype. For TensorFlow with Keras, we need to convert all the categorical features and labels into one-hot encoded vectors.
Define X and y¶
Let's first define our label and features. One advantage of a neural network over traditional machine learning algorithms is that we don't need to do feature selection: the network learns from its data and adjusts the feature weights accordingly.
y = df['death_event']
X = df.loc[:,'age':'time']
X.head(3)
One-Hot Encoding¶
One-hot encode our X features: convert the categorical features (anaemia, diabetes, high_blood_pressure, sex, and smoking) to one-hot encoded vectors and assign the result back to the variable X.
X = pd.get_dummies(X)
X.head(3)
Split data¶
Split the data into train and test sets with a 70:30 ratio.
x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=7)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
Standardization¶
Standardize our features.
ct = ColumnTransformer([("numeric", StandardScaler(), x_train.columns)])
print(ct)
x_train = ct.fit_transform(x_train)
x_test = ct.transform(x_test)
Label Encoder¶
Convert the class labels to integers ranging from 0 to the number of classes minus 1.
lencoder = LabelEncoder()
y_train = lencoder.fit_transform(y_train.astype(str))
y_test = lencoder.transform(y_test.astype(str))  # use the mapping fitted on the training labels
# Show mapping
class_mapping = {l: i for i, l in enumerate(lencoder.classes_)}
class_mapping
Reshape to a 2D Array¶
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
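to_categorical turns each integer label into a one-hot row, so the label arrays become 2D; a tiny illustration:
# Small illustration of what to_categorical does:
# [0, 1, 1, 0]  ->  [[1, 0], [0, 1], [0, 1], [1, 0]]
print(to_categorical([0, 1, 1, 0]))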
Designing a Deep Learning Model For Classification¶
The model is optimized using the adam optimizer and seeks to minimize the categorical cross-entropy loss. We use Recall as our metric because we need to reduce the false negative rate: we don't want our model to predict a person as negative while the truth is positive.
model = Sequential()
# input layer
model.add(InputLayer(input_shape=(x_train.shape[1],)))
# hidden layer1
model.add(Dense(12, activation='relu'))
# output layer
# supposedly we would use 'sigmoid' and 'binary_crossentropy' here,
# but I will try softmax with two output neurons (we have two classes).
model.add(Dense(2, activation='softmax'))
# compile
# I want to monitor the recall score for this instance
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['Recall'])
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
Early Stopping¶
# reference https://keras.io/api/callbacks/early_stopping/
stop = EarlyStopping(monitor='val_loss',
                     patience=20,
                     verbose=1)
Training Phase¶
# fix random seed for reproducibility
tf.random.set_seed(7)
h = model.fit(x_train, y_train,
              validation_data=(x_test, y_test),
              epochs=300,
              batch_size=64,
              verbose=1,
              callbacks=[stop])
Note:
We will get slightly different results if we re-run the training phase multiple times, due to the stochastic nature of the algorithm (weight initialization, callbacks, and validation data).
Consider running the training phase a few times and comparing the average outcome, as sketched below.
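A minimal sketch of that idea; build_model here is a hypothetical helper that simply mirrors the architecture defined above, and this loop was not part of the original run:
# Sketch: repeat training a few times and average the test recall,
# since callbacks and random initialization make single runs slightly noisy.
def build_model():
    m = Sequential()
    m.add(InputLayer(input_shape=(x_train.shape[1],)))
    m.add(Dense(12, activation='relu'))
    m.add(Dense(2, activation='softmax'))
    m.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['Recall'])
    return m

recalls = []
for run in range(3):
    m = build_model()
    m.fit(x_train, y_train, validation_data=(x_test, y_test),
          epochs=300, batch_size=64, verbose=0, callbacks=[stop])
    recalls.append(m.evaluate(x_test, y_test, verbose=0)[1])  # [loss, recall]
print(np.mean(recalls))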
h.history.keys()
Plotting for Recall Score¶
#plotting for Recall
fig, axs = plt.subplots(1, 2,
                        figsize=(15, 6),
                        gridspec_kw={'hspace': 0.5, 'wspace': 0.2})
(ax1, ax2) = axs
# Categorical cross-entropy
ax1.plot(h.history['loss'], label='Train')
ax1.plot(h.history['val_loss'], label='Validation')
ax1.set_title('Categorical cross-entropy for train vs validation set')
ax1.legend(loc="upper right")
ax1.set_xlabel("# of epochs")
ax1.set_ylabel("Categorical Cross-Entropy")
# Recall
ax2.plot(h.history['recall'], label='Train')
ax2.plot(h.history['val_recall'], label='Validation')
ax2.set_title('Recall for train vs validation set')
ax2.legend(loc="upper right")
ax2.set_xlabel("# of epochs")
ax2.set_ylabel("Recall")
Test Loss and Recall¶
loss, rec = model.evaluate(x_test, y_test, verbose=0)  # second value is Recall (the compiled metric)
print("Loss:", loss, "Recall:", rec)
Evaluation¶
y_pred = model.predict(x_test, verbose=0)
y_pred = np.argmax(y_pred, axis=1)
y_pred
y_true = np.argmax(y_test, axis=1)
y_true
print(classification_report(y_true, y_pred))
model_matrix = confusion_matrix(y_true, y_pred)
model_matrix
# Visualize
fig, ax = plt.subplots(figsize=(8,5))
# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
Recall2 = recall_score(y_true, y_pred)
Recall2
After experimenting with different hyperparameter combinations, we were able to produce a model that is neither overfit nor underfit while keeping the recall score high.
Let's try using GridSearchCV to see if we can reproduce the same hyperparameters and recall score.
Hyperparameter Tuning¶
We will use GridSearchCV to find optimal hyperparameter values; however, this method is very computationally expensive (there is a technique that can improve training/grid-search speed, but we will not use it here). We will search each hyperparameter individually with a minimal set of search values.
Data Preparation for GridSearchCV¶
df = pd.read_csv('heart_failure.csv')
y = df['death_event']
X = df.loc[:,'age':'time']
#
X = pd.get_dummies(X)
#
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#
ct = ColumnTransformer([("numeric", StandardScaler(), X_train.columns)])
#
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)
#
le= LabelEncoder()
Y_train = le.fit_transform(Y_train.astype(str))
Y_test = le.transform(Y_test.astype(str))
Batch Size and Epochs¶
Batch size defines how many samples to read at a time and keep in memory. The number of epochs is the number of times the entire training dataset is shown to the network during training.
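As a rough illustration of how the two interact (assuming roughly 209 training rows from the 70/30 split of 299 samples), the number of weight updates per epoch is the number of rows divided by the batch size, rounded up:
# Rough illustration: weight updates per epoch and in total
import math
n_rows, batch_size, epochs = X_train.shape[0], 64, 80    # ~209 rows after the split
updates_per_epoch = math.ceil(n_rows / batch_size)       # ~4 with 209 rows
print(updates_per_epoch, updates_per_epoch * epochs)     # ~4 per epoch, ~320 in total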
# Function to create the model, required for KerasClassifier
def create_model():
    model = Sequential()
    model.add(InputLayer(input_shape=(X_train.shape[1],)))
    model.add(Dense(12, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# create model
model = KerasClassifier(model=create_model, verbose=0)
Note: We will do a small search over epochs and batch sizes here. Unfortunately, my computer can't handle the computational load of searching over a larger grid.
But I did run this grid search in Google Colab. Here's the link for reference:
# define the grid search parameters
# See Google Colab
# batch_size = [X_train.shape[1], 64, 100]
# epochs = [50, 80, 100]
# cv = 3
batch_size = [X_train.shape[1], 64]
epochs = [50, 80]
param_grid = dict(
    batch_size=batch_size,
    epochs=epochs
)
grid = GridSearchCV(estimator=model,
                    param_grid=param_grid,
                    n_jobs=-1,
                    verbose=1,
                    cv=2)
grid_result = grid.fit(X_train, Y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
Number of Neurons in the Hidden Layer¶
The number of neurons in a layer is an important parameter to tune. Generally the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.
Also, generally, a large enough single layer network can approximate any other neural network, at least in theory.
# Function for number of neurons in hidden layer 1
def create_model_neurons(neurons):
    model = Sequential()
    model.add(InputLayer(input_shape=(X_train.shape[1],)))
    model.add(Dense(neurons, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# create model
model = KerasClassifier(model=create_model_neurons,
                        epochs=80,
                        batch_size=17,
                        verbose=0)
# define the grid search parameters
neurons = [1, 5, 10, 15, 20]
param_grid = dict(
    model__neurons=neurons
)
grid = GridSearchCV(estimator=model,
                    param_grid=param_grid,
                    n_jobs=-1,
                    cv=2)
grid_result = grid.fit(X_train, Y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
Dropout Regularization¶
Dropout limits overfitting and improves the model's ability to generalize. For best results, dropout is combined with a weight constraint such as the max-norm constraint.
# Function for dropout value in hidden layer 1
def create_model_dropout(dropout_rate, weight_constraint):
    model = Sequential()
    # input layer
    model.add(InputLayer(input_shape=(X_train.shape[1],)))
    # hidden layer 1
    model.add(Dense(15, activation='relu', kernel_constraint=MaxNorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    # output layer
    model.add(Dense(1, activation='sigmoid'))
    # compile
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# create model
model = KerasClassifier(model=create_model_dropout,
                        epochs=80,
                        batch_size=17,
                        verbose=0)
# define the grid search parameters
# weight_constraint = [1.0, 2.0, 3.0, 4.0, 5.0]
# dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
weight_constraint = [1.0, 2.0]
dropout_rate = [0.0, 0.1, 0.2]
param_grid = dict(
    model__dropout_rate=dropout_rate,
    model__weight_constraint=weight_constraint
)
grid = GridSearchCV(estimator=model,
                    param_grid=param_grid,
                    n_jobs=-1,
                    cv=2)
grid_result = grid.fit(X_train, Y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
Neuron Activation Function¶
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
–WIP–
Learning Rate and Momentum¶
–WIP–
To be continued…