Logistic Regression¶
Classification for obesity.
Data source: Obesity survey based on eating habits
In this project, we will analyze data from a survey conducted by Fabio Mendoza Palechor and Alexis de la Hoz Manotas that asked people about their eating habits and weight. The data was obtained from the UCI Machine Learning Repository. Categorical variables were converted to numerical ones to facilitate analysis.
First, we will fit a logistic regression model to try to predict whether survey respondents are obese based on their answers to questions in the survey. After that, we will use three different wrapper methods to choose a smaller feature subset.
We will be using sequential forward floating selection, sequential backward floating selection, and recursive feature elimination. After implementing each wrapper method, we will evaluate the model accuracy on the resulting smaller feature subset and compare it with the model accuracy using all available features.
Data Dictionary
The data set obesity contains 18 predictor variables. Here’s a brief description of them.
- Gender is 1 if a respondent is male and 0 if a respondent is female.
- Age is a respondent’s age in years.
- family_history_with_overweight is 1 if a respondent has a family member who is or was overweight, 0 if not.
- FAVC is 1 if a respondent eats high caloric food frequently, 0 if not.
- FCVC is 1 if a respondent usually eats vegetables in their meals, 0 if not.
- NCP represents how many main meals a respondent has daily (0 for 1-2 meals, 1 for 3 meals, and 2 for more than 3 meals).
- CAEC represents how much food a respondent eats between meals on a scale of 0 to 3.
- SMOKE is 1 if a respondent smokes, 0 if not.
- CH2O represents how much water a respondent drinks on a scale of 0 to 2.
- SCC is 1 if a respondent monitors their caloric intake, 0 if not.
- FAF represents how much physical activity a respondent does on a scale of 0 to 3.
- TUE represents how much time a respondent spends looking at devices with screens on a scale of 0 to 2.
- CALC represents how often a respondent drinks alcohol on a scale of 0 to 3.
- Automobile, Bike, Motorbike, Public_Transportation, and Walking indicate a respondent’s primary mode of transportation. Their primary mode of transportation is indicated by a 1 and the other columns will contain a 0.
The outcome variable, NObeyesdad, is 1 if a respondent is obese and 0 if not.
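The data dictionary says that exactly one of the five transportation columns is 1 for each respondent. A minimal sketch of how that rule could be checked is shown here; the three-row `sample` frame is made up for illustration and is not taken from the survey.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the transportation columns.
sample = pd.DataFrame({
    "Automobile": [1, 0, 0],
    "Bike": [0, 0, 0],
    "Motorbike": [0, 0, 0],
    "Public_Transportation": [0, 1, 0],
    "Walking": [0, 0, 1],
})

transport_cols = ["Automobile", "Bike", "Motorbike",
                  "Public_Transportation", "Walking"]

# Each respondent should have exactly one primary mode of transportation,
# so the five indicator columns should sum to 1 in every row.
rows_ok = sample[transport_cols].sum(axis=1).eq(1)
print(rows_ok.all())
```

The same check could be run on the real data by replacing `sample` with the loaded DataFrame.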
Content¶
- Import and Load Libraries
- Preliminary Logistic Regression
- 2.1 Building a Logistic Regression Model
- 2.2 Model Evaluation
- Feature Selection
- 3.1 Wrapper Methods
- 3.2 Recursive Feature Elimination
- 3.3 SelectKBest
- Final Logistic Regression
- 4.1 Building Final Logistic Regression Model
- 4.2 Final Model Evaluation
- 4.3 Confusion Matrix
- 4.4 Test Run
1. Load and Import Libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, mutual_info_classif, SelectKBest, chi2
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("obesity.csv")
print(df.shape)
df.head(3)
2. Preliminary Logistic Regression¶
We will start by building a logistic regression model with all the available features and compare it later with our final logistic regression model built on the selected features.
2.1 Building a Logistic Regression Model¶
Split the data into X and y¶
X = df.iloc[:,:-1]
y = df['NObeyesdad']
Train Test split¶
Splitting our data into 70% training and 30% testing.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
Normalize the Data¶
Our features must be on the same scale. The scaler is fitted on the training data only and then applied to the test data.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
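To illustrate what the scaling step does, here is a small sketch on toy numbers (not the survey data). The names `train_demo`, `test_demo` and `scaler_demo` are illustrative. The scaler learns the mean and standard deviation from the training rows and reuses them for the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales
# (for example an age in years and a 0/1 flag).
train_demo = np.array([[20.0, 0.0], [30.0, 1.0], [40.0, 1.0], [50.0, 0.0]])
test_demo = np.array([[35.0, 1.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std from train only
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean/std

print(train_scaled.mean(axis=0))  # close to 0 for each column
print(train_scaled.std(axis=0))   # close to 1 for each column
print(test_scaled)
```

Fitting the scaler on the training rows alone keeps information about the test rows out of the preprocessing step.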
Logistic Regression Model¶
Include the parameter max_iter=1000 to make sure that the model will converge when you try to fit it.
lr = LogisticRegression(max_iter=1000)
Fit the model¶
Use the .fit() method on lr to fit the model to x_train and y_train.
lr.fit(x_train, y_train)
2.2 Model accuracy¶
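# Setting y_pred_test as our predictions for the x_test data
y_pred_test = lr.predict(x_test)
# Setting y_pred_train as our predictions for the x_train data
y_pred_train = lr.predict(x_train)
# true value vs prediction on testing set
print(classification_report(y_test,y_pred_test))
# true value vs prediction on training set
print(classification_report(y_train,y_pred_train))
# Saving the preliminary test scores for the comparison in section 4.2
Accuracy1 = accuracy_score(y_test, y_pred_test)
Precision1 = precision_score(y_test, y_pred_test)
Recall1 = recall_score(y_test, y_pred_test)
F1_score1 = f1_score(y_test, y_pred_test)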
# checking the proportion of obese and not obese
df['NObeyesdad'].value_counts() / len(df)
We have a balanced dataset, so we can reasonably use accuracy as our scoring method.
Note: for an imbalanced dataset, recall, precision and f1 are usually more informative than accuracy.
- recall: the number of correct positive predictions out of the total actual positives. Use it when we need to reduce false negatives, as in cancer detection, where we don’t want our model to predict negative for a person who is actually positive.
- precision: the number of correct positive predictions out of the total positive predictions. Use it when we need to reduce false positives, as in spam detection, where we don’t want to flag an important email as spam.
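The definitions of recall and precision can be checked by hand on a small example. The sketch below uses ten made-up labels (not the survey data) and the illustrative names `y_true_demo` and `y_hat_demo`. It compares the manual formulas with scikit-learn's `recall_score` and `precision_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 1 = positive class, 0 = negative class.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, confusion_matrix is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_hat_demo).ravel()

recall_manual = tp / (tp + fn)     # correct positives / all actual positives
precision_manual = tp / (tp + fp)  # correct positives / all predicted positives

print(recall_manual, recall_score(y_true_demo, y_hat_demo))
print(precision_manual, precision_score(y_true_demo, y_hat_demo))
```

Here three of the four actual positives are found (recall 0.75), while three of the five positive predictions are correct (precision 0.6).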
3. Feature Selection¶
3.1 Wrapper Methods¶
Let’s define the variables used by our feature selection methods.
- number_of_features: the number of features to select.
- scoring_method: the metric used to score candidate feature subsets.
  - for classifiers {‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’}
  - for regressors {‘mean_absolute_error’, ‘mean_squared_error’/‘neg_mean_squared_error’, ‘median_absolute_error’, ‘r2’}
Reference: https://scikit-learn.org/stable/modules/model_evaluation.html
Note: We’re fitting the original X and y variables here instead of the training and testing data sets.
# Input variable here:
number_of_features = 8
scoring_method = 'accuracy'
Sequential Forward Floating Selection¶
sffs = SFS(lr,
           k_features=number_of_features,
           forward=True,
           floating=True,
           scoring=scoring_method,
           cv=0)
sffs.fit(X, y)
print('SFFS selected features:')
# Saving a copy of sffs selected features in a list variable
X_sffs = list(sffs.subsets_[number_of_features]['feature_names'])
print(X_sffs)
print(str(scoring_method) + ': ' + str(sffs.subsets_[number_of_features]['avg_score']))
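To make the idea behind sequential forward selection more concrete, here is a hand-written sketch of the plain greedy version. It is not mlxtend's implementation and it leaves out the floating step. The synthetic data from `make_classification`, the helper name `forward_select`, and the use of 3-fold cross-validation are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; this is not the obesity survey.
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

def forward_select(model, X, y, k):
    """Greedily add the feature that gives the best cross-validated accuracy."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            scores[j] = cross_val_score(model, X[:, cols], y,
                                        cv=3, scoring="accuracy").mean()
        best = max(scores, key=scores.get)  # candidate with the highest score
        selected.append(best)
        remaining.remove(best)
    return selected

chosen = forward_select(LogisticRegression(max_iter=1000), X_demo, y_demo, k=3)
print(chosen)
```

The floating variants additionally try removing (or re-adding) a feature after each step, which this sketch does not attempt.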
Sequential Backward Floating Selection¶
sbfs = SFS(lr,
           k_features=1,
           forward=False,
           floating=True,
           scoring=scoring_method,
           cv=0)
sbfs.fit(X, y)
print('SBFS selected features:')
# Saving a copy of sbfs selected features in a list variable
X_sbfs = list(sbfs.subsets_[1]['feature_names'])
print(X_sbfs)
print(str(scoring_method) + ': ' + str(sbfs.subsets_[1]['avg_score']))
Visualize SFFS and SBFS¶
# Visualize sffs
# Visualize in DataFrame (Optional, Uncomment to view)
# pd.DataFrame.from_dict(sffs.get_metric_dict()).T
fig1 = plot_sfs(sffs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Floating Selection (w. StdDev)')
plt.grid()
plt.show()
#Visualize sbfs
fig2 = plot_sfs(sbfs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Backward Floating Selection (w. StdDev)')
plt.grid()
plt.show()
Based on both graphs, eight features appears to be the best choice.
3.2 RFE¶
Recursive feature elimination ranks features by importance, discards the least important features, and re-fits the model. This process is repeated until a specified number of features remains.
Before applying recursive feature elimination, it is necessary to standardize the data.
# Setting a variable 'X_standard' so that our original X features will remain unaffected
# if we rerun this code
X_standard = StandardScaler().fit_transform(X)
rfe = RFE(lr, n_features_to_select=number_of_features)
rfe.fit(X_standard,y)
rfe_features = [f for (f, support) in zip(df.iloc[:,:-1], rfe.support_) if support]
print('RFE selected features:')
X_rfe = rfe_features
print(X_rfe)
print(rfe.score(X_standard,y))
RFE needs standardized features because it ranks them by the size of the model coefficients, whereas for the SFS methods above we fitted on the unstandardized features.
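The elimination loop described in this section can be written out by hand. The following is a rough sketch on synthetic data, not scikit-learn's own code. The helper `manual_rfe` is hypothetical, and it assumes a binary target so that the logistic regression has one coefficient per feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, standardized so coefficients are comparable.
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

def manual_rfe(X, y, n_keep):
    """Refit, drop the feature with the smallest |coefficient|, and repeat."""
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, kept], y)
        importance = np.abs(model.coef_).ravel()
        kept.pop(int(np.argmin(importance)))  # remove the weakest feature
    return kept

kept_manual = manual_rfe(X_demo, y_demo, n_keep=3)

# The library version, for comparison.
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
kept_rfe = list(np.flatnonzero(rfe_demo.fit(X_demo, y_demo).support_))

print(kept_manual, kept_rfe)
```

Because each round compares coefficient magnitudes across features, the ranking is only meaningful when the features share a scale, which is why the data is standardized first.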
3.3 SelectKBest¶
We will be using chi2 and mutual_info_classif for this test.
The most common SelectKBest test methods:
- f_classif: ANOVA F-value between label/feature for classification tasks.
- mutual_info_classif: Mutual information for a discrete target.
- chi2: Chi-squared stats of non-negative features for classification tasks.
- f_regression: F-value between label/feature for regression tasks.
- mutual_info_regression: Mutual information for a continuous target.
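SelectKBest scores every feature with the chosen test and keeps the k highest-scoring ones. The sketch below reproduces that by calling `chi2` directly on made-up, non-negative toy features. The data and the names `X_toy`, `y_toy` and `top2_manual` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=300)

# Four non-negative toy features with decreasing relevance to the label.
X_toy = np.column_stack([
    y_toy,                                             # exact copy of the label
    np.where(rng.random(300) < 0.8, y_toy, 1 - y_toy), # agrees with the label ~80% of the time
    rng.integers(0, 2, size=300),                      # pure noise (binary)
    rng.integers(0, 4, size=300),                      # pure noise (0 to 3)
])

# Score each feature, then keep the indices of the two highest scores.
scores, p_values = chi2(X_toy, y_toy)
top2_manual = np.sort(np.argsort(scores)[-2:])

# The same selection through SelectKBest.
top2_skb = SelectKBest(score_func=chi2, k=2).fit(X_toy, y_toy).get_support(indices=True)

print(scores)
print(top2_manual, top2_skb)
```

Swapping `chi2` for another score function changes only how the per-feature scores are computed; the "keep the top k" step stays the same.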
SelectKBest Chi2¶
# chi2 only accepts non-negative feature values, so we use the unstandardized X
selection = SelectKBest(score_func = chi2, k = number_of_features)
# fit the data
selection.fit_transform(X, y)
# saving a copy of the chi2 selected features in a list
X_chi2 = X[X.columns[selection.get_support(indices=True)]]
X_chi2 = list(X_chi2)
X_chi2
SelectKBest mutual_info_classif¶
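# mutual_info_classif is randomized; to make it reproducible, pass
# score_func = functools.partial(mutual_info_classif, random_state=0)
selection = SelectKBest(score_func = mutual_info_classif, k = number_of_features)
# fit the data
selection.fit_transform(X, y)
# saving a copy of the mutual_info_classif selected features in a list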
X_mutual_info_classif = X[X.columns[selection.get_support(indices=True)]]
X_mutual_info_classif = list(X_mutual_info_classif)
X_mutual_info_classif
4. Final Logistic Regression¶
Logistic Regression with selected features.
4.1 Building Final Logistic Regression¶
So far we have five different lists of selected features. What we can do now is try each list and find the one with the highest scores.
In this case we will be using X_sffs.
- X_sffs
- X_sbfs
- X_rfe
- X_chi2
- X_mutual_info_classif
X = df[X_sffs]
y = df['NObeyesdad']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
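Trying each candidate list in turn can be automated with a small loop. This is a sketch on synthetic data: the frame `demo`, the column names `f0` to `f5`, and the subsets in `candidates` are hypothetical stand-ins for df and for lists such as X_sffs or X_rfe.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey DataFrame.
features, target = make_classification(n_samples=300, n_features=6,
                                       n_informative=3, n_redundant=0,
                                       random_state=0)
demo = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])
demo["target"] = target

# Hypothetical candidate subsets (stand-ins for X_sffs, X_rfe, ...).
candidates = {
    "subset_a": ["f0", "f1", "f2"],
    "subset_b": ["f3", "f4", "f5"],
    "all_features": [f"f{i}" for i in range(6)],
}

def evaluate(columns):
    """Split, scale, fit and score a logistic regression on the given columns."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[columns], demo["target"], test_size=0.30, random_state=0)
    scaler_demo = StandardScaler()
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler_demo.fit_transform(X_tr), y_tr)
    return accuracy_score(y_te, model.predict(scaler_demo.transform(X_te)))

results = pd.Series({name: evaluate(cols) for name, cols in candidates.items()})
results = results.sort_values(ascending=False)
print(results)
```

Using the same `random_state` for every split keeps the comparison between subsets on identical train and test rows.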
4.2 Model Evaluation¶
y_pred = lr.predict(x_test)
# print(classification_report(y_test,y_pred))
Accuracy2 = accuracy_score(y_test, y_pred)
Precision2 = precision_score(y_test, y_pred)
Recall2 = recall_score(y_test, y_pred)
F1_score2 = f1_score(y_test, y_pred)
pd.DataFrame([['accuracy', Accuracy1, Accuracy2, Accuracy2 - Accuracy1 ],
['precision', Precision1, Precision2, Precision2 - Precision1 ],
['recall', Recall1, Recall2, Recall2 - Recall1 ],
['f1_score', F1_score1, F1_score2, F1_score2 - F1_score1 ]],
columns= ['Score', 'Preliminary', 'Final', 'Difference'])
4.3 Confusion Matrix¶
model_matrix = confusion_matrix(y_test, y_pred)
model_matrix
# Visualize
fig, ax = plt.subplots(figsize=(8,5))
# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
4.4 Test Run¶
# calling our X_sffs subset and df['NObeyesdad'] for reference
# print(df[X_sffs].columns)
df[['Gender', 'Age', 'family_history_with_overweight', 'FAVC', 'CAEC',
'SCC', 'FAF', 'Walking', 'NObeyesdad']].head(2)
# feature values for a new respondent, alyx, in the same column order as above (without NObeyesdad)
alyx = [[0,21,1,0,1,0,0,0]]
alyx = scaler.transform(alyx)
alyx
lr.predict(alyx)
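One possible way to avoid scaling a new respondent by hand is to bundle the scaler and the model into a single scikit-learn pipeline. The sketch below only illustrates that design. The eight training rows and labels are made-up values that borrow two column names from the data dictionary, and `pipe` and `new_respondent` are illustrative names.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training rows (not taken from the survey).
train_toy = pd.DataFrame({
    "Age": [19, 23, 31, 40, 45, 52, 28, 36],
    "FAVC": [0, 0, 1, 1, 1, 1, 0, 1],
})
labels_toy = [0, 0, 1, 1, 1, 1, 0, 1]

# The pipeline applies the scaler and then the classifier in one call.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(train_toy, labels_toy)

# A new respondent is passed in unscaled, with the same column names.
new_respondent = pd.DataFrame([[21, 0]], columns=train_toy.columns)
pred = pipe.predict(new_respondent)
proba = pipe.predict_proba(new_respondent)

print(pred, proba)
```

`predict_proba` returns the estimated probability of each class, which can be more informative than the 0/1 label alone.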