Logistic Regression¶

Classification for obesity.

$obesity.jpg$

Data source: Obesity survey based on eating habits

In this project, we will analyze data from a survey conducted by Fabio Mendoza Palechor and Alexis de la Hoz Manotas that asked people about their eating habits and weight. The data was obtained from the UCI Machine Learning Repository. Categorical variables were converted to numerical ones to facilitate analysis.

First, we will fit a logistic regression model to try to predict whether survey respondents are obese based on their answers to questions in the survey. After that, we will use three different wrapper methods to choose a smaller feature subset.

We will use sequential forward floating selection, sequential backward floating selection, and recursive feature elimination. After implementing each wrapper method, we evaluate the model accuracy on the resulting smaller feature subset and compare it with the model accuracy using all available features.

Data Dictionary

The data set obesity contains 18 predictor variables. Here’s a brief description of them.

Gender is 1 if a respondent is male and 0 if a respondent is female.
Age is a respondent’s age in years.
family_history_with_overweight is 1 if a respondent has a family member who is or was overweight, 0 if not.
FAVC is 1 if a respondent eats high caloric food frequently, 0 if not.
FCVC is 1 if a respondent usually eats vegetables in their meals, 0 if not.
NCP represents how many main meals a respondent has daily (0 for 1-2 meals, 1 for 3 meals, and 2 for more than 3 meals).
CAEC represents how much food a respondent eats between meals on a scale of 0 to 3.
SMOKE is 1 if a respondent smokes, 0 if not.
CH2O represents how much water a respondent drinks on a scale of 0 to 2.
SCC is 1 if a respondent monitors their caloric intake, 0 if not.
FAF represents how much physical activity a respondent does on a scale of 0 to 3.
TUE represents how much time a respondent spends looking at devices with screens on a scale of 0 to 2.
CALC represents how often a respondent drinks alcohol on a scale of 0 to 3.
Automobile, Bike, Motorbike, Public_Transportation, and Walking indicate a respondent’s primary mode of transportation. Their primary mode of transportation is indicated by a 1 and the other columns will contain a 0.

The outcome variable, NObeyesdad, is 1 if a respondent is obese and 0 if not.

Content¶

1. Load and Import Libraries
2. Preliminary Logistic Regression
- 2.1 Building a Logistic Regression Model
- 2.2 Model Accuracy
3. Feature Selection
- 3.1 Wrapper Methods
- 3.2 Recursive Feature Elimination
- 3.3 SelectKBest
4. Final Logistic Regression
- 4.1 Building Final Logistic Regression Model
- 4.2 Final Model Evaluation
- 4.3 Confusion Matrix
- 4.4 Test Run

1. Load and Import Libraries¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, mutual_info_classif, SelectKBest, chi2
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("obesity.csv")
print(df.shape)
df.head(3)

(2111, 19)

2. Preliminary Logistic Regression¶

We will start by building a logistic regression model with all the available features and compare it later to our final logistic regression model with selected features.

2.1 Building a Logistic Regression Model¶

Split the data into `X` and `y`¶

X = df.iloc[:,:-1]
y = df['NObeyesdad']

Train Test split¶

Splitting our data into 70% training and 30% testing.

x_train, x_test, y_train, y_test = train_test_split(X, y,  test_size = 0.30, random_state = 0)
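As a quick check of the split sizes, the arithmetic below reproduces the row counts that appear as the support totals in the classification reports of section 2.2 (634 test rows and 1477 training rows). It is a small sketch that assumes a fractional test size is rounded up to a whole number of rows; the variable names are introduced here for illustration.

```python
import math

n_rows = 2111          # rows in obesity.csv, from df.shape
test_fraction = 0.30   # the test_size passed to train_test_split

# Assumption: a fractional test size is rounded up to a whole number of rows.
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```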

Normalize the Data¶

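Our features must be on the same scale, so we standardize them. The scaler is fit on the training set only and then applied to the test set.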

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
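To make the fit-on-train, transform-on-test pattern concrete, here is a minimal pure-Python sketch of standardization. It assumes the scaler subtracts the training mean and divides by the training population standard deviation; the helper names and the age values are hypothetical and are not taken from the survey data.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(values), pstdev(values)

def transform(values, mean, std):
    """Rescale values with statistics learned elsewhere (here: the training set)."""
    return [(v - mean) / std for v in values]

# Hypothetical ages, split into a training and a test portion.
train_age = [21.0, 23.0, 27.0, 33.0, 46.0]
test_age = [22.0, 40.0]

mean, std = fit_scaler(train_age)
train_scaled = transform(train_age, mean, std)
test_scaled = transform(test_age, mean, std)  # reuses the training mean and std

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print([round(v, 3) for v in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally do not, because they are rescaled with statistics they did not contribute to.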

Logistic Regression Model¶

Include the parameter max_iter=1000 to make sure that the model will converge when you try to fit it.

lr = LogisticRegression(max_iter=1000)

Fit the model¶

Use the .fit() method on lr to fit the model to x_train and y_train.

lr.fit(x_train, y_train)

LogisticRegression(max_iter=1000)


2.2 Model accuracy¶

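# Predictions for the x_test data
y_pred_test = lr.predict(x_test)

# Predictions for the x_train data
y_pred_train = lr.predict(x_train)

# Save the preliminary test-set scores for the comparison in section 4.2
Accuracy1  = accuracy_score(y_test, y_pred_test)
Precision1 = precision_score(y_test, y_pred_test)
Recall1    = recall_score(y_test, y_pred_test)
F1_score1  = f1_score(y_test, y_pred_test)

# true value vs prediction on testing set
print(classification_report(y_test,y_pred_test))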

              precision    recall  f1-score   support

           0       0.86      0.70      0.77       340
           1       0.71      0.87      0.78       294

    accuracy                           0.78       634
   macro avg       0.79      0.78      0.78       634
weighted avg       0.79      0.78      0.78       634

# true value vs prediction on training set
print(classification_report(y_train,y_pred_train))

              precision    recall  f1-score   support

           0       0.83      0.69      0.75       799
           1       0.69      0.84      0.76       678

    accuracy                           0.76      1477
   macro avg       0.76      0.76      0.76      1477
weighted avg       0.77      0.76      0.76      1477

3. Feature Selection¶

Choosing a smaller feature subset for the logistic regression model.

3.1 Wrapper Methods¶

Now that we’ve created a logistic regression model and evaluated its performance, we’re ready to do some feature selection.

# checking the proportion of obese and not obese
df['NObeyesdad'].value_counts() / len(df)

0    0.539555
1    0.460445
Name: NObeyesdad, dtype: float64

We have a balanced dataset, so we can safely use accuracy as our scoring method.

Note: for an imbalanced dataset we should use recall, precision, and F1 for scoring.

recall: the number of correct positive predictions out of all actual positives. Use it when we need to reduce false negatives, as in cancer detection, where we don't want the model to predict negative for a person who is actually positive.
precision: the number of correct positive predictions out of all positive predictions. Use it when we need to reduce false positives, as in spam detection, where we don't want to flag an important email as spam.
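A small worked example shows why accuracy alone can mislead on imbalanced data. The labels below are hypothetical (95 negatives, 5 positives), and the "model" simply predicts the majority class every time; the helper functions are written out by hand for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred):
    """Share of actual positives that were predicted positive."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos

# Hypothetical imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(acc, rec)
```

The accuracy is high even though not a single positive case is found, which is what recall exposes.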

Let’s define the variables for our feature selection methods.

number_of_features: the number of features to select
scoring_method:
- for classifiers {‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’}
- for regressors {‘mean_absolute_error’, ‘mean_squared_error’/‘neg_mean_squared_error’, ‘median_absolute_error’, ‘r2’} reference: https://scikit-learn.org/stable/modules/model_evaluation.html

Note: We’re fitting the original X and y variables here instead of the training and testing data sets.

# Input variable here:
number_of_features = 8
scoring_method = 'accuracy'

Sequential Forward Floating Selection¶

sffs = SFS(lr,
        k_features=number_of_features,
        forward= True,
        floating= True,
        scoring=scoring_method,
        cv=0)
sffs.fit(X, y)
print('SFFS selected features:')

# Saving a copy of sffs selected features in a list variable
X_sffs = list(sffs.subsets_[number_of_features]['feature_names'])

print(X_sffs)    
print(str(scoring_method) + ': ' +  str(sffs.subsets_[number_of_features]['avg_score']))

SFFS selected features:
['Gender', 'Age', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SCC', 'FAF', 'Walking']
accuracy: 0.7835149218379914
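For intuition, the forward step behind sequential forward selection can be sketched as a greedy loop. This is a simplified illustration, not mlxtend's implementation: it omits the floating step (which may drop a previously chosen feature when that improves the score), and it uses a made-up scoring function, toy_score, with hypothetical per-feature gains in place of model accuracy.

```python
def forward_select(features, score, k):
    """Greedy forward selection: grow the subset one feature at a time."""
    selected = []
    while len(selected) < k:
        candidates = [f for f in features if f not in selected]
        # Try adding each remaining feature and keep the best-scoring one.
        best = max(candidates, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Hypothetical scoring function standing in for model accuracy:
# each feature contributes a fixed, made-up amount.
toy_gain = {"Age": 0.30, "FAVC": 0.20, "SMOKE": 0.01, "CAEC": 0.25}

def toy_score(subset):
    return sum(toy_gain[f] for f in subset)

chosen = forward_select(list(toy_gain), toy_score, k=2)
print(chosen)
```

In the notebook, the role of toy_score is played by fitting lr on the candidate subset and computing scoring_method.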

Sequential Backward Floating Selection¶

sbfs = SFS(lr,
            k_features=1,
            forward= False,
            floating= True,
            scoring=scoring_method,
            cv=0)
sbfs.fit(X, y)
print('SBFS selected features:')

# Saving a copy of sbfs selected features in a list variable
X_sbfs = list(sbfs.subsets_[1]['feature_names'])

print(X_sbfs)
print(str(scoring_method) + ': ' +  str(sbfs.subsets_[1]['avg_score']))

SBFS selected features:
['family_history_with_overweight']
accuracy: 0.635243960208432

Visualize SFFS and SBFS¶

# Visualize sffs
# Visualize in DataFrame (Optional, Uncomment to view)
# pd.DataFrame.from_dict(sffs.get_metric_dict()).T
fig1 = plot_sfs(sffs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Floating Selection (w. StdDev)')
plt.grid()
plt.show()

#Visualize sbfs
fig2 = plot_sfs(sbfs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Backward Floating Selection (w. StdDev)')
plt.grid()
plt.show()

Based on both plots, eight features is the best choice.

3.2 RFE¶

Ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until a specified number of features remains.

Before applying recursive feature elimination, it is necessary to standardize the data.

# Store the standardized features in a new variable, X_standard, so that our
# original X features remain unaffected if we rerun this code
X_standard = StandardScaler().fit_transform(X)

rfe = RFE(lr, n_features_to_select=number_of_features)

rfe.fit(X_standard, y)

# Keep the names of the columns that RFE marked as selected
rfe_features = [f for (f, support) in zip(X.columns, rfe.support_) if support]
print('RFE selected features:')
X_rfe = rfe_features
print(X_rfe)
print(rfe.score(X_standard, y))

RFE selected features:
['Age', 'family_history_with_overweight', 'FAVC', 'FCVC', 'CAEC', 'SCC', 'Automobile', 'Walking']
0.7678825201326386
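The elimination loop that RFE performs can be sketched by hand. This is a minimal illustration on synthetic data from make_classification, not the survey data. It assumes a binary target and treats the absolute value of a feature's logistic regression coefficient as its importance; manual_rfe and the demo variables are names introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def manual_rfe(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        # For a binary problem coef_ has shape (1, n_remaining).
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        del remaining[weakest]
    return remaining

# Synthetic stand-in data (not the obesity survey).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
# Standardize first so coefficient magnitudes are comparable across features.
X_demo = StandardScaler().fit_transform(X_demo)

kept = manual_rfe(X_demo, y_demo, n_keep=3)
print(kept)
```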

RFE requires standardized features because it ranks features by the magnitude of their model coefficients, which depends on feature scale. The SFS methods above, by contrast, were fit on the unscaled features.

3.3 SelectKBest¶

We will be using chi2 and mutual_info_classif as score functions for this test.

The most common SelectKBest score functions:

f_classif: ANOVA F-value between label/feature for classification tasks.

mutual_info_classif: Mutual information for a discrete target.

chi2: Chi-squared stats of non-negative features for classification tasks.

f_regression: F-value between label/feature for regression tasks.

mutual_info_regression: Mutual information for a continuous target.
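To illustrate how a score function ranks features, here is a pure-Python sketch of a chi-squared score for non-negative features. It compares each feature's per-class sums with the sums expected if the feature were independent of the class. This is an illustrative reimplementation under that assumption, not scikit-learn's code, and the tiny dataset is made up.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature: observed vs expected per-class feature sums."""
    classes = sorted(set(y))
    n_rows, n_features = len(X), len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        total = sum(column)
        score = 0.0
        for c in classes:
            observed = sum(v for v, label in zip(column, y) if label == c)
            class_share = sum(label == c for label in y) / n_rows
            expected = class_share * total  # sum expected under independence
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Made-up data: feature 0 tracks the class exactly, feature 1 is constant.
X_toy = [[0, 1], [0, 1], [1, 1], [1, 1]]
y_toy = [0, 0, 1, 1]

scores = chi2_scores(X_toy, y_toy)
# Keep the k best features by score, which is the idea behind SelectKBest.
k = 1
top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
print(scores, top_k)
```

The constant feature scores zero because its per-class sums match the expected sums exactly, so it is the first to be left out.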

SelectKBest Chi2¶

# chi2 takes only the features and the target, so it can be passed directly.
# To pass extra arguments (e.g. random_state=0) to a score function such as
# mutual_info_classif, wrap it with functools.partial:
# score_function = partial(test_method, random_state=0)
selection = SelectKBest(score_func = chi2, k = number_of_features)

# fit the data
selection.fit_transform(X, y)

# saving a copy of the chi2 selected features in a list
X_chi2 = X[X.columns[selection.get_support(indices=True)]]
X_chi2 = list(X_chi2)
X_chi2

['Age',
 'family_history_with_overweight',
 'FAVC',
 'CAEC',
 'SCC',
 'FAF',
 'TUE',
 'Walking']

SelectKBest mutual_info_classif¶

selection = SelectKBest(score_func = mutual_info_classif, k = number_of_features)
 
# fit the data
selection.fit_transform(X, y)

# saving a copy of the mutual_info_classif selected features in a list
X_mutual_info_classif = X[X.columns[selection.get_support(indices=True)]]
X_mutual_info_classif = list(X_mutual_info_classif)
X_mutual_info_classif

['Age',
 'family_history_with_overweight',
 'FAVC',
 'NCP',
 'CAEC',
 'CH2O',
 'FAF',
 'TUE']

4. Final Logistic Regression¶

Logistic Regression with selected features.

4.1 Building Final Logistic Regression¶

So far we have five different lists of selected features. We can try each list and keep the one with the highest scores.

In this case we will use X_sffs.

X_sffs
X_sbfs
X_rfe
X_chi2
X_mutual_info_classif
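Trying each list can be automated with a small helper. The sketch below is one possible way to do it, assuming the same 70/30 split, scaling, and model settings used elsewhere in this notebook; score_subsets is a hypothetical helper, and the demo runs on synthetic data with made-up column names rather than on df.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def score_subsets(data, target, subsets):
    """Return the test accuracy of a logistic regression for each feature list."""
    results = {}
    for name, columns in subsets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            data[columns], data[target], test_size=0.30, random_state=0
        )
        scaler = StandardScaler()
        X_tr = scaler.fit_transform(X_tr)  # fit on the training rows only
        X_te = scaler.transform(X_te)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        results[name] = accuracy_score(y_te, model.predict(X_te))
    return results

# Synthetic stand-in for df; the column names f0..f5 and "label" are hypothetical.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(6)])
demo["label"] = y_demo

scores = score_subsets(
    demo,
    "label",
    {"first_three": ["f0", "f1", "f2"], "last_three": ["f3", "f4", "f5"]},
)
print(pd.Series(scores).sort_values(ascending=False))
```

With the notebook's variables, the call would look like score_subsets(df, 'NObeyesdad', {'sffs': X_sffs, 'sbfs': X_sbfs, 'rfe': X_rfe, 'chi2': X_chi2, 'mutual_info': X_mutual_info_classif}).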

X = df[X_sffs]
y = df['NObeyesdad']

x_train, x_test, y_train, y_test = train_test_split(X, y,  test_size = 0.30, random_state = 0)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

lr = LogisticRegression(max_iter=1000)

lr.fit(x_train, y_train)

LogisticRegression(max_iter=1000)


4.2 Model Evaluation¶

y_pred = lr.predict(x_test)

# print(classification_report(y_test,y_pred))

Accuracy2  = accuracy_score(y_test, y_pred)
Precision2 = precision_score(y_test, y_pred)
Recall2    = recall_score(y_test, y_pred)
F1_score2  = f1_score(y_test, y_pred)

pd.DataFrame([['accuracy',  Accuracy1,  Accuracy2,  Accuracy2  - Accuracy1  ],
              ['precision', Precision1, Precision2, Precision2 - Precision1 ],
              ['recall',    Recall1,    Recall2,    Recall2    - Recall1    ],  
              ['f1_score',  F1_score1,  F1_score2,  F1_score2  - F1_score1  ]],
            
              columns= ['Score', 'Preliminary', 'Final', 'Difference'])

4.3 Confusion Matrix¶

model_matrix = confusion_matrix(y_test, y_pred)
model_matrix

array([[232, 108],
       [ 20, 274]], dtype=int64)
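The scores from section 4.2 can be recomputed by hand from this matrix, which is laid out as [[TN, FP], [FN, TP]]. The short check below uses only the four counts shown above; the results agree with the Final column of the score comparison table.

```python
# Counts read off the confusion matrix: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 232, 108, 20, 274

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # correct positives / predicted positives
recall = tp / (tp + fn)                      # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
# prints: accuracy=0.798107 precision=0.717277 recall=0.931973 f1=0.810651
```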

# Visualize
fig, ax = plt.subplots(figsize=(8,5))

# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')

<AxesSubplot:>

4.4 Test Run¶

# calling our X_sffs subset and df['NObeyesdad'] for reference
# print(df[X_sffs].columns) 
df[['Gender', 'Age', 'family_history_with_overweight', 'FAVC', 'CAEC',
       'SCC', 'FAF', 'Walking', 'NObeyesdad']].head(2)

# setting variable for alyx
alyx = [[0,21,1,0,1,0,0,0]]
alyx = scaler.transform(alyx)
alyx

lr.predict(alyx)

array([0], dtype=int64)

Output of the score comparison in section 4.2:

       Score  Preliminary     Final  Difference
0   accuracy     0.777603  0.798107    0.020505
1  precision     0.714286  0.717277    0.002992
2     recall     0.867347  0.931973    0.064626
3   f1_score     0.783410  0.810651    0.027241

Output of df.head(3) in section 1:

   Gender   Age  family_history_with_overweight  FCVC  NCP  CAEC  SMOKE  CH2O  SCC  FAF  TUE  CALC  Public_Transportation
0       0  21.0                               1   2.0  3.0     1      0   2.0    0  0.0  1.0     0                      1
1       0  21.0                               1   3.0  3.0     1      1   3.0    1  3.0  0.0     1                      1
2       1  23.0                               1   2.0  3.0     1      0   2.0    0  2.0  1.0     2                      1

Output of the reference rows in section 4.4:

   Gender   Age  family_history_with_overweight  FAVC  CAEC  SCC  FAF  Walking  NObeyesdad
0       0  21.0                               1     0     1    0  0.0        0           0
1       0  21.0                               1     0     1    1  3.0        0           0