Logistic Regression

Classification for obesity.

In this project, we will analyze data from a survey conducted by Fabio Mendoza Palechor and Alexis de la Hoz Manotas that asked people about their eating habits and weight. The data was obtained from the [UCI Machine Learning Repository]. Categorical variables were converted to numerical ones to facilitate analysis.

First, we will fit a logistic regression model to predict whether survey respondents are obese based on their answers to the survey questions. After that, we will use three different wrapper methods to choose a smaller feature subset.

We will be using sequential forward floating selection, sequential backward floating selection, and recursive feature elimination. After implementing each wrapper method, we evaluate the model accuracy on the resulting smaller feature subset and compare it with the model accuracy using all available features.

Data Dictionary

The data set obesity contains 18 predictor variables. Here’s a brief description of them.

  • Gender is 1 if a respondent is male and 0 if a respondent is female.
  • Age is a respondent’s age in years.
  • family_history_with_overweight is 1 if a respondent has a family member who is or was overweight, 0 if not.
  • FAVC is 1 if a respondent eats high caloric food frequently, 0 if not.
  • FCVC is 1 if a respondent usually eats vegetables in their meals, 0 if not.
  • NCP represents how many main meals a respondent has daily (0 for 1-2 meals, 1 for 3 meals, and 2 for more than 3 meals).
  • CAEC represents how much food a respondent eats between meals on a scale of 0 to 3.
  • SMOKE is 1 if a respondent smokes, 0 if not.
  • CH2O represents how much water a respondent drinks on a scale of 0 to 2.
  • SCC is 1 if a respondent monitors their caloric intake, 0 if not.
  • FAF represents how much physical activity a respondent does on a scale of 0 to 3.
  • TUE represents how much time a respondent spends looking at devices with screens on a scale of 0 to 2.
  • CALC represents how often a respondent drinks alcohol on a scale of 0 to 3.
  • Automobile, Bike, Motorbike, Public_Transportation, and Walking indicate a respondent’s primary mode of transportation. Their primary mode of transportation is indicated by a 1 and the other columns will contain a 0.

The outcome variable, NObeyesdad, is 1 if a respondent is obese and 0 if not.
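
The five transportation columns follow a one-hot layout: one indicator column per category, with exactly one 1 per row. As a minimal sketch of how such columns can be produced, assuming a hypothetical raw column named MTRANS and made-up values (the actual preprocessing used for obesity.csv is not shown in this notebook):

```python
import pandas as pd

# Hypothetical raw survey column; the name MTRANS and the values are assumptions.
raw = pd.DataFrame({"MTRANS": ["Walking", "Bike", "Automobile", "Walking"]})

# One indicator column per category; exactly one column is 1 in each row.
encoded = pd.get_dummies(raw["MTRANS"]).astype(int)
print(encoded)
```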

Content

  1. Load and Import Libraries
  2. Preliminary Logistic Regression
    • 2.1 Building a Logistic Regression Model
    • 2.2 Model Evaluation
  3. Feature Selection
    • 3.1 Wrapper Methods
    • 3.2 Recursive Feature Elimination
    • 3.3 SelectKBest
  4. Final Logistic Regression
    • 4.1 Building Final Logistic Regression Model
    • 4.2 Final Model Evaluation
    • 4.3 Confusion Matrix
    • 4.4 Test Run

1. Load and Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, mutual_info_classif, SelectKBest, chi2
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import warnings
warnings.filterwarnings('ignore')
In [4]:
df = pd.read_csv("obesity.csv")
print(df.shape)
df.head(3)
(2111, 19)
Out[4]:
Gender Age family_history_with_overweight FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC Automobile Bike Motorbike Public_Transportation Walking NObeyesdad
0 0 21.0 1 0 2.0 3.0 1 0 2.0 0 0.0 1.0 0 0 0 0 1 0 0
1 0 21.0 1 0 3.0 3.0 1 1 3.0 1 3.0 0.0 1 0 0 0 1 0 0
2 1 23.0 1 0 2.0 3.0 1 0 2.0 0 2.0 1.0 2 0 0 0 1 0 0

2. Preliminary Logistic Regression

We will start by building a logistic regression model with all the available features, and later compare it to our final logistic regression model built on the selected features.

2.1 Building a Logistic Regression Model

Split the data into X and y
In [4]:
X = df.iloc[:,:-1]
y = df['NObeyesdad']
Train Test split

We split the data into 70% for training and 30% for testing.

In [5]:
x_train, x_test, y_train, y_test = train_test_split(X, y,  test_size = 0.30, random_state = 0)
Normalize the Data

Our features must be on the same scale. The scaler is fitted on the training set only and then applied to the test set.

In [6]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
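
To make the scaling step concrete, here is a small sketch on made-up numbers (the arrays below are illustrative, not taken from obesity.csv). It checks that StandardScaler computes z = (x - mean) / std per column, and that the test row is scaled with the mean and standard deviation learned from the training rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): an age in years and a 0-3 activity score.
train = np.array([[20.0, 0.0], [30.0, 1.0], [40.0, 3.0], [50.0, 2.0]])
test = np.array([[35.0, 1.5]])

# fit() learns the per-column mean and standard deviation from the training rows only.
demo_scaler = StandardScaler().fit(train)

# The same computation by hand, reusing the training statistics for the test row.
manual_train = (train - train.mean(axis=0)) / train.std(axis=0)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(demo_scaler.transform(train), manual_train))
print(np.allclose(demo_scaler.transform(test), manual_test))
```
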
Logistic Regression Model

Include the parameter max_iter=1000 to make sure that the model will converge when you try to fit it.

In [7]:
lr = LogisticRegression(max_iter=1000)
Fit the model

Use the .fit() method on lr to fit the model to the training data, x_train and y_train.

In [8]:
lr.fit(x_train, y_train)
Out[8]:
LogisticRegression(max_iter=1000)

2.2 Model Evaluation

In [22]:
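# Predictions for the test set
y_pred_test = lr.predict(x_test)
# Predictions for the training set
y_pred_train = lr.predict(x_train)

# Preliminary test-set scores, used later in the comparison table of section 4.2
Accuracy1  = accuracy_score(y_test, y_pred_test)
Precision1 = precision_score(y_test, y_pred_test)
Recall1    = recall_score(y_test, y_pred_test)
F1_score1  = f1_score(y_test, y_pred_test)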
In [23]:
# true value vs prediction on testing set
print(classification_report(y_test,y_pred_test))
              precision    recall  f1-score   support

           0       0.86      0.70      0.77       340
           1       0.71      0.87      0.78       294

    accuracy                           0.78       634
   macro avg       0.79      0.78      0.78       634
weighted avg       0.79      0.78      0.78       634

In [24]:
# true value vs prediction on training set
print(classification_report(y_train,y_pred_train))
              precision    recall  f1-score   support

           0       0.83      0.69      0.75       799
           1       0.69      0.84      0.76       678

    accuracy                           0.76      1477
   macro avg       0.76      0.76      0.76      1477
weighted avg       0.77      0.76      0.76      1477

3. Feature Selection

Evaluating a Logistic Regression Model

3.1 Wrapper Methods

Now that we’ve created a logistic regression model and evaluated its performance, we’re ready to do some feature selection.

In [60]:
# checking the proportion of obese and not obese
df['NObeyesdad'].value_counts() / len(df)
Out[60]:
0    0.539555
1    0.460445
Name: NObeyesdad, dtype: float64

We have a balanced dataset, so we can safely use accuracy as our scoring method.

Note: for an imbalanced dataset, recall, precision, and F1 are more informative scoring methods than accuracy.

  • recall: the number of correct positive predictions out of all actual positives. Use it when false negatives must be reduced, as in cancer detection, where we don't want the model to predict negative for a person who is actually positive.

  • precision: the number of actual positives out of all positive predictions. Use it when false positives must be reduced, as in spam detection, where we don't want an important email to be predicted as spam.
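
As a small illustration of why accuracy can mislead on imbalanced data, consider made-up labels with 95 negatives and 5 positives and a model that always predicts the majority class. The labels below are invented for this sketch and are unrelated to the obesity data:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A useless model that always predicts the majority class.
y_always_negative = [0] * 100

acc = accuracy_score(y_true, y_always_negative)
rec = recall_score(y_true, y_always_negative)
# zero_division=0 avoids a warning, since the model makes no positive predictions.
prec = precision_score(y_true, y_always_negative, zero_division=0)

print(acc, rec, prec)
```

Accuracy looks high here even though the model never finds a single positive case, which is exactly what recall exposes.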

Let’s define the variables for our feature selection methods.

  1. number_of_features: the number of features to select
  2. scoring_method: the metric used to score each feature subset

Note: We’re fitting the original X and y variables here instead of the training and testing data sets.

In [11]:
# Input variable here:
number_of_features = 8
scoring_method = 'accuracy'
Sequential Forward Floating Selection
In [12]:
sffs = SFS(lr,
        k_features=number_of_features,
        forward= True,
        floating= True,
        scoring=scoring_method,
        cv=0)
sffs.fit(X, y)
print('SFFS selected features:')

# Saving a copy of sffs selected features in a list variable
X_sffs = list(sffs.subsets_[number_of_features]['feature_names'])

print(X_sffs)    
print(str(scoring_method) + ': ' +  str(sffs.subsets_[number_of_features]['avg_score']))
SFFS selected features:
['Gender', 'Age', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SCC', 'FAF', 'Walking']
accuracy: 0.7835149218379914
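
To show the idea behind sequential forward selection, here is a simplified sketch written by hand on synthetic data. It is not mlxtend's implementation: it omits the floating step (which can also drop a previously added feature), and it scores candidates with 5-fold cross-validation, whereas the cell above passes cv=0, which in mlxtend scores on the data the model was fitted on. The helper name forward_select is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)

def forward_select(X, y, k, estimator):
    """Greedily add the feature that most improves the cross-validated accuracy."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k:
        scores = {}
        for f in remaining:
            cols = selected + [f]
            scores[f] = cross_val_score(estimator, X[:, cols], y,
                                        cv=5, scoring="accuracy").mean()
        best = max(scores, key=scores.get)   # candidate with the highest mean score
        selected.append(best)
        remaining.remove(best)
    return selected

chosen = forward_select(X_demo, y_demo, k=3,
                        estimator=LogisticRegression(max_iter=1000))
print(chosen)
```
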
Sequential Backward Floating Selection
In [13]:
sbfs = SFS(lr,
            k_features=1,
            forward= False,
            floating= True,
            scoring=scoring_method,
            cv=0)
sbfs.fit(X, y)
print('SBFS selected features:')

# Saving a copy of sbfs selected features in a list variable
X_sbfs = list(sbfs.subsets_[1]['feature_names'])

print(X_sbfs)
print(str(scoring_method) + ': ' +  str(sbfs.subsets_[1]['avg_score']))
SBFS selected features:
['family_history_with_overweight']
accuracy: 0.635243960208432

Visualize SFFS and SBFS

In [14]:
# Visualize sffs
# Visualize in DataFrame (Optional, Uncomment to view)
# pd.DataFrame.from_dict(sffs.get_metric_dict()).T
fig1 = plot_sfs(sffs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Floating Selection (w. StdDev)')
plt.grid()
plt.show()

#Visualize sbfs
fig2 = plot_sfs(sbfs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Backward Floating Selection (w. StdDev)')
plt.grid()
plt.show()

Based on both graphs, eight features is the best choice.

3.2 RFE

RFE ranks features by importance, discards the least important features, and re-fits the model. This process is repeated until a specified number of features remains.

Before applying recursive feature elimination, it is necessary to standardize the data.

In [15]:
# Setting a variable 'X_standard' so that our original X features will remain unaffected
# if we rerun this code
X_standard = StandardScaler().fit_transform(X)
    
rfe = RFE(lr, n_features_to_select=number_of_features)
    
rfe.fit(X_standard,y)
    
rfe_features = [f for (f, support) in zip(df.iloc[:,:-1], rfe.support_) if support]
print('RFE selected features:')
X_rfe = rfe_features
print(X_rfe)
print(rfe.score(X_standard,y))
RFE selected features:
['Age', 'family_history_with_overweight', 'FAVC', 'FCVC', 'CAEC', 'SCC', 'Automobile', 'Walking']
0.7678825201326386

Unlike the SFS methods, which we fit on the unnormalized features, RFE requires standardized features, because it ranks features by the size of their coefficients.
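
The elimination loop can be sketched by hand as follows. This is a simplified illustration on synthetic data, not scikit-learn's internal code; the helper name manual_rfe is hypothetical. It drops the feature with the smallest absolute coefficient and re-fits, which is only a fair comparison between features when they share a scale. On this toy data the result is expected to agree with RFE, since both remove one lowest-ranked feature per round.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only), standardized first.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

def manual_rfe(X, y, n_keep):
    """Repeatedly drop the feature with the smallest absolute coefficient."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, keep], y)
        importance = np.abs(model.coef_).ravel()   # one coefficient per kept feature
        keep.pop(int(np.argmin(importance)))       # discard the least important one
    return keep

manual = manual_rfe(X_demo, y_demo, n_keep=3)
reference = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X_demo, y_demo)

print(manual)
print(list(np.flatnonzero(reference.support_)))
```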

3.3 SelectKBest

We will be using chi2 and mutual_info_classif for this test.

The most common SelectKBest score functions are:

  • f_classif: ANOVA F-value between label/feature for classification tasks.
  • mutual_info_classif: Mutual information for a discrete target.
  • chi2: Chi-squared stats of non-negative features for classification tasks.
  • f_regression: F-value between label/feature for regression tasks.
  • mutual_info_regression: Mutual information for a continuous target.

SelectKBest Chi2

In [16]:
# chi2 takes only the features and target as inputs, so it can be passed directly.
# A score function that needs additional arguments (such as random_state=0)
# can be wrapped with partial(), e.g. partial(test_method, random_state=0)
selection = SelectKBest(score_func = chi2, k = number_of_features)
 
# fit the data
selection.fit_transform(X, y)

# saving a copy of chi2 selected features in a list
X_chi2 = X[X.columns[selection.get_support(indices=True)]]
X_chi2 = list(X_chi2)
X_chi2
Out[16]:
['Age',
 'family_history_with_overweight',
 'FAVC',
 'CAEC',
 'SCC',
 'FAF',
 'TUE',
 'Walking']

SelectKBest mutual_info_classif

In [17]:
selection = SelectKBest(score_func = mutual_info_classif, k = number_of_features)
 
# fit the data
selection.fit_transform(X, y)

# saving a copy of mutual_info_classif selected features in a list
X_mutual_info_classif = X[X.columns[selection.get_support(indices=True)]]
X_mutual_info_classif = list(X_mutual_info_classif)
X_mutual_info_classif
Out[17]:
['Age',
 'family_history_with_overweight',
 'FAVC',
 'NCP',
 'CAEC',
 'CH2O',
 'FAF',
 'TUE']
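
mutual_info_classif relies on a randomized estimate, so its scores can vary slightly between runs unless random_state is fixed. A minimal sketch of fixing it with functools.partial, using synthetic data as a stand-in (the data and variable names below are assumptions for illustration):

```python
from functools import partial
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)

# Wrap the score function so that SelectKBest calls it with a fixed random_state.
score_func = partial(mutual_info_classif, random_state=0)

first = SelectKBest(score_func=score_func, k=3).fit(X_demo, y_demo)
second = SelectKBest(score_func=score_func, k=3).fit(X_demo, y_demo)

print(first.get_support(indices=True))
print((first.scores_ == second.scores_).all())   # identical scores on both runs
```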

4. Final Logistic Regression

Logistic Regression with selected features.

4.1 Building Final Logistic Regression

So far we have five different lists of selected features. We can try each list and keep the one with the highest scores.

In this case we will use X_sffs.

  1. X_sffs
  2. X_sbfs
  3. X_rfe
  4. X_chi2
  5. X_mutual_info_classif
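
Trying each list can be automated with a small loop. The sketch below uses synthetic data and hypothetical subsets (subset_a, subset_b, all) in place of the five lists above, and a hypothetical helper holdout_accuracy that repeats the split, scale, and fit steps of this notebook. Note that choosing a subset by its test-set score makes that score somewhat optimistic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the obesity data; column names f0..f5 are hypothetical.
X_arr, y_arr = make_classification(n_samples=400, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)
demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
demo["target"] = y_arr

# Hypothetical candidate subsets, playing the role of X_sffs, X_rfe, and so on.
candidates = {
    "subset_a": ["f0", "f1", "f2"],
    "subset_b": ["f3", "f4", "f5"],
    "all": [f"f{i}" for i in range(6)],
}

def holdout_accuracy(frame, features, target="target"):
    """Split 70/30, scale, fit logistic regression, and return the test accuracy."""
    x_tr, x_te, y_tr, y_te = train_test_split(frame[features], frame[target],
                                              test_size=0.30, random_state=0)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return model.fit(x_tr, y_tr).score(x_te, y_te)

results = {name: holdout_accuracy(demo, cols) for name, cols in candidates.items()}
best_name = max(results, key=results.get)
print(results)
print(best_name)
```
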
In [43]:
X = df[X_sffs]
y = df['NObeyesdad']
In [44]:
x_train, x_test, y_train, y_test = train_test_split(X, y,  test_size = 0.30, random_state = 0)
In [45]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
In [46]:
lr = LogisticRegression(max_iter=1000)
In [47]:
lr.fit(x_train, y_train)
Out[47]:
LogisticRegression(max_iter=1000)

4.2 Model Evaluation

In [48]:
y_pred = lr.predict(x_test)
In [51]:
# print(classification_report(y_test,y_pred))
In [50]:
Accuracy2  = accuracy_score(y_test, y_pred)
Precision2 = precision_score(y_test, y_pred)
Recall2    = recall_score(y_test, y_pred)
F1_score2  = f1_score(y_test, y_pred)
In [25]:
pd.DataFrame([['accuracy',  Accuracy1,  Accuracy2,  Accuracy2  - Accuracy1  ],
              ['precision', Precision1, Precision2, Precision2 - Precision1 ],
              ['recall',    Recall1,    Recall2,    Recall2    - Recall1    ],  
              ['f1_score',  F1_score1,  F1_score2,  F1_score2  - F1_score1  ]],
            
              columns= ['Score', 'Preliminary', 'Final', 'Difference'])
Out[25]:
Score Preliminary Final Difference
0 accuracy 0.777603 0.798107 0.020505
1 precision 0.714286 0.717277 0.002992
2 recall 0.867347 0.931973 0.064626
3 f1_score 0.783410 0.810651 0.027241

4.3 Confusion Matrix

In [26]:
model_matrix = confusion_matrix(y_test, y_pred)
model_matrix
Out[26]:
array([[232, 108],
       [ 20, 274]], dtype=int64)
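
The scores in the Final column of the table in section 4.2 can be recomputed directly from this matrix. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so the four cells read as true negatives, false positives, false negatives, and true positives:

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predicted.
cm = np.array([[232, 108],
               [20, 274]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # correct predictions out of all predictions
precision = tp / (tp + fp)               # actual positives out of predicted positives
recall = tp / (tp + fn)                  # predicted positives out of actual positives
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
```
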
In [27]:
# Visualize
fig, ax = plt.subplots(figsize=(8,5))

# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
Out[27]:
<AxesSubplot:>

4.4 Test Run

In [28]:
# calling our X_sffs subset and df['NObeyesdad'] for reference
# print(df[X_sffs].columns) 
df[['Gender', 'Age', 'family_history_with_overweight', 'FAVC', 'CAEC',
       'SCC', 'FAF', 'Walking', 'NObeyesdad']].head(2)
Out[28]:
Gender Age family_history_with_overweight FAVC CAEC SCC FAF Walking NObeyesdad
0 0 21.0 1 0 1 0 0.0 0 0
1 0 21.0 1 0 1 1 3.0 0 0
In [29]:
# a hypothetical respondent, alyx, with the same feature values as row 0 above
alyx = [[0,21,1,0,1,0,0,0]]
alyx = scaler.transform(alyx)
alyx

lr.predict(alyx)
Out[29]:
array([0], dtype=int64)
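
One way to avoid scaling new rows by hand is to bundle the scaler and the model in a single pipeline, so that predict applies both steps. This is a sketch of an alternative design on synthetic data, not what the notebook above does; the column names and the new row are made up for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column names f0..f3 are hypothetical.
cols = ["f0", "f1", "f2", "f3"]
X_arr, y_arr = make_classification(n_samples=300, n_features=4, n_informative=3,
                                   n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=cols)

# The pipeline standardizes the inputs and then fits the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_arr)

# A new, unscaled row: the pipeline scales it before predicting.
new_row = pd.DataFrame([[0.5, -1.0, 0.2, 1.3]], columns=cols)
label = model.predict(new_row)
proba = model.predict_proba(new_row)   # class probabilities, one row per input

print(label, proba)
```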
