# Logistic Regression

Classification for obesity.

Data source: Obesity survey based on eating habits

In this project, we will analyze data from a survey conducted by Fabio Mendoza Palechor and Alexis de la Hoz Manotas that asked people about their eating habits and weight. The data was obtained from the UCI Machine Learning Repository. Categorical variables were converted to numerical ones to facilitate analysis.

First, we will fit a logistic regression model to predict whether survey respondents are obese based on their answers to the survey questions. After that, we will use three different wrapper methods to choose a smaller feature subset.

We will be using `sequential forward floating selection`, `sequential backward floating selection`, and `recursive feature elimination`. After implementing each wrapper method, we evaluate the model accuracy on the resulting smaller feature subset and compare it with the model accuracy using all available features.

`Data Dictionary`

The data set `obesity` contains 18 predictor variables. Here’s a brief description of them.

- `Gender` is `1` if a respondent is male and `0` if a respondent is female.
- `Age` is a respondent’s age in years.
- `family_history_with_overweight` is `1` if a respondent has a family member who is or was overweight, `0` if not.
- `FAVC` is `1` if a respondent eats high caloric food frequently, `0` if not.
- `FCVC` is `1` if a respondent usually eats vegetables in their meals, `0` if not.
- `NCP` represents how many main meals a respondent has daily (`0` for 1-2 meals, `1` for 3 meals, and `2` for more than 3 meals).
- `CAEC` represents how much food a respondent eats between meals on a scale of `0` to `3`.
- `SMOKE` is `1` if a respondent smokes, `0` if not.
- `CH2O` represents how much water a respondent drinks on a scale of `0` to `2`.
- `SCC` is `1` if a respondent monitors their caloric intake, `0` if not.
- `FAF` represents how much physical activity a respondent does on a scale of `0` to `3`.
- `TUE` represents how much time a respondent spends looking at devices with screens on a scale of `0` to `2`.
- `CALC` represents how often a respondent drinks alcohol on a scale of `0` to `3`.
- `Automobile`, `Bike`, `Motorbike`, `Public_Transportation`, and `Walking` indicate a respondent’s primary mode of transportation. The primary mode is marked with a `1` and the other four columns contain a `0`.

The outcome variable, `NObeyesdad`, is `1` if a respondent is obese and `0` if not.
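The one-hot layout of the transportation columns described above can be sanity-checked in a few lines. The sketch below uses a small hand-made DataFrame (hypothetical rows, not taken from the survey) and assumes only the five column names from the data dictionary.

```python
import pandas as pd

# Hypothetical rows for illustration only; these are not survey records.
sample = pd.DataFrame({
    "Automobile": [1, 0, 0],
    "Bike": [0, 0, 0],
    "Motorbike": [0, 0, 0],
    "Public_Transportation": [0, 1, 0],
    "Walking": [0, 0, 1],
})

transport_cols = ["Automobile", "Bike", "Motorbike", "Public_Transportation", "Walking"]

# Each respondent should have exactly one primary mode of transportation.
modes_per_row = sample[transport_cols].sum(axis=1)
rows_ok = bool((modes_per_row == 1).all())

# Recover the name of the primary mode from the one-hot columns.
primary_mode = sample[transport_cols].idxmax(axis=1)
print(rows_ok)
print(primary_mode.tolist())
```

The same two checks could be run on `df` after loading `obesity.csv` to confirm the encoding before modelling.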

# Content

1. Load and Import Libraries
2. Preliminary Logistic Regression
   - 2.1 Building a Logistic Regression Model
   - 2.2 Model Evaluation
3. Feature Selection
   - 3.1 Wrapper Methods
   - 3.2 Recursive Feature Elimination
   - 3.3 SelectKBest
4. Final Logistic Regression
   - 4.1 Building Final Logistic Regression Model
   - 4.2 Final Model Evaluation
   - 4.3 Confusion Matrix
   - 4.4 Test Run

# 1. Load and Import Libraries

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, mutual_info_classif, SelectKBest, chi2
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import warnings
warnings.filterwarnings('ignore')
```

```
df = pd.read_csv("obesity.csv")
print(df.shape)
df.head(3)
```

# 2. Preliminary Logistic Regression

We will start by building a logistic regression model with all the available features and compare it later to our final logistic model built on the selected features.

### 2.1 Building a Logistic Regression Model

##### Split the data into `X` and `y`

```
X = df.iloc[:,:-1]
y = df['NObeyesdad']
```

##### Train Test split

Splitting our data into 70% training and 30% testing.

```
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
```
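As a small illustration of what the split above does, the sketch below runs `train_test_split` on a toy array (made up for this example) and checks two things: the 70/30 sizes, and that a fixed `random_state` reproduces the same partition on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 columns, alternating 0/1 labels (illustrative only).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Two calls with the same random_state give the same partition.
a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, b_test))
```

Fixing `random_state` matters here because the preliminary and final models are later compared on their test scores, so both should see the same rows in the test set.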

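##### Normalize the Data

Our features must be on the same scale, so we standardize them to zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set.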

```
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
```
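To make the scaling step concrete, here is a minimal sketch on synthetic data (an age-like column and a 0/1 column, both invented for illustration). It shows that `fit_transform` learns the mean and standard deviation from the training rows, and that `transform` reuses those training statistics on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for two features on very different scales.
train = np.column_stack([rng.uniform(14, 61, 200), rng.integers(0, 2, 200)]).astype(float)
test = np.column_stack([rng.uniform(14, 61, 50), rng.integers(0, 2, 50)]).astype(float)

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

# Equivalent manual computation: (x - training mean) / training standard deviation
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(np.allclose(test_scaled, manual))
```

Fitting the scaler on the training rows only keeps information about the test set out of the model, which is why the notebook calls `fit_transform` on `x_train` but only `transform` on `x_test`.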

##### Logistic Regression Model

Include the parameter `max_iter=1000` to make sure that the model converges when we fit it.

```
lr = LogisticRegression(max_iter=1000)
```

##### Fit the model

Use the `.fit()` method on `lr` to fit the model to `x_train` and `y_train`.

```
lr.fit(x_train, y_train)
```

### 2.2 Model accuracy

```
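# Predictions for the test and training data
y_pred_test = lr.predict(x_test)
y_pred_train = lr.predict(x_train)
# Preliminary test scores, reused in the comparison table of section 4.2
Accuracy1 = accuracy_score(y_test, y_pred_test)
Precision1 = precision_score(y_test, y_pred_test)
Recall1 = recall_score(y_test, y_pred_test)
F1_score1 = f1_score(y_test, y_pred_test)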
```

```
# true value vs prediction on testing set
print(classification_report(y_test,y_pred_test))
```

```
# true value vs prediction on training set
print(classification_report(y_train,y_pred_train))
```

```
# checking the proportion of obese and not obese
df['NObeyesdad'].value_counts() / len(df)
```

We have a balanced dataset, so we can safely use `accuracy` as our scoring method.

Note: for an imbalanced dataset, recall, precision, and F1 are more informative scoring methods than accuracy.

- `recall`: the number of correct positive predictions out of all `Actual Positive` cases. Use it when we need to `reduce the False Negatives`, as in cancer detection, where we don't want the model to predict a person to be negative when the truth is positive.
- `precision`: the number of correct positive predictions out of all `Positive predictions`. Use it when we need to `reduce the False Positives`, as in spam detection, where we don't want to flag an important mail as spam.
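The definitions above can be checked by hand on a tiny example. The labels below are invented for illustration; the sketch counts true positives, false positives, and false negatives directly and compares the resulting ratios with the scikit-learn metric functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # predicted positive, actually positive
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # predicted positive, actually negative
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # predicted negative, actually positive
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # predicted negative, actually negative

recall = tp / (tp + fn)        # share of actual positives that were found
precision = tp / (tp + fp)     # share of positive predictions that were right
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(tp, fp, fn, tn)
print(recall, precision, round(f1, 4), accuracy)
print(np.isclose(recall, recall_score(y_true, y_hat)),
      np.isclose(precision, precision_score(y_true, y_hat)))
```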

# 3. Feature Selection

### 3.1 Wrapper Methods

Let’s define the variables used by our feature selection methods.

- `number_of_features`: the number of features to select.
- `scoring_method`: the metric used to score candidate feature subsets.
  - for classifiers: {'accuracy', 'f1', 'precision', 'recall', 'roc_auc'}
  - for regressors: {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'}

Reference: https://scikit-learn.org/stable/modules/model_evaluation.html

Note: We’re fitting on the original `X` and `y` variables here instead of the training and testing data sets.

```
# Input variable here:
number_of_features = 8
scoring_method = 'accuracy'
```

##### Sequential Forward Floating Selection

```
sffs = SFS(lr,
           k_features=number_of_features,
           forward=True,
           floating=True,
           scoring=scoring_method,
           cv=0)
sffs.fit(X, y)
print('SFFS selected features:')
# Saving a copy of sffs selected features in a list variable
X_sffs = list(sffs.subsets_[number_of_features]['feature_names'])
print(X_sffs)
print(str(scoring_method) + ': ' + str(sffs.subsets_[number_of_features]['avg_score']))
```
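To show the idea behind a wrapper method, here is a minimal sketch of plain (non-floating) greedy forward selection written with scikit-learn only. It is not mlxtend's implementation: the helper name `forward_select` is hypothetical, the data is synthetic, and candidates are scored by training accuracy, which mirrors the `cv=0` setting used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def forward_select(model, X, y, k):
    """Greedy forward selection: add the feature that most improves the score."""
    selected = []
    remaining = list(range(X.shape[1]))
    best_score = -1.0
    while len(selected) < k:
        best_score, best_j = -1.0, None
        for j in remaining:
            cols = selected + [j]
            # Score the candidate subset on the data it was fitted on (no CV).
            score = model.fit(X[:, cols], y).score(X[:, cols], y)
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected, best_score


# Synthetic classification data, used only to exercise the helper.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0)

chosen, chosen_score = forward_select(LogisticRegression(max_iter=1000), X_demo, y_demo, k=3)
print(chosen, round(chosen_score, 3))
```

The floating variant used in this notebook adds one more step after each inclusion: it tries removing previously selected features and keeps the removal if the score improves.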

##### Sequential Backward Floating Selection

```
sbfs = SFS(lr,
           k_features=1,
           forward=False,
           floating=True,
           scoring=scoring_method,
           cv=0)
sbfs.fit(X, y)
print('SBFS selected features:')
# k_features=1 runs the elimination all the way down, so every subset size is recorded;
# saving a copy of the subset with number_of_features features in a list variable
X_sbfs = list(sbfs.subsets_[number_of_features]['feature_names'])
print(X_sbfs)
print(str(scoring_method) + ': ' + str(sbfs.subsets_[number_of_features]['avg_score']))
```

#### Visualize SFFS and SBFS

```
# Visualize sffs
# Visualize in DataFrame (Optional, Uncomment to view)
# pd.DataFrame.from_dict(sffs.get_metric_dict()).T
fig1 = plot_sfs(sffs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Floating Selection (w. StdDev)')
plt.grid()
plt.show()
#Visualize sbfs
fig2 = plot_sfs(sbfs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Backward Floating Selection (w. StdDev)')
plt.grid()
plt.show()
```

Based on both graphs, eight features is the best choice.

### 3.2 RFE

Recursive feature elimination ranks the features by importance, discards the least important one, and re-fits the model. This process is repeated until the specified number of features remains.

Before applying recursive feature elimination, it is necessary to standardize the data.

```
# Setting a variable 'X_standard' so that our original X features remain unaffected
# if we rerun this code
X_standard = StandardScaler().fit_transform(X)
rfe = RFE(lr, n_features_to_select=number_of_features)
rfe.fit(X_standard,y)
rfe_features = [f for (f, support) in zip(df.iloc[:,:-1], rfe.support_) if support]
print('RFE selected features:')
X_rfe = rfe_features
print(X_rfe)
print(rfe.score(X_standard,y))
```

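RFE ranks features by the size of their coefficients, which are only comparable when the features share a scale, so it needs standardized features. The SFS-based methods above, by contrast, were fitted on the unscaled features.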
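The elimination loop can be written out by hand to show what happens at each step. This is a simplified sketch for a binary logistic regression, not scikit-learn's `RFE` source: the helper `manual_rfe` is hypothetical, the data is synthetic, and importance is taken to be the absolute value of each coefficient.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def manual_rfe(model, X, y, n_keep):
    """Repeatedly drop the feature with the smallest absolute coefficient."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        model.fit(X[:, keep], y)
        importance = np.abs(model.coef_).ravel()  # one coefficient per kept feature
        keep.pop(int(np.argmin(importance)))      # discard the least important one
    return keep


# Synthetic data, standardized first so coefficient sizes are comparable.
X_raw, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0)
X_std = StandardScaler().fit_transform(X_raw)

kept = manual_rfe(LogisticRegression(max_iter=1000), X_std, y_demo, n_keep=4)
print(kept)
```

Without the standardization step, a feature measured in large units (such as `Age`) would get a small coefficient and could be dropped early regardless of how useful it is.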

### 3.3 SelectKBest

We will be using `chi2` and `mutual_info_classif` for this test.

The most common SelectKBest score functions:

- `f_classif`: ANOVA F-value between label/feature for classification tasks.
- `mutual_info_classif`: Mutual information for a discrete target.
- `chi2`: Chi-squared stats of non-negative features for classification tasks.
- `f_regression`: F-value between label/feature for regression tasks.
- `mutual_info_regression`: Mutual information for a continuous target.
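As a quick illustration of how `SelectKBest` uses a score function, the sketch below builds a toy dataset (invented for this example) with one column that copies the label and one column of random noise, then asks for the single best feature according to `chi2`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Toy data: column 0 is unrelated noise, column 1 is identical to the label.
y_toy = rng.integers(0, 2, 200)
noise = rng.integers(0, 2, 200)
informative = y_toy.copy()
X_toy = np.column_stack([noise, informative])

selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

scores = selector.scores_                            # one chi-squared score per column
selected_idx = selector.get_support(indices=True)    # indices of the kept columns
print(scores.round(2), selected_idx)
```

Unlike the wrapper methods above, this scoring never fits the logistic regression model: each feature is scored against the target on its own, which is fast but ignores interactions between features.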

#### SelectKBest Chi2

```
# select the k features with the highest chi-squared scores
selection = SelectKBest(score_func = chi2, k = number_of_features)
# fit the data
selection.fit_transform(X, y)
# saving a copy of the chi2 selected features in a list
X_chi2 = X[X.columns[selection.get_support(indices=True)]]
X_chi2 = list(X_chi2)
X_chi2
```

#### SelectKBest mutual_info_classif

```
selection = SelectKBest(score_func = mutual_info_classif, k = number_of_features)
# fit the data
selection.fit_transform(X, y)
# saving a copy of the mutual_info_classif selected features in a list
X_mutual_info_classif = X[X.columns[selection.get_support(indices=True)]]
X_mutual_info_classif = list(X_mutual_info_classif)
X_mutual_info_classif
```

# 4. Final Logistic Regression

Logistic Regression with `selected features`.

### 4.1 Building Final Logistic Regression

So far we have five different lists of selected features. What we can do now is try each list and find the one with the highest scores.

- X_sffs
- X_sbfs
- X_rfe
- X_chi2
- X_mutual_info_classif

In this case we will be using `X_sffs`.

```
X = df[X_sffs]
y = df['NObeyesdad']
```

```
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
```

```
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
```

```
lr = LogisticRegression(max_iter=1000)
```

```
lr.fit(x_train, y_train)
```

### 4.2 Model Evaluation

```
y_pred = lr.predict(x_test)
```

```
# print(classification_report(y_test,y_pred))
```

```
Accuracy2 = accuracy_score(y_test, y_pred)
Precision2 = precision_score(y_test, y_pred)
Recall2 = recall_score(y_test, y_pred)
F1_score2 = f1_score(y_test, y_pred)
```

```
pd.DataFrame([['accuracy', Accuracy1, Accuracy2, Accuracy2 - Accuracy1],
              ['precision', Precision1, Precision2, Precision2 - Precision1],
              ['recall', Recall1, Recall2, Recall2 - Recall1],
              ['f1_score', F1_score1, F1_score2, F1_score2 - F1_score1]],
             columns=['Score', 'Preliminary', 'Final', 'Difference'])
```
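To compare all five candidate feature lists rather than a single one, the split, scale, fit, and score steps can be wrapped in a helper and run in a loop. The sketch below is one possible way to do that: `evaluate_subset` is a hypothetical helper, and the DataFrame and subset names are synthetic stand-ins for `df` and the `X_...` lists.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def evaluate_subset(data, features, target):
    """Test accuracy of a logistic regression fitted on the given columns."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        data[features], data[target], test_size=0.30, random_state=0)
    subset_scaler = StandardScaler()
    model = LogisticRegression(max_iter=1000)
    model.fit(subset_scaler.fit_transform(X_tr), y_tr)
    return accuracy_score(y_te, model.predict(subset_scaler.transform(X_te)))


# Synthetic stand-in for the survey data.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0)
toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
toy["target"] = y_arr

# Hypothetical candidate lists, playing the role of X_sffs, X_sbfs, and so on.
candidates = {
    "subset_a": ["f0", "f1", "f2"],
    "subset_b": ["f3", "f4", "f5"],
    "all_features": [f"f{i}" for i in range(6)],
}

results = {name: evaluate_subset(toy, cols, "target") for name, cols in candidates.items()}
best_name = max(results, key=results.get)
print(results)
print(best_name)
```

With the notebook's own data, the call would look like `evaluate_subset(df, X_rfe, 'NObeyesdad')`, assuming the selection cells above have been run.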

### 4.3 Confusion Matrix

```
model_matrix = confusion_matrix(y_test, y_pred)
model_matrix
```

```
# Visualize
fig, ax = plt.subplots(figsize=(8,5))
# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
```
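The label order used in the heatmap (`True Neg`, `False Pos`, `False Neg`, `True Pos`) relies on how scikit-learn lays out a binary confusion matrix: rows are actual classes, columns are predicted classes, both sorted as 0 then 1. A small check on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration.
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true_demo, y_pred_demo)

# Flattening row by row yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```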

### 4.4 Test Run

```
# calling our X_sffs subset and df['NObeyesdad'] for reference
# print(df[X_sffs].columns)
df[['Gender', 'Age', 'family_history_with_overweight', 'FAVC', 'CAEC',
    'SCC', 'FAF', 'Walking', 'NObeyesdad']].head(2)
```

```
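# feature values for alyx, in the same column order as X_sffs
alyx = [[0,21,1,0,1,0,0,0]]
# scale with the scaler fitted on the training data
alyx = scaler.transform(alyx)
# predicted class: 1 = obese, 0 = not obese
lr.predict(alyx)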
```
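A new row has to pass through the same scaler as the training data before prediction, as in the test run above. One way to make that harder to forget is to bundle the scaler and the model in a scikit-learn pipeline. This is an alternative sketch on synthetic data, not the approach used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with eight features, standing in for the selected columns.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0)

# The pipeline scales and then classifies, both at fit time and at predict time.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

new_row = X_demo[:1]                 # one unscaled row, shaped (1, 8)
label = pipe.predict(new_row)        # predicted class for the row
proba = pipe.predict_proba(new_row)  # probability of each class
print(label, proba.round(3))
```

`predict_proba` is also available on the plain `lr` model and can be useful in a test run, since it shows how confident the model is rather than only the predicted class.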