Ridge and Lasso¶
Why Regularized?
Ridge and Lasso are regularization techniques used to reduce model overfitting: the situation where a model scores very well on the training dataset but performs significantly worse on test data.
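At a high level, both techniques add a penalty on the size of the coefficients to the model's loss. The short sketch below (with made-up coefficient values, not the wine data) shows the two penalty terms: Ridge penalizes the squared coefficients (L2), while Lasso penalizes their absolute values (L1), which is what lets Lasso push some coefficients exactly to zero.
# Minimal sketch of the Ridge (L2) and Lasso (L1) penalty terms,
# using hypothetical coefficient values rather than the wine data
import numpy as np
coefs = np.array([0.5, -2.0, 0.0, 3.0])     # hypothetical model coefficients
alpha = 1.0                                 # regularization strength
l2_penalty = alpha * np.sum(coefs ** 2)     # Ridge adds this to the loss
l1_penalty = alpha * np.sum(np.abs(coefs))  # Lasso adds this to the loss
print(l2_penalty, l1_penalty)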
Data Source: Wine Quality Data Set
Project Goal:
The original dataset has a 1-10 rating for each wine. I’ve turned it into a classification problem with a wine quality of good (rating > 5) or bad (rating <= 5). The goals of this project are to:
- Implement Ridge (L2) and Lasso (L1) regularization for both logistic and linear regression.
- Find the best alpha value using hyperparameter tuning (GridSearchCV and LogisticRegressionCV).
- Implement a tuned lasso-regularized feature selection method.
Content:
- Logistic Regression Model
- 1.1 Lasso as Feature Selection Method
- 1.2 Logistic Regression (default/Ridge)
- 1.3 Hyperparameter Tuning
- 1.4 Evaluation
- Linear Regression Model
- 2.1 Lasso Regularization for Linear Model
- 2.2 Ridge Regularization for Linear Model
- Conclusion
1. Logistic Regression¶
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LogisticRegression, Lasso, Ridge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
df = pd.read_csv('wine_quality.csv')
df.head()
1.1 Lasso as Feature Selection Method¶
Instead of using GridSearchCV, we’re going to use LogisticRegressionCV. The syntax here is a little different. The arguments to LogisticRegressionCV that are relevant to us:
- Cs: a list/array of C values to check; choose values between 0.01 and 100 here.
- cv: number of folds (5 is a good choice here!).
- penalty: remember to choose 'l1' for this!
- solver: recall that the L1 penalty requires that we specify the solver to be 'liblinear'.
- scoring: 'f1' is still a great choice for a classifier.
(Note that we’re not doing a train-test-validation split like last time!)
# Define X and y
y = df['quality']
features = df.drop(columns = ['quality'])
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
# array from 0.01 to 100 in 100 values
C_array = np.logspace(-2,2,100)
# instance/object of LogisticRegressionCV
clf_l1 = LogisticRegressionCV(Cs=C_array,
                              cv = 5,
                              penalty = 'l1',
                              scoring = 'f1',
                              solver = 'liblinear',
                              max_iter = 1000,
                              random_state = 99)
# I'm fitting 'scaled_features' here rather than x_train and y_train
# because I want this regularization to act as a feature selection method,
# and feature selection must come before train_test_split
clf_l1.fit(scaled_features, y)
The classifier has the attribute C_, which gives the optimal C value. The attribute coef_ gives us the coefficients of the best lasso-regularized classifier.
# print('Best C value', clf_l1.C_)
# print('Best fit coefficients', clf_l1.coef_)
print(clf_l1.C_, clf_l1.scores_[1].mean(axis=0).max())
Visualize the coefficients
predictors = features.columns
coefficients = clf_l1.coef_.ravel()
coef = pd.Series(coefficients,predictors).sort_values()
plt.figure(figsize = (10,6))
coef.plot(kind='bar', title = 'Coefficients for tuned L1')
plt.tight_layout()
plt.show()
plt.clf()
Notice how our Lasso (L1) classifier has set the density coefficient to zero! We have effectively eliminated this feature from the model, using Lasso regularization as a feature selection method here.
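Going one step further, we could turn the fitted coefficients into an explicit column selection. This is only a sketch reusing clf_l1 and features from the cells above (and assuming the usual single-row coef_ layout for a binary classifier); which columns get kept or dropped depends entirely on the fitted coefficients.
# Sketch: keep only the columns whose tuned L1 coefficients are non-zero
mask = clf_l1.coef_.ravel() != 0
print('Dropped:', list(features.columns[~mask]))
print('Kept:', list(features.columns[mask]))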
1.2 Logistic Regression Model¶
By default, sklearn regularizes logistic regression with L2 (Ridge) regularization, and the default C, the inverse of regularization strength, is set to 1.0.
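As a quick sanity check (a small sketch, not part of the original workflow, relying only on the LogisticRegression import above), these defaults can be read off the estimator itself:
# Inspect scikit-learn's defaults for LogisticRegression
params = LogisticRegression().get_params()
print(params['penalty'], params['C'])  # 'l2', 1.0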
Predict wine quality
- 1 is good
- 0 is bad
y = df['quality']
# X = df.drop(columns = ['quality', 'total sulfur dioxide', 'free sulfur dioxide', 'residual sugar', 'fixed acidity' ])
X = df.drop(columns = ['density', 'quality' ])
Let’s split our dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 99)
Let’s scale our data using StandardScaler().
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
Create a logistic regression instance and fit it on the training dataset.
clf_default = LogisticRegression()
clf_default.fit(x_train, y_train)
Checking f1 score.
It is important that the classifier not only has high accuracy, but also high precision and recall, i.e., a low false positive and false negative rate.
A metric known as the F1 score, which is the harmonic mean of precision and recall, captures the performance of a classifier holistically.
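As a small illustration with made-up precision and recall values (not numbers from this dataset), the F1 computation is:
# Sketch: F1 is the harmonic mean of precision and recall
precision, recall = 0.80, 0.60   # hypothetical values
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ~0.686; f1_score() derives precision and recall from the predictions for us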
# setting variables for the predicted training and testing set
y_pred_train = clf_default.predict(x_train)
y_pred_test = clf_default.predict(x_test)
# print the f1 score of actual vs prediction for training and testing set
l2_training = f1_score(y_train, y_pred_train)
l2_testing = f1_score(y_test, y_pred_test)
print('Ridge-regularized Training Score:', l2_training)
print('Ridge-regularized Testing Score:', l2_testing)
Let’s find the best C value and fit the model again to see if we can further improve our score.
We’ll create a for loop over a list of C values, fit the model for each one, and then visualize each C’s score.
# Determining the array range to use later in our GridSearchCV
C_array_initial = [0.0001, 0.001, 0.01, 0.1, 1, 2, 3, 4]
training_array = []
test_array = []
for x in C_array_initial:
    clf = LogisticRegression(C = x)
    clf.fit(x_train, y_train)
    # prediction for the training set
    y_pred_train = clf.predict(x_train)
    # prediction for the testing set
    y_pred_test = clf.predict(x_test)
    # actual training values vs predicted training values
    training_array.append(f1_score(y_train, y_pred_train))
    # actual testing values vs predicted testing values
    test_array.append(f1_score(y_test, y_pred_test))
# print(training_array)
# print(test_array)
Visualize the training array and testing array score.
plt.figure(figsize = (15,7))
plt.plot(C_array_initial,training_array, color='r', marker = 'o', label='training score')
plt.plot(C_array_initial,test_array, color='b', marker = 'o', label='testing score')
plt.legend()
plt.xscale('log' )
# make x and y ticks bigger
plt.xlabel('C value', fontsize = 20)
plt.ylabel('score', fontsize = 20)
plt.tick_params(axis='both', which='major', labelsize=20)
plt.show()
The optimal C seems to be somewhere around 0.001, so a search window between 0.0001 and 0.01 is not a bad idea here!
1.3 Hyperparameter Tuning¶
GridSearchCV (an exhaustive search is feasible here since we have a very small dataset)¶
We’re now ready to perform hyperparameter tuning using GridSearchCV! Looking at the plot, the optimal C seems to be somewhere around 0.001 so a search window between 0.0001 and 0.01 is not a bad idea here.
Let’s first get set up with the right inputs for this. Use np.logspace() to obtain 100 values between 10^(-4) and 10^(-2) and define a dictionary of C values named tuning_C that can function as an input to GridSearchCV‘s parameter grid.
# search between 0.0001 and 0.01 in 100 values
C_array = np.logspace(-4, -2, 100)
tuning_C = [{'C': C_array}]
from sklearn.model_selection import GridSearchCV
# Create an object for ridge-regularized (L2) classification
clf_gs_L2 = LogisticRegression()
gs = GridSearchCV(estimator = clf_gs_L2,
                  param_grid = tuning_C,
                  scoring = 'f1',
                  cv = 5)
# fit the training dataset
gs.fit(x_train, y_train)
Show best parameters
print(gs.best_params_, gs.best_score_)
Visualize
plt.figure(figsize = (15,7))
plt.plot(C_array_initial,training_array, color='r', marker = 'o', label='training score')
plt.plot(C_array_initial,test_array, color='b', marker = 'o', label='testing score')
plt.xscale('log' )
# make x and y ticks bigger
plt.xlabel('C value', fontsize = 20)
plt.ylabel('score', fontsize = 20)
plt.tick_params(axis='both', which='major', labelsize=20)
plt.axvline(x = gs.best_params_['C'], color ='green', linestyle = '--', label = 'best_param')
plt.legend()
plt.show()
Fit the model again, but this time we’re setting C to the best_params_ value.
clf_gs_L2 = LogisticRegression(C = gs.best_params_['C'], random_state=0)
clf_gs_L2.fit(x_train,y_train)
# prediction for the training set using clf_gs_L2
y_pred_best_train = clf_gs_L2.predict(x_train)
# prediction for the test set using clf_gs_L2
y_pred_best = clf_gs_L2.predict(x_test)
# actual training values vs the predicted train value
l2_training_bestC = f1_score(y_train, y_pred_best_train)
# actual test values vs the predicted test value
l2_testing_bestC = f1_score(y_test, y_pred_best)
print('Training score: ' + str(l2_training_bestC))
print('Testing score: ' + str(l2_testing_bestC))
So far we see a slight improvement in our testing score without losing too much training score. Very nice.
1.4 Evaluation¶
F1 Scores:
# hide/show
print('Default model (L2):')
print('Training score:', l2_training)
print('Testing score:', l2_testing)
print('\n')
print('Hyperparameter-tuned model (best C):')
print('Training score:', l2_training_bestC)
print('Testing score:', l2_testing_bestC)
y_pred = clf_gs_L2.predict(x_test)
model_matrix = confusion_matrix(y_test, y_pred)
model_matrix
# Visualize
fig, ax = plt.subplots(figsize=(8,5))
# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
2. Linear Regression Model¶
Prediction for pH value
# Correlation between features and target
plt.figure(figsize=(8, 8))
corr_matrix = df.corr()
# Isolate the column corresponding to `pH`
corr_target = corr_matrix[['pH']].drop(labels=['pH'])
corr_target = corr_target.sort_values(by=['pH'], ascending=True)
sns.heatmap(corr_target, annot=True, fmt='.3', cmap='RdBu_r')
from sklearn.linear_model import LinearRegression, Lasso
# Define X and y
y = df[['pH']]
X = df[['fixed acidity', 'citric acid', 'density', 'chlorides']]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 99)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
lr = LinearRegression()
lr.fit(x_train, y_train)
Checking r2 score for linear regression model¶
print(lr.score(x_train,y_train))
print(lr.score(x_test,y_test))
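For reference, here is a small sketch of what .score() reports for a regressor: the R2 value, 1 - SS_res / SS_tot. It reuses lr, x_test, and y_test from above, and the printed value should match lr.score(x_test, y_test).
# Sketch: compute R^2 by hand and compare with .score()
y_pred = lr.predict(x_test)
residual_ss = np.sum((np.asarray(y_test) - y_pred) ** 2)
total_ss = np.sum((np.asarray(y_test) - np.asarray(y_test).mean()) ** 2)
print(1 - residual_ss / total_ss)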
2.1 Lasso Regularization for Linear Model¶
Check for the best alpha using GridSearchCV.
# an array of alpha values between 0.000001 and 0.001
alpha_array = np.logspace(-6, -3, 50)
#dict with key (alpha) and values being alpha_array
tuned_parameters = [{'alpha': alpha_array}]
from sklearn.model_selection import GridSearchCV
gs_l1 = GridSearchCV(estimator = Lasso(),
                     param_grid = tuned_parameters,
                     scoring = 'neg_mean_squared_error',
                     cv = 5,
                     return_train_score = True)
# Gridsearch fit X and y
gs_l1.fit(X, y)
gs_l1.best_params_['alpha'], gs_l1.best_score_
# Visualize
test_scores = gs_l1.cv_results_['mean_test_score']
train_scores = gs_l1.cv_results_['mean_train_score']
plt.figure(figsize=(10, 7))
plt.xlabel('Alpha array', fontsize=15)
plt.ylabel('score', fontsize=15)
plt.tick_params(axis='both', which='major', labelsize=15)
plt.axvline(x = gs_l1.best_params_['alpha'], color ='green', linestyle = '--', label = 'best_param')
plt.plot(alpha_array, train_scores, color='r', label='training score')
plt.plot(alpha_array, test_scores, color='b', label='testing score' )
plt.legend()
Fitting lasso regression model¶
lasso = Lasso(alpha = gs_l1.best_params_['alpha'])
lasso.fit(x_train, y_train)
Lasso R2 score¶
print(lasso.score(x_train,y_train))
print(lasso.score(x_test,y_test))
2.2 Ridge Regularization for Linear Model¶
alpha_array = np.logspace(-2, .2, 50)
#dict with key (alpha) and values being alpha_array
tuned_parameters = [{'alpha': alpha_array}]
gs_l2 = GridSearchCV(estimator = Ridge(),
                     param_grid = tuned_parameters,
                     scoring = 'neg_mean_squared_error',
                     cv = 5,
                     return_train_score = True)
# Gridsearch fit X and y
gs_l2.fit(X, y)
gs_l2.best_params_['alpha'], gs_l2.best_score_
test_scores = gs_l2.cv_results_['mean_test_score']
train_scores = gs_l2.cv_results_['mean_train_score']
plt.figure(figsize=(10, 7))
plt.xlabel('Alpha array', fontsize=15)
plt.ylabel('score', fontsize=15)
plt.tick_params(axis='both', which='major', labelsize=15)
plt.axvline(x = gs_l2.best_params_['alpha'], color ='green', linestyle = '--', label = 'best_param')
plt.plot(alpha_array, train_scores, color='r', label='training score')
plt.plot(alpha_array, test_scores, color='b', label='testing score' )
plt.legend()
Fitting Ridge regression model¶
ridge = Ridge(alpha = gs_l2.best_params_['alpha'])
ridge.fit(x_train, y_train)
Ridge R2 score¶
print(ridge.score(x_train,y_train))
print(ridge.score(x_test,y_test))
# hide/show
print('r2 scores:')
print('\n')
print('Default linear regression:')
print('Training score:', lr.score(x_train, y_train))
print('Testing score:', lr.score(x_test, y_test))
print('\n')
print('Lasso:')
print('Training score: ' + str(lasso.score(x_train,y_train)))
print('Testing score: ' + str(lasso.score(x_test,y_test)))
print('\n')
print('Ridge:')
print('Training score: ' + str(ridge.score(x_train,y_train)))
print('Testing score: ' + str(ridge.score(x_test,y_test)))