Linear Regression¶
Salary Prediction
Data source: Association of Tennis Professionals (ATP)
Tennis Ace¶
Project Goals
- Create a linear regression model that predicts the earnings of tennis players based on their playing habits, by analyzing and modeling Association of Tennis Professionals (ATP) data.
Target/Outcomes
- Winnings: total winnings in USD($) in a year
Data Dictionary:
–Not Available–
Content¶
- Import and Load Data
- Preliminary Linear Regression
- 2.1 Building a Linear Regression Model
- 2.2 Model Accuracy
- Feature Selection
- 3.1 Wrapper Methods
- 3.2 RFE
- 3.3 SelectKBest
- 3.4 Further Observation
- Final Linear Regression
- 4.1 Building a Linear Regression Model
- 4.2 Final Regression Evaluation
- 4.3 Test Run
- Conclusion
1. Import and Load Data¶
import pandas as pd
import numpy as np
from scipy.stats import pearsonr, skew
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.feature_selection import mutual_info_regression, RFE, SelectKBest, f_regression
from functools import partial
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
# load data
df = pd.read_csv('tennis_stats.csv')
print(df.shape)
df.head(2)
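Since the data dictionary above is marked as not available, a quick inspection sketch can recover the column names, types, and value ranges:
# column names and dtypes, then basic stats for the numeric columns
df.info()
df.describe().T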
2. Preliminary Linear Regression¶
We will start by building a linear regression with all the features available and compare it later to our final model built on selected features.
2.1 Building a Linear Regression Model¶
Split the data into X and y¶
y = df['Winnings']
X = df.drop(columns=['Winnings', 'Player'])
Train Test split¶
Splitting our data into 70% training and 30% testing.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
Standardize the Data¶
Our features (X) must be on the same scale; StandardScaler rescales each feature to zero mean and unit variance.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
Linear Regression Model¶
lr = LinearRegression()
Fit the model¶
Use the .fit() method on lr to fit the model to x_train and y_train.
lr.fit(x_train, y_train)
2.2 Model Accuracy¶
Note:
- Overfitting: the model fits the training data well but performs significantly worse on the test data.
- When a model overfits, we can apply regularization (Lasso, Ridge), as in the sketch below.
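A minimal sketch of what that regularization could look like on this same split (Ridge is not among the imports above, and alpha=1.0 is an arbitrary illustrative value):
from sklearn.linear_model import Ridge
# compare regularized fits on the standardized training data
for name, model in [('Lasso', Lasso(alpha=1.0)), ('Ridge', Ridge(alpha=1.0))]:
    model.fit(x_train, y_train)
    print(name, '- train:', model.score(x_train, y_train), '| test:', model.score(x_test, y_test))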
R2 score¶
print('Training Score: ' + str(lr.score(x_train,y_train)))
print('Testing Score: ' + str(lr.score(x_test,y_test)))
Mean Squared Error (Training and Testing Dataset)¶
# high difference is an indicator of overfitting
# prediction from training set
pred_train = lr.predict(x_train)
# prediction from testing set
pred_test = lr.predict(x_test)
# MSE, prediction and actual from training set
MSE_train = np.mean((pred_train - y_train)**2)
print("Training Error: ", MSE_train)
# MSE, prediction and actual from testing set
MSE_test = np.mean((pred_test - y_test)**2)
print("Testing Error: ", MSE_test)
print("Difference %: ", abs((MSE_train - MSE_test) / MSE_train) )
3. Feature Selection¶
3.1 Wrapper Methods¶
First, let's define the variables for our feature selection methods:
- number_of_features: the number of features to select
- scoring_method:
  - for classifiers: {'accuracy', 'f1', 'precision', 'recall', 'roc_auc'}
  - for regressors: {'r2', 'max_error', 'neg_mean_squared_error', 'median_absolute_error'}
Documentation: https://scikit-learn.org/stable/modules/model_evaluation.html
Note: We're fitting the original X and y variables here instead of the training and testing datasets.
We will use r2 as our scoring method.
Important formulas:
y = actual value
ŷ = predicted value
- r2 = 1 - (rss / tss)
- tss = sum((y - mean_of_y)^2)
- rss = sum((y - ŷ)^2)
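A quick numeric sketch checking the formula against scikit-learn's r2_score (the toy values are made up for illustration):
from sklearn.metrics import r2_score
y_actual = np.array([3.0, 5.0, 7.0, 9.0])      # toy actual values
y_hat = np.array([2.5, 5.5, 7.0, 8.0])         # toy predictions
tss = np.sum((y_actual - y_actual.mean())**2)  # total sum of squares
rss = np.sum((y_actual - y_hat)**2)            # residual sum of squares
print(1 - rss / tss, r2_score(y_actual, y_hat))  # both print 0.925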
# Input variable here:
number_of_features = 8
scoring_method = 'r2'
Sequential Forward Floating Selection¶
sffs = SFS(lr,
k_features=number_of_features,
forward= True,
floating= True,
scoring=scoring_method,
cv=0)
# Note: X here is not standardized, and SFS does not standardize features itself.
# For ordinary least squares the r2 score is unaffected by feature scaling,
# so fitting the unscaled X is fine for ranking feature subsets.
sffs.fit(X, y)
print('SFFS selected features:')
# Saving a copy of sffs selected features in a list variable
X_sffs = list(sffs.subsets_[number_of_features]['feature_names'])
print(X_sffs)
print('\n' + str(scoring_method) + ': ' + str(sffs.subsets_[number_of_features]['avg_score']))
Sequential Backward Floating Selection¶
sbfs = SFS(lr,
k_features=1,
forward= False,
floating= True,
scoring=scoring_method,
cv=0)
sbfs.fit(X, y)
print('SBFS selected features:')
# Saving a copy of sbfs selected features in a list variable
X_sbfs = list(sbfs.subsets_[1]['feature_names'])
print(X_sbfs)
print('\n' + str(scoring_method) + ': ' + str(sbfs.subsets_[1]['avg_score']))
Wins has the highest r2 score as a single feature.
Visualize SFFS and SBFS¶
# Visualize sffs
# Visualize in DataFrame (optional, uncomment to view)
# pd.DataFrame.from_dict(sffs.get_metric_dict()).T
fig1 = plot_sfs(sffs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Floating Selection (w. StdDev)')
plt.grid()
plt.show()
#Visualize sbfs
fig2 = plot_sfs(sbfs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Backward Floating Selection (w. StdDev)')
plt.grid()
plt.show()
The graphs show little variation between using 8 and 21 features. There is a very slight increase when using all the features, but it is not enough to justify keeping them, and doing so may introduce redundant information. We can simplify the model by using just 8 features instead.
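To read the exact numbers behind the plots, a small sketch tabulating the average score per subset size from the backward run (mirroring the commented DataFrame hint above):
# avg_score per number of features, best first
metric = pd.DataFrame.from_dict(sbfs.get_metric_dict()).T
print(metric['avg_score'].astype(float).sort_values(ascending=False).head())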
3.2 RFE¶
Ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until a specified number of features remains.
Before applying recursive feature elimination, it is necessary to standardize the data.
# Set a variable 'X_standard' so that our original X features remain unaffected
# if we rerun this code
X_standard = StandardScaler().fit_transform(X)
rfe = RFE(lr, n_features_to_select=number_of_features)
rfe.fit(X_standard,y)
rfe_features = [f for (f, support) in zip(X, rfe.support_) if support]
print('RFE selected features:')
X_rfe = rfe_features
print(X_rfe)
print('\nrfe score: '+ str(rfe.score(X_standard,y)))
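RFE also exposes a ranking_ attribute; a small sketch to see where each feature ranked:
# rank 1 = selected; the largest rank was eliminated first
print(pd.Series(rfe.ranking_, index=X.columns).sort_values())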
3.3 SelectKBest¶
We will be using f_regression and mutual_info_regression for this test.
SelectKBest's most common score functions:
- f_classif: ANOVA F-value between label/feature for classification tasks.
- mutual_info_classif: Mutual information for a discrete target.
- chi2: Chi-squared stats of non-negative features for classification tasks.
- f_regression: F-value between label/feature for regression tasks.
- mutual_info_regression: Mutual information for a continuous target.
SelectKBest f_regression¶
# If we need to pass additional arguments (e.g., random_state=0) besides the
# features and targets, we can wrap the score function with partial():
# score_function = partial(score_func, random_state=0)
selection = SelectKBest(score_func = f_regression, k = number_of_features)
# fit the data
selection.fit_transform(X, y)
# saving a copy of the f_regression selected features in a list
X_f_regression = X[X.columns[selection.get_support(indices=True)]]
X_f_regression = list(X_f_regression)
X_f_regression
SelectKBest mutual_info_regression¶
selection = SelectKBest(score_func = mutual_info_regression, k = number_of_features)
# fit the data
selection.fit_transform(X, y)
# saving a copy of the mutual_info_regression selected features in a list
X_mutual_info_regression = X[X.columns[selection.get_support(indices=True)]]
X_mutual_info_regression = list(X_mutual_info_regression)
X_mutual_info_regression
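Since mutual_info_regression is stochastic, the selected set can vary between runs. This is where the partial() import comes in, as hinted in the comment above; a sketch (not what was run above) pinning random_state for reproducibility:
# wrap the score function so SelectKBest uses a fixed random_state
score_function = partial(mutual_info_regression, random_state=0)
selection = SelectKBest(score_func=score_function, k=number_of_features)
selection.fit_transform(X, y)
print(list(X.columns[selection.get_support(indices=True)]))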
3.4 Further Observation¶
Almost half of our features are correlated with our target. However, we want to minimize our features to avoid redundant information and multicollinearity as much as possible.
As observed, our selection methods keep the Losses feature, which is the opposite of what our general knowledge would lead us to expect.
Let's view it in a scatter plot.
plt.scatter(df['Losses'], df['Winnings'])
plt.title('Losses and Winnings')
plt.xlabel('Losses')
plt.ylabel('Winnings')
corr, p = pearsonr(df['Losses'], df['Winnings'])
corr, p
As seen in the graph, Losses is 86.92% correlated with Winnings. I suspect that the numbers of losses and wins both trace back to the overall number of ServiceGamesPlayed.
The graphs below show the correlation between wins, losses, and service games played (the number of games a player has played).
# show plot
plt.figure(figsize=(15, 5))
ax1 = plt.subplot(1,2,1)
plt.scatter(df['ServiceGamesPlayed'], df['Losses'])
plt.xlabel('Service Games Played')
plt.ylabel('Losses')
ax2 = plt.subplot(1,2,2)
plt.scatter(df['ServiceGamesPlayed'], df['Wins'] )
plt.xlabel('Service Games Played')
plt.ylabel('Wins')
The number of ServiceGamesPlayed drives the numbers of Wins and Losses, which in turn affect our target Winnings. This is an example of duplicate information: we can drop whichever feature has the lower correlation with our target. We can view that information with a correlation heatmap.
# Correlation between features and target
plt.figure(figsize=(8, 8))
corr_matrix = df.corr(numeric_only=True)  # numeric_only skips the Player name column
# Isolate the column corresponding to `Winnings`
corr_target = corr_matrix[['Winnings']].drop(labels=['Winnings'])
corr_target = corr_target.sort_values(by=['Winnings'], ascending=False)
sns.heatmap(corr_target, annot=True, fmt='.3', cmap='RdBu_r')
The correlation table clearly displays that Wins has a higher correlation with Winnings than Losses does.
We will drop Losses in this case; ReturnGamesPlayed was considered as a replacement, but the final list below leaves it out, keeping 7 features.
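For the record, a one-line check of those two correlations straight from the matrix computed above:
# correlation of Wins and Losses with the target
print(corr_matrix.loc[['Wins', 'Losses'], 'Winnings'])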
Creating our own feature list (starting from X_sffs):
X_my_list =['Year',
'BreakPointsFaced',
'BreakPointsOpportunities',
'DoubleFaults',
'ServiceGamesPlayed',
'Wins',
# 'Losses',
'Ranking']
4. Final Linear Regression¶
Linear regression with the selected features.
4.1 Building the Final Linear Regression¶
So far we have six different lists of selected features. What we can do is try each list and find the one with the highest scores; a sketch of that comparison follows the list below. In this case we will use X_my_list:
- X_my_list
- X_sffs
- X_sbfs
- X_rfe
- X_f_regression
- X_mutual_info_regression
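A minimal sketch of that comparison loop, repeating the same split/scale/fit steps used below for each candidate list:
# try every candidate feature list and report train/test r2
candidates = {'X_my_list': X_my_list, 'X_sffs': X_sffs, 'X_sbfs': X_sbfs,
              'X_rfe': X_rfe, 'X_f_regression': X_f_regression,
              'X_mutual_info_regression': X_mutual_info_regression}
for name, features in candidates.items():
    xtr, xte, ytr, yte = train_test_split(df[features], df['Winnings'],
                                          test_size=0.30, random_state=0)
    sc = StandardScaler()
    xtr, xte = sc.fit_transform(xtr), sc.transform(xte)
    model = LinearRegression().fit(xtr, ytr)
    print(name, '- train:', round(model.score(xtr, ytr), 3),
          '| test:', round(model.score(xte, yte), 3))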
X = df[X_my_list]
y = df['Winnings']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
lr = LinearRegression()
lr.fit(x_train, y_train)
We have narrowed our features down from 21 to just 7 without sacrificing too much model performance.
4.2 Final Regression Evaluation¶
Our regression model's scores:
# r2 score
# r2 = 1 - (rss / tss)
# tss = total sum of squares: how far each observation is from the mean,
#       sum of (each actual - mean of actuals)^2
# rss = residual sum of squares (lower is better): sum of (actual - predicted)^2
print(lr.score(x_train,y_train))
print(lr.score(x_test,y_test))
# Mean squared error
# mean of (actual - predicted)^2
# a large train/test difference is an indicator of overfitting
#Training Error
pred_train = lr.predict(x_train)
MSE_train = np.mean((pred_train - y_train)**2)
print("Training Error: ", MSE_train)
# 2. Testing Error
pred_test = lr.predict(x_test)
MSE_test = np.mean((pred_test - y_test)**2)
print("Testing Error: ", MSE_test)
print("Difference %: ", abs((MSE_train - MSE_test) / MSE_train) )
Using just 7 features, our final regression model's score shows no significant difference from our preliminary model with all the features.
Visualization of actual and predicted earnings.¶
y_predict = lr.predict(x_test)
# visualization of actual vs. predicted value
plt.figure(figsize=(8, 6))
# y_test is 30% of our data
plt.scatter(y_test, y_predict, alpha=0.4, color = 'red')
plt.title("Actual Vs. Predicted Winnings")
plt.xlabel("Actual Winnings")
plt.ylabel("Predicted Winnings")
# build a comparison table of actual vs. predicted winnings
a = y_test.values
b = y_predict
diff = b - a
sim_data={"Actual Winnings":a, 'Predicted Winnings':b, 'Difference':np.round(diff,2)}
sim_data=pd.DataFrame(sim_data)
# Showing first 5 rows
sim_data.head(5)
4.3 Model Test Run¶
Let's say we have a very talented tennis player named Alyx. We extracted his gameplay record and saved it in the variable alyx_record.
- What will Alyx earn in a year with the statistical record below?
Note: alyx_record is based on df index number 465. We intentionally copied these stats so that we can compare the prediction to the real value in our sim_data.
df[X_my_list].columns
# showing details for index 465
index_465 = df[df['Winnings'].isin([492583])]
index_465[['Year', 'BreakPointsFaced', 'BreakPointsOpportunities', 'DoubleFaults',
'ServiceGamesPlayed', 'Wins', 'Ranking']]
alyx_record = [[2015, 284, 252, 84, 450, 16, 24]]
# standardize alyx_record with the scaler fitted on the training data
alyx_record = scaler.transform(alyx_record)
alyx_record
alyx_earning = lr.predict(alyx_record)
alyx_earning
# compare the prediction with the actual values stored in sim_data
sim_data.iloc[[0]]
5. Conclusion¶
We are comparing continuous features, and for this dataset the best method of feature selection is to check their correlations.
Summary:
- We built a preliminary linear regression with all the features included.
- Used various feature selection methods, combined with our general knowledge, to select the features that best describe our target, narrowing it down from 21 to just 7.
- Recreated our final linear regression model and compared its performance with our preliminary model.
- Finally, we evaluated and test-ran our final regression model and got an r2 score of about 89% on both the training and testing datasets.
– fin