Support Vector Machine¶
Artificial Intelligence
About the data source:
pybaseball is a Python package for baseball data analysis. This package scrapes Baseball Reference, Baseball Savant, and FanGraphs so you don’t have to. The package retrieves statcast data, pitching stats, batting stats, division standings/team records, awards data, and more. Data is available at the individual pitch level, as well as aggregated at the season level and over custom time periods.
GitHub link: pybaseball
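This notebook uses statcast_pitcher, but the same package also exposes the other granularities mentioned above. A minimal sketch, assuming the usual statcast and pitching_stats call signatures (the dates and season below are arbitrary examples):
from pybaseball import statcast, pitching_stats
# individual pitch level for a short date range
pitch_level = statcast(start_dt='2015-06-01', end_dt='2015-06-02')
# aggregated pitching stats for a full season
season_level = pitching_stats(2015)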
Project Goal:
- Create a Support Vector Machine model that predicts whether a pitch is a strike or a ball based on its location over the plate.
Content:
- Import and Load Libraries/Dataset
- Defining Features and Label
- SVM
- Model Evaluation
- Parameter Optimization
- Optimized SVM
- Model Test Run
1. Import and Load Libraries/Dataset¶
from pybaseball import statcast, playerid_lookup, statcast_pitcher, pitching_stats
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
Let’s extract some baseball player information
# look for player name `Kershaw Clayton`
playerid_lookup('kershaw', 'clayton')
# We are interested in his game stats, specifically from 2015-01-01 to 2016-01-01
# saving his game statistics in a variable df
df = statcast_pitcher('2015-01-01', '2016-01-01', 477132)
# show data
print(df.shape)
df.head(1)
2. Defining Features and Label¶
Determine our label (y)¶
We’re interested in whether a pitch is a ‘strike’ or a ‘ball’. That information is stored in the type feature. Let’s look at the unique values stored in the type feature to get a sense of how balls and strikes are recorded.
df.type.unique()
Note:
About Strike and Ball
Strike
- The pitcher throws the ball into the strike zone and the batter doesn’t swing.
- The pitcher throws the ball and the batter swings but doesn’t hit it. It doesn’t matter whether the ball is in the strike zone or not.
- After 3 strikes, the batter is out.
Ball
- The pitcher throws the ball outside the strike zone and the batter doesn’t swing.
- After 4 balls, the batter walks to first base.
Reference: baseball rules video
We know every row’s type feature is either an ‘S’ for a strike, a ‘B’ for a ball, or an ‘X’ for neither (for example, an ‘X’ could be a hit or an out).
We’ll want to use this feature as the label of our data points. However, instead of using strings,
it will be easier if we change every ‘S’ to a 1 and every ‘B’ to a 0. We can change the values of a DataFrame column using the map() function.
df['type'] = df['type'].map({'S':1, 'B':0})
df.type.unique()
We will deal with the NaN values later; first, let’s check the proportion and balance of our label values.
# count of 1 and 0 in type
# 1 is strike and 0 is ball
print(df['type'].value_counts())
There are 1,969 strikes (around 60%) and 1,213 balls (around 40%). This distribution of strikes and balls is fairly well balanced.
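As a quick check of those percentages, value_counts(normalize=True) returns the same counts as fractions (a small optional addition):
# proportion of strikes (1) and balls (0)
print(df['type'].value_counts(normalize=True))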
Determine our features (X)¶
We want to predict whether a pitch is a ball or a strike based on its location over the plate. We can find the ball’s location in the columns plate_x and plate_z.
Note:
plate_x:
Measures how far left or right the pitch is from the center of home plate. If plate_x = 0, the pitch was directly over the middle of home plate.
plate_z:
Measures how high off the ground the pitch was. If plate_z = 0, the pitch was at ground level when it reached home plate.
For the sake of this project we will focus only on building our SVM model. Although PCA or other feature-selection methods might improve things, we will save that task for another time.
Let’s check and drop null values in our selected columns (plate_x, plate_z, and type). Our SVM won’t accept any null values.
# number of null values
df[['plate_x', 'plate_z', 'type']].isna().sum()
# proportion of null values
df[['plate_x', 'plate_z', 'type']].isna().sum() / len(df)
# remove every row that has a NaN in any of these columns and save result into a new dataset.
df_new = df.dropna(subset = ['plate_x', 'plate_z', 'type'])
# verify df_new
# proportion of null values
df_new[['plate_x', 'plate_z', 'type']].isna().sum() / len(df_new)
Visualization¶
plt.figure(figsize=(10,6))
sns.scatterplot(x=df_new['plate_x'], y=df_new['plate_z'], hue=df_new['type'], palette=['blue', 'red'], alpha=0.5)
plt.legend()
We can see from the graph that strikes are mostly clustered around the center (the strike zone), while balls fall around the edges.
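As a rough visual check of that observation, we can overlay an approximate strike zone on the same scatter. This is only an illustrative sketch: the horizontal limits use half the 17-inch plate width (about 0.71 ft), and the vertical limits use the median sz_bot/sz_top columns, assuming those columns are present in the statcast data, so it is not the exact rulebook zone for any particular batter.
import matplotlib.patches as patches
plt.figure(figsize=(10,6))
sns.scatterplot(x=df_new['plate_x'], y=df_new['plate_z'], hue=df_new['type'], palette=['blue', 'red'], alpha=0.5)
# approximate strike zone: half the 17-inch plate width in feet, median top/bottom of the zone
half_width = 17 / 2 / 12
zone_bottom = df_new['sz_bot'].median()
zone_top = df_new['sz_top'].median()
plt.gca().add_patch(patches.Rectangle((-half_width, zone_bottom), 2 * half_width, zone_top - zone_bottom,
                                      fill=False, edgecolor='black', linewidth=2))
plt.legend()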
3. SVM¶
X = df_new[['plate_x', 'plate_z']]
y = df_new['type']
# Split data to training and testing dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
We will not perform normalization for this SVM model because we want the raw values of our features to represent the location of the baseball in the plate x and z coordinates. Also, plate_x and plate_z are in the same unit of measurement, so there is no difference in scale between them.¶
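If the two features were on different scales, a common option would be to wrap StandardScaler and the SVM in a single pipeline so the scaler is fit only on the training split. A minimal sketch, not used in the rest of this notebook since we keep the raw coordinates:
from sklearn.pipeline import make_pipeline
scaled_svc = make_pipeline(StandardScaler(), SVC(random_state=0))
scaled_svc.fit(x_train, y_train)
print(scaled_svc.score(x_test, y_test))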
clf_svc = SVC(random_state = 0)
clf_svc.fit(x_train, y_train)
plt.figure(figsize=(10,6))
sns.scatterplot(x=x_train['plate_x'], y=x_train['plate_z'], hue=y_train, palette=['blue', 'red'], alpha=0.5)
4. Model Evaluation¶
# Accuracy score
print(clf_svc.score(x_train, y_train))
print(clf_svc.score(x_test, y_test))
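Accuracy alone can hide class-specific errors. A small optional addition (classification_report is not imported above) that shows per-class precision and recall:
from sklearn.metrics import classification_report
print(classification_report(y_test, clf_svc.predict(x_test), target_names=['ball (0)', 'strike (1)']))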
Confusion Matrix¶
# x_test prediction values
y_pred = clf_svc.predict(x_test)
model_matrix = confusion_matrix(y_test, y_pred)
model_matrix
Visualize¶
# code
fig, ax = plt.subplots(figsize=(8,5))
# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
5. Parameter Optimization¶
Find the best ‘gamma’ and ‘C’ using GridSearchCV.
grid_params = [{'gamma': ['scale',1, .1, .001, .0001],
'C': [0.5, 1, 10, 100] # Value for C must be > 0
}]
# Note:
# C default value is 1
# gamma default value is 'scale'
gs = GridSearchCV(estimator = clf_svc,
param_grid = grid_params,
scoring = 'accuracy',
cv = 5, )
gs.fit(x_train, y_train)
Show best parameter and score
print(gs.best_params_, gs.best_score_)
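Since GridSearchCV refits the best parameter combination on the whole training set by default (refit=True), we could also reuse gs.best_estimator_ directly instead of re-creating the SVC by hand as we do in the next step:
# equivalent shortcut to the manual refit below
best_svc = gs.best_estimator_
print(best_svc.score(x_test, y_test))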
6. Optimized SVM¶
clf_svc = SVC(random_state = 0, C = 0.5, gamma = 'scale')
clf_svc.fit(x_train, y_train)
print(clf_svc.score(x_train, y_train))
print(clf_svc.score(x_test, y_test))
# recompute predictions with the re-fitted model before building the confusion matrix
y_pred = clf_svc.predict(x_test)
model_matrix = confusion_matrix(y_test, y_pred)
# code
fig, ax = plt.subplots(figsize=(8,5))
# setting variables
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ['{0:0.0f}'.format(value) for value in model_matrix.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in model_matrix.flatten()/np.sum(model_matrix)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(model_matrix, annot=labels, fmt='', cmap='Purples')
There’s no significant difference between the scores of our tuned and untuned models.
7. Model Test Run¶
We can copy values from the data below, or we can make up our own. Look at the visualization graph above and think of a coordinate that would result in a strike or a ball.
df_new[['plate_x', 'plate_z', 'type']].head(4)
# coordinates taken from index 1 of the data above
Player1 = [[-0.08, 1.02]]
# this coordinate should surely be a strike!
Player2 = [[0,2.5]]
print(clf_svc.predict(Player1))
print(clf_svc.predict(Player2))
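Since the model was fit on a DataFrame with named columns, recent scikit-learn versions may warn when predicting on plain lists. An optional variant that keeps the feature names:
# same two pitches, passed as a DataFrame with the original column names
new_pitches = pd.DataFrame([[-0.08, 1.02], [0, 2.5]], columns=['plate_x', 'plate_z'])
print(clf_svc.predict(new_pitches))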
-fin