Two-Sample T-Test
A two-sample t-test checks for an association between a binary (two-category) categorical variable and a quantitative variable.
Suppose that a company is considering a new color scheme for its website. The company thinks that visitors will spend more time on the site if it is brightly colored. To test this theory, it shows the old version of the website to 50 visitors and the new version to another 50, and finds that, on average, visitors spent 2 minutes longer on the new version than on the old. Will this be true of future visitors as well? Or could this have happened by random chance among the 100 people in this sample?
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
data = pd.read_csv('version_time.csv')
data.head()
Out[1]:
In [6]:
# Separate out the times for the two versions
old = data.time_minutes[data.version=='old']
new = data.time_minutes[data.version=='new']
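Before running any test, it can help to confirm the group sizes and the observed difference in means. The following is a minimal sketch on a small made-up DataFrame that stands in for version_time.csv; the name `toy` and all of its values are invented for illustration and are not the company's data.

```python
import pandas as pd

# Hypothetical stand-in for version_time.csv (values are made up)
toy = pd.DataFrame({
    'version': ['old', 'old', 'old', 'new', 'new', 'new'],
    'time_minutes': [10.0, 12.0, 14.0, 13.0, 14.0, 15.0],
})

# Count and mean of time spent for each version
summary = toy.groupby('version')['time_minutes'].agg(['count', 'mean'])

# Observed difference in means (new minus old)
diff = summary.loc['new', 'mean'] - summary.loc['old', 'mean']

print(summary)
print(diff)
```

The same `groupby` call should work on `data` directly, assuming it has the `version` and `time_minutes` columns used above.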
In [9]:
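# Check whether the two standard deviations are roughly equal
# (ttest_ind assumes equal variances by default).
# Rule of thumb: a ratio between 0.9 and 1.1 should suffice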
ratio = np.std(old) / np.std(new)
ratio
Out[9]:
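If the ratio falls outside the 0.9 to 1.1 range, one common option is Welch's t-test, which `ttest_ind` runs when `equal_var=False` is passed. The sketch below is an illustration only: `compare_groups` is a hypothetical helper, the data are simulated, and the sample standard deviation (`ddof=1`) is an assumption made here.

```python
import numpy as np
from scipy.stats import ttest_ind

def compare_groups(a, b, low=0.9, high=1.1):
    """Hypothetical helper: use the pooled t-test when the SD ratio
    is within [low, high], otherwise fall back to Welch's t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sample standard deviations (ddof=1 is an assumption of this sketch)
    ratio = np.std(a, ddof=1) / np.std(b, ddof=1)
    equal_var = low <= ratio <= high
    tstat, pval = ttest_ind(a, b, equal_var=equal_var)
    return ratio, equal_var, tstat, pval

# Simulated groups with clearly different spreads (not the real data)
rng = np.random.default_rng(0)
a = rng.normal(20, 2, size=50)
b = rng.normal(22, 6, size=50)

ratio, equal_var, tstat, pval = compare_groups(a, b)
print(ratio, equal_var, pval)
```

Some analysts skip the ratio check and use Welch's test by default, since it does not rely on the equal-variance assumption.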
In [6]:
# Run the two-sample t-test
tstat, pval = ttest_ind(old, new)
pval
Out[6]:
The p-value is less than 0.05, so we can conclude that there is a significant difference in average time spent between the two versions.
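To see where the p-value comes from, the pooled (equal-variance) t-statistic can be computed by hand and compared with `ttest_ind`. This is a minimal sketch on simulated data; `pooled_t_test` is a hypothetical helper written for illustration, not part of SciPy.

```python
import numpy as np
from scipy import stats

def pooled_t_test(a, b):
    """Hypothetical helper: two-sided pooled-variance two-sample t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / df
    # Standard error of the difference in means
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (a.mean() - b.mean()) / se
    # Two-sided p-value from the t distribution with df degrees of freedom
    p = 2 * stats.t.sf(abs(t), df)
    return t, p

# Simulated times in minutes (not the real data)
rng = np.random.default_rng(1)
sim_old = rng.normal(20, 3, size=50)
sim_new = rng.normal(22, 3, size=50)

t_manual, p_manual = pooled_t_test(sim_old, sim_new)
t_scipy, p_scipy = stats.ttest_ind(sim_old, sim_new)
print(t_manual, p_manual)
```

The manual values should agree with the SciPy result up to floating-point error, since `ttest_ind` uses the pooled variance by default.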
In [7]:
# Plot overlapping histograms of time spent on each version
plt.hist(old, alpha=.8, label='old')
plt.hist(new, alpha=.8, label='new')
plt.legend()
plt.show()
In [5]:
import seaborn as sns
# Side-by-side boxplots of time spent for each version
sns.boxplot(x='version', y='time_minutes', data=data)
Out[5]: