Hypothesis Testing

FetchMaker

FetchMaker is the hottest new tech startup. Its mission is to match prospective dog owners with their perfect pet.

FetchMaker estimates (based on historical data for all dogs) that 8% of dogs in their system are rescues.

Part I

Rescued Whippets

  • Test whether the share of rescues among whippets (a dog breed) is consistent with the 8% rescue rate across all dogs.

Part II

Dog Size

  • Mid-Sized Dog Weights. Is there a significant difference in the average weights of these three dog breeds?

Part III

Dog Colors

  • Poodle and Shihtzu color differences. Do poodles and shihtzus come in different colors?

Details:

  • weight: an integer representing how heavy a dog is in pounds
  • tail_length: a float representing tail length in inches
  • age: in years
  • color: a String such as “brown” or “grey”
  • is_rescue: a boolean 0 or 1
In [1]:
import numpy as np
import pandas as pd
from scipy.stats import binom_test, ttest_ind, f_oneway, chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.multicomp import pairwise_tukeyhsd
In [2]:
dogs = pd.read_csv('dog_data.csv')
dogs.head(2)
Out[2]:
is_rescue weight tail_length age color likes_children is_hypoallergenic name breed
0 0 6 2.25 2 black 1 0 Huey chihuahua
1 0 4 5.36 4 black 0 0 Cherish chihuahua

I. Rescued whippets.

Based on historical data for all dogs, 8% of dogs are rescues.

  • How many whippets are there in total?
  • How many whippets are rescued?
In [21]:
print('Total dogs :' + str(len(dogs)))
print('Total rescued dogs :' + str(np.sum(dogs['is_rescue']==1)))

# Setting Variables
whippets = dogs[dogs['breed'] == 'whippet']
rescue_whippet = np.sum(whippets['is_rescue'] == 1)
print('Total whippets: ' + str(len(whippets)))
print('Total rescue whippets: ' + str(rescue_whippet))
Total dogs :800
Total rescued dogs :73
Total whippets: 100
Total rescue whippets: 6

Hypothesis Test Binomial

Used for binary categorical data, to compare a sample frequency to an expected population-level probability.

In the data, 6% of whippets (6 of 100) are rescues. Let’s test whether that is consistent with a true rescue rate of 8% among whippets.

  • Null: 8% of whippets are rescues (the observed 6% is not significantly different from 8%).
  • Alternative: more or less than 8% of whippets are rescues.

Using binom_test function

In [73]:
# binom_test runs an exact binomial test: it compares the observed number of
# rescued whippets with a binomial distribution where p = 0.08
pval = binom_test(rescue_whippet, len(whippets), .08)
print(pval)
0.5811780106238105

The p-value (about 0.58) is well above 0.05, so we fail to reject the null hypothesis: the observed rescue rate among whippets (6%) is not significantly different from 8%.
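
To make the calculation behind the test concrete, here is a minimal standard-library sketch of an exact two-sided binomial test. The helper names `binom_pmf` and `exact_two_sided_pvalue` are illustrative, and the counts (6 rescues out of 100 whippets, null rate 0.08) are copied from the output above. It sums the probability of every outcome that is no more likely than the observed one, which is a common definition of the exact two-sided p-value. Note that newer SciPy releases provide `binomtest` in place of `binom_test`, so the import at the top may need adjusting there.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_two_sided_pvalue(observed, n, p):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    # Small relative tolerance so floating-point ties are counted as "equally likely"
    threshold = binom_pmf(observed, n, p) * (1 + 1e-7)
    return sum(binom_pmf(k, n, p) for k in range(n + 1)
               if binom_pmf(k, n, p) <= threshold)

# 6 rescued whippets out of 100, tested against a null rescue rate of 8%
pval_exact = exact_two_sided_pvalue(observed=6, n=100, p=0.08)
print(round(pval_exact, 4))
```

The printed value should agree closely with the p-value reported by the library call.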

Manual analysis

Creating a simulation to test our hypothesis

In [54]:
# Simulate samples of 100 whippets under the null hypothesis and record the
# number of rescues in each sample.
null_outcomes = []

# np.random.choice draws 'y' (rescue) with probability 0.08 and 'n' (non-rescue) otherwise.
# The sample size is 100 (the total number of whippets).
# Repeat this 300 times.
for i in range(300):
  simulated_whippets = np.random.choice(['y', 'n'], size=100, p=[0.08, 1-.08])
  num_rescued = np.sum(simulated_whippets == 'y')
  null_outcomes.append(num_rescued)

# Showing the first 10 results
null_outcomes[0:10]
Out[54]:
[11, 11, 8, 9, 8, 13, 7, 6, 8, 4]
In [85]:
plt.figure(figsize=(7,5.5))
plt.hist(null_outcomes)
plt.axvline(rescue_whippet, color='r', linestyle='--', label='observed rescue whippets')
plt.axvline(len(whippets) * .08, color='g', linestyle='--', label='expected rescue whippets')

# Pass scalar percentiles so axvline receives a single x position
plt.axvline(np.percentile(null_outcomes, 2.5), color='purple', linestyle='--', label='2.5th percentile')
plt.axvline(np.percentile(null_outcomes, 97.5), color='purple', linestyle='-', label='97.5th percentile')



plt.legend(fontsize = 'small')
plt.show()

Confidence Interval (95%)

Under the null hypothesis, the number of rescued whippets in a sample of 100 should fall between 3 and 13 about 95% of the time.

In [86]:
# Cut 2.5% from each tail of the null distribution, leaving the central 95%
np.percentile(null_outcomes, [2.5,97.5])
Out[86]:
array([ 3., 13.])

If 8% of whippets are rescues, 95% of the simulated samples contain between 3 and 13 rescued whippets. We observed 6, which falls inside this range, so at the 5% significance level we fail to reject the null hypothesis: the observed rescue rate is not significantly different from 8%.
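
The simulated bounds depend on the random draws. As a cross-check, the same central 95% range can be read off the exact Binomial(100, 0.08) distribution. This is a sketch using only the standard library; the helper names `binom_cdf_table` and `binom_quantile` are illustrative and not part of any library.

```python
from math import comb

def binom_cdf_table(n, p):
    """Cumulative probabilities P(X <= k) for k = 0..n under Binomial(n, p)."""
    total, table = 0.0, []
    for k in range(n + 1):
        total += comb(n, k) * p**k * (1 - p)**(n - k)
        table.append(total)
    return table

def binom_quantile(q, n, p):
    """Smallest k whose cumulative probability reaches q."""
    for k, cumulative in enumerate(binom_cdf_table(n, p)):
        if cumulative >= q:
            return k
    return n

# Central 95% range of rescued whippets under the null rate of 8%
lower = binom_quantile(0.025, n=100, p=0.08)
upper = binom_quantile(0.975, n=100, p=0.08)
print(lower, upper)
```

On the exact distribution the range comes out as 3 to 14. With only 300 simulated samples the upper bound can land on 13, as in the output above. Either way, the observed count of 6 sits well inside the range.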

P-value (two sided)

In [87]:
# Turn into array
null_outcomes = np.array(null_outcomes)
# The expected value is 8 and the observed value is 6.
# 8 - 6 = 2, so the equally extreme value on the right side of the null distribution is 8 + 2 = 10.
p_value_twoside = np.sum((null_outcomes <= 6) | (null_outcomes >= 10))/len(null_outcomes)
p_value_twoside
Out[87]:
0.5466666666666666

The simulated p-value (about 0.55) is close to the one returned by binom_test (about 0.58). Both are well above 0.05, so we fail to reject the null hypothesis that 8% of whippets are rescues.
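
The two-sided rule can also be written without hard-coding 6 and 10. The sketch below is a standard-library variant, not the notebook's own code: the function name `simulated_two_sided_pvalue`, the seed and the 10,000 trials are assumptions chosen for illustration. It counts simulated samples that are at least as far from the expected value as the observed count.

```python
import random

def simulated_two_sided_pvalue(observed, n, p, trials=10_000, seed=0):
    """Share of simulated samples at least as far from n*p as the observed count."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    expected = n * p
    distance = abs(observed - expected)
    extreme = 0
    for _ in range(trials):
        # One simulated sample: count the "rescues" among n dogs
        rescued = sum(rng.random() < p for _ in range(n))
        # Tiny tolerance guards against floating-point rounding in n * p
        if abs(rescued - expected) >= distance - 1e-9:
            extreme += 1
    return extreme / trials

p_value_sim = simulated_two_sided_pvalue(observed=6, n=100, p=0.08)
print(p_value_sim)
```

With more trials the estimate is less noisy than with 300, so it should sit closer to the exact test's p-value.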

Visualizing our Null Outcomes over 300 loop trials

In [88]:
plt.figure(figsize=(15,7))
plt.plot(range(300), null_outcomes, label='simulated rescued whippets over 300 trials')
plt.axhline(len(whippets) * 0.08, color='r', linestyle='--', label='8% (expected)')
plt.axhline(rescue_whippet, color='purple', linestyle='--', label='observed rescued whippets')

plt.xlabel('Trial number')
plt.ylabel('Number of rescued whippets')

plt.legend()
Out[88]:
<matplotlib.legend.Legend at 0x227b83f2910>

II. Dog Size

Mid Sized Dog Weight (whippets, terriers and pitbulls)

Is there a significant difference in the average weights of these three dog breeds?

In [89]:
# Subset to just whippets, terriers, and pitbulls
dogs_wtp = dogs[dogs.breed.isin(['whippet', 'terrier', 'pitbull'])]
dogs_wtp.head(2)
Out[89]:
is_rescue weight tail_length age color likes_children is_hypoallergenic name breed
200 0 71 5.74 4 black 0 0 Charlot pitbull
201 0 26 11.56 3 black 0 0 Jud pitbull

Checking the weights distribution

The weights of the three breeds look roughly normally distributed. There is a single outlier in the terrier breed, but the distribution is not skewed heavily enough to distort our analysis.

In [90]:
# Setting variables
pitbull_weight = dogs_wtp.weight[dogs_wtp['breed'] == 'pitbull']
terrier_weight = dogs_wtp.weight[dogs_wtp['breed']  == 'terrier']
whippet_weight = dogs_wtp.weight[dogs_wtp['breed']  == 'whippet']
In [92]:
plt.figure(figsize=(6,5))
plt.hist(pitbull_weight, label='Pitbull', density=True, alpha=.5 )
plt.hist(terrier_weight, label='Terrier', density=True, alpha=.5)
plt.hist(whippet_weight, label='Whippet', density=True, alpha=.5)
plt.legend()
Out[92]:
<matplotlib.legend.Legend at 0x227b8507cd0>

Hypothesis Test: ANOVA and Tukey’s Range Test

Used to test for an association between a quantitative variable and a non-binary categorical variable. ANOVA tests the null hypothesis that all groups have the same population mean.

  • Null: pitbulls, terriers and whippets all have the same mean weight.
  • Alternative: at least one pair of breeds differs in mean weight.
In [93]:
fstat, pval = f_oneway(pitbull_weight,
                       terrier_weight,
                       whippet_weight)
print('Pvalue: ' + str('{:.20f}'.format(pval)))
Pvalue: 0.00000000000000003276
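
For intuition about what the F-test measures, here is a minimal sketch of the one-way ANOVA F statistic: the variation between group means divided by the variation within groups. The function name `one_way_f_statistic` is illustrative, and the weights are made-up numbers, not values from dog_data.csv. Turning F into a p-value additionally needs the F distribution, which the library function handles.

```python
from statistics import mean

def one_way_f_statistic(*groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical weights in pounds, NOT taken from dog_data.csv
pitbull = [50, 52, 54]
terrier = [38, 40, 42]
whippet = [47, 49, 51]
f_stat = one_way_f_statistic(pitbull, terrier, whippet)
print(f_stat)
```

A large F means the group means are spread far apart relative to the scatter inside each group, which is what drives the p-value toward zero.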

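The p-value is far below 0.05, so we reject the null hypothesis: at least one breed differs in mean weight. Tukey’s range test shows which pairs differ significantly.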

In [94]:
tukey_results =  pairwise_tukeyhsd(dogs_wtp.weight, dogs_wtp['breed'], 0.05)
print(tukey_results)
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
=======================================================
 group1  group2 meandiff p-adj   lower    upper  reject
-------------------------------------------------------
pitbull terrier   -13.24    0.0 -16.7278 -9.7522   True
pitbull whippet    -3.34 0.0638  -6.8278  0.1478  False
terrier whippet      9.9    0.0   6.4122 13.3878   True
-------------------------------------------------------

Visualize with boxplot

In [45]:
sns.boxplot(data=dogs_wtp, x='breed', y='weight')
plt.show()

Conclusion:

P-values less than 0.05 are significant.

True = reject the null in favor of the alternative hypothesis (significant)

False = fail to reject the null (not significant)

  1. Pitbull and terrier: TRUE, significantly different. Pitbulls are on average about 13 lbs heavier than terriers.

  2. Pitbull and whippet: FALSE, not significantly different (p-adj = 0.0638). The data do not show a difference in average weight between pitbulls and whippets.

  3. Terrier and whippet: TRUE, significantly different. Terriers are on average about 10 lbs lighter than whippets.

Further analysis

What if we remove or change the single outlier in the terrier weights? Does the outlier have a significant effect on our results?

Let’s see what the outcome will be.

Two-Sample T-Test

In [95]:
tstat, pval = ttest_ind(terrier_weight, whippet_weight)
print('Pvalue: ' + str('{:.20f}'.format(pval)))
Pvalue: 0.00000000116992867788
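
As a rough illustration of the statistic behind this test, the sketch below computes a pooled-variance two-sample t statistic, which is what `ttest_ind` uses by default (equal variances assumed). The function name `pooled_t_statistic` is illustrative, and the weights are made-up numbers, not values from dog_data.csv.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming both groups share one variance."""
    na, nb = len(a), len(b)
    # Weighted average of the two sample variances
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

# Hypothetical weights in pounds, NOT taken from dog_data.csv
terrier = [38, 40, 42]
whippet = [47, 49, 51]
t_stat = pooled_t_statistic(terrier, whippet)
print(round(t_stat, 3))
```

The sign only reflects the order of the arguments: a negative value means the first group has the lower mean.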

Looking at the value counts, there is a single observation with a weight under 10 lbs. A terrier weighing only 2 lbs is very unlikely, so I will replace this value with the terrier mean weight.

In [96]:
terrier_weight.value_counts(ascending=True).head(3)
Out[96]:
45    1
2     1
41    1
Name: weight, dtype: int64
In [97]:
# Replace terrier weights of 3 lbs or less (the 2 lb outlier) with the mean terrier weight
terrier_weight = terrier_weight.where(terrier_weight > 3, terrier_weight.mean())
In [98]:
# Verify using graph
# Revisiting our terrier weights distribution
plt.hist(terrier_weight, label='Terrier', density=True, alpha=.5, color='orange')
plt.legend()
plt.title('Terrier Weight')
Out[98]:
Text(0.5, 1.0, 'Terrier Weight')
In [99]:
tstat, pval = ttest_ind(terrier_weight, whippet_weight)
print('Pvalue: ' + str('{:.20f}'.format(pval)))
Pvalue: 0.00000000173914212588

Further analysis conclusion:

Replacing the terrier outlier has almost no impact on the p-value, which stays far below 0.05. I stand by the initial result that terriers and whippets are significantly different in terms of weight.

III. Dog Colors

Poodle and Shihtzu Color

FetchMaker wants to know if ‘poodle’s and ‘shihtzu’s come in different colors.

In [100]:
# Setting Variables

# Subset to just poodles and shihtzus
dogs_ps = dogs[dogs.breed.isin(['poodle', 'shihtzu'])]

dogs_poodle = dogs_ps[dogs_ps['breed']=='poodle']
dogs_shihtzu = dogs_ps[dogs_ps['breed']=='shihtzu']

dogs_ps.head(3)
Out[100]:
is_rescue weight tail_length age color likes_children is_hypoallergenic name breed
300 0 58 8.05 1 black 1 0 Moise poodle
301 0 56 9.44 4 black 1 0 Boote poodle
302 1 59 4.04 4 black 1 0 Beatrix poodle

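Chi-Square Test

Used to test for an association between two categorical variables.

  • Null: color is independent of breed (poodles and shihtzus have the same color distribution).
  • Alternative: color is associated with breed.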

In [102]:
Contingency_table = pd.crosstab(dogs_ps.breed, dogs_ps.color)
Contingency_table
Out[102]:
color black brown gold grey white
breed
poodle 17 13 8 52 10
shihtzu 10 36 6 41 7
In [106]:
# Setting variables
# I copied the counts from Contingency_table by hand, which was easier than extracting them in code.
color = ['black', 'brown', 'gold', 'grey', 'white']
poodle = [17,13,8,52,10]  
shihtzu = [10,36,6,41,7]
In [104]:
# hide/unhide code
x = np.arange(len(color))  # the color locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(7, 5))
rects1 = ax.bar(x - width/2, poodle, width, label='Poodle')
rects2 = ax.bar(x + width/2, shihtzu, width, label='Shihtzu')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Color Counts')
ax.set_title('Poodle and Shihtzu Colors')
ax.set_xticks(x)
ax.set_xticklabels(color)
ax.legend()

def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
fig.tight_layout()
In [105]:
chi2, pval, dof, expected = chi2_contingency(Contingency_table)
print('Pvalue: ' + str('{:.10f}'.format(pval)))
Pvalue: 0.0053024083

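The p-value (about 0.005) is below 0.05, so we reject the null hypothesis that color is independent of breed: poodles and shihtzus differ significantly in their color distributions. The largest gap is in brown (13 poodles versus 36 shihtzus).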
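
For reference, the chi-square statistic can be reproduced by hand from the contingency table above. This is a minimal standard-library sketch, not the library's implementation: it builds the expected counts from the row and column totals, and it uses a closed-form tail formula that is valid only for 4 degrees of freedom, which is what a 2 x 5 table gives.

```python
from math import exp

# Counts copied from the contingency table (black, brown, gold, grey, white)
observed = [
    [17, 13, 8, 52, 10],  # poodle
    [10, 36, 6, 41, 7],   # shihtzu
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Expected count if color were independent of breed
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (count - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (2 - 1) * (5 - 1) = 4

# Closed-form upper tail of the chi-square distribution, valid for dof == 4 only
p_value = exp(-chi2_stat / 2) * (1 + chi2_stat / 2)
print(round(chi2_stat, 3), dof, round(p_value, 6))
```

The p-value from this sketch should match the one printed by `chi2_contingency` above, since no continuity correction applies when there are more than one degree of freedom.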
