Hypothesis Testing1

FetchMaker

FetchMaker is the hottest new tech startup. Its mission is to match up prospective dog owners with their perfect pet.

FetchMaker estimates (based on historical data for all dogs) that 8% of dogs in their system are rescues.

Part I

Rescued Whippets

  • Test whether the share of rescues among whippets (a dog breed) is consistent with the 8% rescue rate for all dogs.

Part II

Dog Sizes

  • Mid-sized dog weights. Is there a significant difference in the average weights of three dog breeds (whippets, terriers, and pitbulls)?

Part III

Dog Colors

  • Color differences between poodles and shih tzus.

Details:

  • weight: an integer representing how heavy a dog is in pounds
  • tail_length: a float representing tail length in inches
  • age: the dog's age in years
  • color: a string such as “brown” or “grey”
  • is_rescue: a boolean encoded as 0 or 1
In [1]:
import numpy as np
import pandas as pd
from scipy.stats import binom_test, ttest_ind, f_oneway, chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.multicomp import pairwise_tukeyhsd
In [2]:
dogs = pd.read_csv('dog_data.csv')
dogs.head(2)
Out[2]:
is_rescue weight tail_length age color likes_children is_hypoallergenic name breed
0 0 6 2.25 2 black 1 0 Huey chihuahua
1 0 4 5.36 4 black 0 0 Cherish chihuahua

I. Rescued whippets.

Based on historical data for all dogs, 8% of dogs are rescues.

  • How many whippets are there in total?
  • How many whippets are rescued?
In [21]:
print('Total dogs :' + str(len(dogs)))
print('Total rescued dogs :' + str(np.sum(dogs['is_rescue']==1)))

# Setting Variables
whippets = dogs[dogs['breed'] == 'whippet']
rescue_whippet = np.sum(whippets['is_rescue'] == 1)
print('Total whippets: ' + str(len(whippets)))
print('Total rescue whippets: ' + str(rescue_whippet))
Total dogs :800
Total rescued dogs :73
Total whippets: 100
Total rescue whippets: 6

Binomial Hypothesis Test

Used for binary categorical data to compare a sample frequency to an expected population-level probability.

In our records, 6% of whippets (6 of 100) are rescues. Let’s test whether this is consistent with an 8% rescue rate among whippets.

  • Null: 8% of whippets are rescues (the observed 6% is not significantly different from 8%).
  • Alternative: more or less than 8% of whippets are rescues.
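To make the two-sided test concrete, here is a minimal sketch of how an exact binomial p-value can be computed by hand. It assumes the counts reported above (6 rescues among 100 whippets) and uses the common "sum every outcome that is no more likely than the observed one" definition of a two-sided p-value. The variable names are illustrative only.

```python
import numpy as np
from scipy.stats import binom

n, p_null, observed = 100, 0.08, 6  # counts taken from the whippet summary

# Probability of every possible rescue count (0..100) under the null
counts = np.arange(n + 1)
pmf = binom.pmf(counts, n, p_null)

# Two-sided p-value: total probability of outcomes that are no more
# likely than the observed count (small tolerance guards against
# floating-point ties)
p_value = pmf[pmf <= pmf[observed] * (1 + 1e-7)].sum()
print(p_value)
```

The result should be well above 0.05, which is the usual threshold for rejecting the null hypothesis.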

Using the binom_test function

In [73]:
# binom_test runs an exact two-sided binomial test of the observed count
# (rescued whippets out of all whippets) against the hypothesized rate of 8%
pval = binom_test(rescue_whippet, len(whippets), .08)
print(pval)
0.5811780106238105

The p-value is far above 0.05, so we fail to reject the null hypothesis: the observed share of rescued whippets (6%) is not significantly different from 8%.
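Newer SciPy releases replace binom_test with binomtest, which returns a result object instead of a bare p-value. If the import above fails in your environment, a sketch along these lines should give the same test, assuming the counts shown earlier (6 rescues among 100 whippets):

```python
from scipy.stats import binomtest

# k = observed rescues, n = number of whippets, p = hypothesized rescue rate
result = binomtest(k=6, n=100, p=0.08, alternative='two-sided')
print(result.pvalue)
```

The p-value is read from the pvalue attribute of the returned object.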

Manual analysis

Creating a simulation to test our hypothesis

In [54]:
# Simulate 100 whippets under the null hypothesis, where each dog has an 8%
# chance of being a rescue ('y'), and record the number of rescues.
# Repeat 300 times and collect the counts in a list.
null_outcomes = []

for i in range(300):
  # np.random.choice draws 'y' (rescue) with probability 0.08 and 'n' otherwise;
  # the sample size of 100 matches the total number of whippets
  simulated_whippets = np.random.choice(['y', 'n'], size=100, p=[0.08, 1-.08])
  num_rescued = np.sum(simulated_whippets == 'y')
  null_outcomes.append(num_rescued)

# Showing the first 10 results
null_outcomes[0:10]
Out[54]:
[11, 11, 8, 9, 8, 13, 7, 6, 8, 4]
In [85]:
plt.figure(figsize=(7, 5.5))
plt.hist(null_outcomes)
plt.axvline(rescue_whippet, color='r', linestyle='--', label='observed rescue whippets')
plt.axvline(len(whippets) * .08, color='g', linestyle='--', label='expected rescue whippets')

# Mark the bounds of the central 95% of the simulated null distribution
lower, upper = np.percentile(null_outcomes, [2.5, 97.5])
plt.axvline(lower, color='purple', linestyle='--', label='2.5th percentile')
plt.axvline(upper, color='purple', linestyle='-', label='97.5th percentile')

plt.legend(fontsize='small')
plt.show()

Confidence Interval (95%)

If 8% of whippets are rescues, about 95% of samples of 100 whippets should contain between 3 and 13 rescues.

In [86]:
# Cut 2.5% from each tail of the null distribution to keep the central 95%
np.percentile(null_outcomes, [2.5,97.5])
Out[86]:
array([ 3., 13.])

Under the null hypothesis of an 8% rescue rate, the number of rescued whippets should fall between 3 and 13, and we observed 6. Because the observed count lies inside this range, we fail to reject the null hypothesis at the 5% significance level: the observed number of rescued whippets is not significantly different from what an 8% rescue rate would produce.
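Percentile bounds taken from only 300 simulated samples are noisy and can shift by a count or so from run to run. As a cross-check, the same 2.5% and 97.5% quantiles can be read directly from the binomial distribution. This is a minimal sketch assuming 100 whippets and an 8% null rate; the variable names are illustrative.

```python
from scipy.stats import binom

n, p_null = 100, 0.08  # sample size and hypothesized rescue rate

# Quantiles of the exact null distribution Binomial(100, 0.08)
lower = binom.ppf(0.025, n, p_null)
upper = binom.ppf(0.975, n, p_null)
print(lower, upper)
```

The exact bounds should sit close to the simulated ones, and the observed count of 6 can be compared against either.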

P-value (two sided)

In [87]:
# Convert the list to an array so it can be compared element-wise
null_outcomes = np.array(null_outcomes)
# The expected count is 8 and the observed count is 6, a difference of 2.
# The equally extreme count in the right tail is therefore 8 + 2 = 10.
p_value_twoside = np.sum((null_outcomes <= 6) | (null_outcomes >= 10)) / len(null_outcomes)
p_value_twoside
Out[87]:
0.5466666666666666

The simulated p-value (about 0.55) is close to the one returned by binom_test (about 0.58). Both are far above 0.05, so at the 5% significance level the data are consistent with 8% of whippets being rescues.
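The gap between the two p-values comes mostly from the small number of simulated samples. The sketch below wraps the simulation in a hypothetical helper, simulated_p_value (not part of the original notebook), and uses NumPy's seeded Generator so the result is reproducible. It measures "at least as extreme" as the distance from the expected count, which matches the `<= 6` or `>= 10` rule used above.

```python
import numpy as np

def simulated_p_value(observed, n, p_null, trials, seed=0):
    """Two-sided p-value estimated from simulated binomial counts."""
    rng = np.random.default_rng(seed)  # seed value is arbitrary
    # Each draw is the number of rescues in one simulated sample of n dogs
    counts = rng.binomial(n, p_null, size=trials)
    expected = n * p_null
    # Share of simulated counts at least as far from the expected count
    # as the observed count is
    return np.mean(np.abs(counts - expected) >= abs(observed - expected))

for trials in (300, 100_000):
    print(trials, simulated_p_value(6, 100, 0.08, trials))
```

With more trials the estimate should settle near the exact test's p-value, while the 300-trial estimate can wander by a few hundredths.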

Visualizing our Null Outcomes over 300 loop trials

In [88]:
plt.figure(figsize=(15, 7))
plt.plot(range(300), null_outcomes, label='simulated rescued whippets over 300 trials')
plt.axhline(len(whippets) * 0.08, color='r', linestyle='--', label='8% (expected)')
plt.axhline(rescue_whippet, color='purple', linestyle='--', label='observed rescued whippets')

plt.xlabel('Trial')
plt.ylabel('Number of rescued whippets')

plt.legend()

II. Dog Sizes

Mid-Sized Dog Weights (whippets, terriers, and pitbulls)

Is there a significant difference in the average weights of these three dog breeds?

In [89]:
# Subset to just whippets, terriers, and pitbulls
dogs_wtp = dogs[dogs.breed.isin(['whippet', 'terrier', 'pitbull'])]
dogs_wtp.head(2)
Out[89]:
is_rescue weight tail_length age color likes_children is_hypoallergenic name breed
200 0 71 5.74 4 black 0 0 Charlot pitbull
201 0 26 11.56 3 black 0 0 Jud pitbull

Checking the weights distribution

The weights of these three breeds appear roughly normally distributed. There is a single outlier in the terrier breed, but the distribution is not skewed heavily enough to distort our analysis.
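A visual check can be backed up with a few summary numbers per breed. The sketch below is illustrative only: it builds a hypothetical stand-in table (the means and spreads are made up, not taken from dog_data.csv) and reports the mean, standard deviation, and skewness of each group. In the notebook, the same groupby call could be applied to dogs_wtp instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, NOT the real dog weights
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'breed': np.repeat(['pitbull', 'terrier', 'whippet'], 100),
    'weight': np.concatenate([
        rng.normal(50, 10, 100),
        rng.normal(45, 10, 100),
        rng.normal(40, 10, 100),
    ]),
})

# Per-breed summary: skewness near 0 suggests a roughly symmetric shape
summary = demo.groupby('breed')['weight'].agg(['mean', 'std', 'skew'])
print(summary)

# Ratio of largest to smallest standard deviation; values close to 1
# suggest the groups have similar spread
std_ratio = summary['std'].max() / summary['std'].min()
print(std_ratio)
```

Skewness near zero and a standard deviation ratio near 1 would support the reading of the histograms above.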

In [90]:
# Setting variables
pitbull_weight = dogs_wtp.weight[dogs_wtp['breed'] == 'pitbull']
terrier_weight = dogs_wtp.weight[dogs_wtp['breed'] == 'terrier']
whippet_weight = dogs_wtp.weight[dogs_wtp['breed'] == 'whippet']
In [92]:
plt.figure(figsize=(6,5))
plt.hist(pitbull_weight, label='Pitbull', density=True, alpha=.5 )
plt.hist(terrier_weight, label='Terrier', density=True, alpha=.5)
plt.hist(whippet_weight, label='Whippet', density=True, alpha=.5)
plt.legend()