FetchMaker
FetchMaker is the hottest new tech startup. Its mission is to match prospective dog owners with their perfect pet.
FetchMaker estimates (based on historical data for all dogs) that 8% of dogs in their system are rescues.
Part I
Rescued Whippets
- Test whether the share of rescued whippets (a dog breed) is consistent with the 8% rescue rate across all dogs.
Part II
Dog Sizes
- Mid-sized dog weights: is there a significant difference in the average weights of these three dog breeds?
Part III
Dog Colors
- Poodle and Shihtzu color differences.
Details:
- weight: an integer representing how heavy a dog is, in pounds
- tail_length: a float representing tail length, in inches
- age: in years
- color: a string such as "brown" or "grey"
- is_rescue: a boolean, 0 or 1
import numpy as np
import pandas as pd
from scipy.stats import binom_test, ttest_ind, f_oneway, chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.multicomp import pairwise_tukeyhsd
dogs = pd.read_csv('dog_data.csv')
dogs.head(2)
I. Rescued Whippets
Based on historical data for all dogs, 8% of dogs are rescues.
- How many whippets are there in total?
- How many whippets are rescued?
print('Total dogs :' + str(len(dogs)))
print('Total rescued dogs :' + str(np.sum(dogs['is_rescue']==1)))
# Setting Variables
whippets = dogs[dogs['breed'] == 'whippet']
rescue_whippet = np.sum(whippets['is_rescue'] == 1)
print('Total whippets: ' + str(len(whippets)))
print('Total rescue whippets: ' + str(rescue_whippet))
Hypothesis Test: Binomial
A binomial test is used for binary categorical data, to compare a sample frequency to an expected population-level probability.
In our records, 6% of the whippets are rescues. Let's test whether that is consistent with a true rescue rate of 8% among whippets.
- Null: 8% of whippets are rescues (the observed 6% is not significantly different from 8%).
- Alternative: more or less than 8% of whippets are rescues.
Using the binom_test function
# binom_test runs an exact two-sided binomial test: the probability, under the null rate of 8%,
# of seeing a rescue count at least as unlikely as the one observed
pval = binom_test(rescue_whippet, len(whippets), .08)
print(pval)
The p-value is well above 0.05, so we fail to reject the null hypothesis: the observed share of rescued whippets (6%) is not significantly different from 8%.
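A note on the function used above: `binom_test` was deprecated in SciPy 1.10 and removed in SciPy 1.12, where `binomtest` replaces it. The sketch below shows the same test with the newer function. The counts (6 rescues among 100 whippets) are assumptions taken from the figures quoted in the text, not read from `dog_data.csv`.

```python
from scipy.stats import binomtest

# Assumed counts, taken from the text: 6 rescues among 100 whippets
observed_rescues = 6
total_whippets = 100

# binomtest returns a result object; the p-value is an attribute of it
result = binomtest(observed_rescues, n=total_whippets, p=0.08, alternative='two-sided')
pval_new = result.pvalue
print(pval_new)
```

With the real data, `observed_rescues` and `total_whippets` would be `rescue_whippet` and `len(whippets)`.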
Manual analysis
Creating a simulation to test our hypothesis
# Simulate a sample of 100 whippets in which each dog has an 8% chance of being a rescue ('y'),
# count the rescues, and repeat 300 times. The counts form the null distribution.
null_outcomes = []
# np.random.choice draws 'y' (rescue) or 'n' (non-rescue) with probabilities 0.08 and 0.92.
# The sample size is 100 (the total number of whippets).
for i in range(300):
    simulated_whippets = np.random.choice(['y', 'n'], size=100, p=[0.08, 1-.08])
    num_rescued = np.sum(simulated_whippets == 'y')
    null_outcomes.append(num_rescued)
# Showing the first 10 results
null_outcomes[0:10]
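The loop above can also be written as a single vectorized draw, since counting 'y' values in 100 independent 8% draws is one sample from a binomial distribution. This is a minimal sketch, not the notebook's method; the names `rng` and `null_outcomes_vec` and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed is an arbitrary choice for reproducibility

# Each draw is the number of rescues in one simulated sample of 100 whippets
# when every whippet has an 8% chance of being a rescue.
null_outcomes_vec = rng.binomial(n=100, p=0.08, size=300)

print(null_outcomes_vec[:10])
print(null_outcomes_vec.mean())  # should be close to 100 * 0.08 = 8
```

The result is already a NumPy array, so it can be used directly in the percentile and p-value steps.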
plt.figure(figsize=(7,5.5))
plt.hist(null_outcomes)
plt.axvline(rescue_whippet , color = 'r', linestyle='--', label ='observed rescue whippets')
plt.axvline((len(whippets) * .08), color = 'g', linestyle='--', label ='expected rescue whippets')
plt.axvline(np.percentile(null_outcomes, 2.5), color = 'purple', linestyle='--', label ='2.5% percentile')
plt.axvline(np.percentile(null_outcomes, 97.5), color = 'purple', linestyle='-', label ='97.5% percentile')
plt.legend(fontsize = 'small')
plt.show()
Confidence Interval (95%)
The middle 95% of the simulated null outcomes falls between about 3 and 14 rescued whippets.
# Cut 2.5% from each tail of the null distribution, leaving the middle 95%
np.percentile(null_outcomes, [2.5,97.5])
If 8% of whippets are rescues, a sample of 100 should contain between about 3 and 14 rescues 95% of the time (the exact bounds shift slightly from run to run because the simulation is random). We observed 6, which is inside that range, so we fail to reject the null hypothesis: the observed share of rescued whippets is not significantly different from 8%.
P-value (two sided)
# Turn into array
null_outcomes = np.array(null_outcomes)
# The expected value is 8 and the observed value is 6, a difference of 2.
# For the two-sided test, count outcomes at least that far from 8 in either direction:
# 6 or fewer (8 - 2) or 10 or more (8 + 2).
p_value_twoside = np.sum((null_outcomes <= 6) | (null_outcomes >= 10))/len(null_outcomes)
p_value_twoside
The simulated p-value is close to the one returned by binom_test. Both are well above 0.05, so we fail to reject the null hypothesis: the data are consistent with 8% of whippets being rescues.
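The two-sided calculation above hard-codes the cutoffs 6 and 10. As a sketch, the same logic can be wrapped in a small helper that derives the cutoffs from the observed and expected values. The function name `two_sided_p_value` and the toy data are hypothetical and only there for illustration.

```python
import numpy as np

def two_sided_p_value(null_outcomes, observed, expected):
    """Share of simulated outcomes at least as far from `expected` as `observed` is."""
    null_outcomes = np.asarray(null_outcomes)
    distance = abs(observed - expected)
    extreme = np.abs(null_outcomes - expected) >= distance
    return extreme.mean()

# Toy null distribution, made up for illustration only
toy_outcomes = [4, 6, 7, 8, 8, 9, 10, 12]
toy_p = two_sided_p_value(toy_outcomes, observed=6, expected=8)
print(toy_p)
```

With the notebook's variables, the call would be `two_sided_p_value(null_outcomes, 6, 8)`, which counts the same outcomes as `(null_outcomes <= 6) | (null_outcomes >= 10)`.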
Visualizing our Null Outcomes over 300 loop trials
plt.figure(figsize=(15,7))
plt.plot(range(300), null_outcomes, label='simulated rescued whippets per trial (100 whippets each)')
plt.axhline(len(whippets) * 0.08, color = 'r', linestyle = '--', label='8% (expected)')
plt.axhline(rescue_whippet , color = 'purple', linestyle='--', label ='observed rescued whippets')
plt.xlabel("Trial number")
plt.ylabel('Number of rescued whippets')
plt.legend()
plt.show()
II. Dog Sizes
Mid-Sized Dog Weights (whippets, terriers and pitbulls)
Is there a significant difference in the average weights of these three dog breeds?
# Subset to just whippets, terriers, and pitbulls
dogs_wtp = dogs[dogs.breed.isin(['whippet', 'terrier', 'pitbull'])]
dogs_wtp.head(2)
Checking the weight distributions
The weights of all three breeds look approximately normally distributed. There is a single outlier in the terrier breed, but the distribution is not skewed heavily enough to distort our analysis.
# Setting variables
pitbull_weight = dogs_wtp.weight[dogs_wtp['breed'] == 'pitbull']
terrier_weight = dogs_wtp.weight[dogs_wtp['breed'] == 'terrier']
whippet_weight = dogs_wtp.weight[dogs_wtp['breed'] == 'whippet']
plt.figure(figsize=(6,5))
plt.hist(pitbull_weight, label='Pitbull', density=True, alpha=.5 )
plt.hist(terrier_weight, label='Terrier', density=True, alpha=.5)
plt.hist(whippet_weight, label='Whippet', density=True, alpha=.5)
plt.legend()
Hypothesis Test: ANOVA and Tukey's Test
ANOVA tests for an association between a quantitative variable and a non-binary categorical variable. Its null hypothesis is that all groups have the same population mean.
- Null: there is no significant difference in average weight between pitbulls, terriers and whippets.
- Alternative: at least one pair of these breeds differs significantly in average weight.
fstat, pval = f_oneway(pitbull_weight,
terrier_weight,
whippet_weight)
print('Pvalue: ' + str('{:.20f}'.format(pval)))
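To make the F statistic less of a black box, the sketch below computes it by hand and compares it with `f_oneway`. The three groups are made-up weights used only for illustration; they are not taken from `dog_data.csv`.

```python
import numpy as np
from scipy.stats import f_oneway

# Made-up weights in pounds, for illustration only
group_a = np.array([50.0, 52.0, 47.0, 55.0])
group_b = np.array([30.0, 33.0, 29.0, 35.0])
group_c = np.array([49.0, 51.0, 53.0, 46.0])
groups = [group_a, group_b, group_c]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)          # number of groups
n = len(all_values)      # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of between-group to within-group mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p_anova = f_oneway(*groups)
print(f_manual, f_scipy, p_anova)
```

A large F means the group means are spread out far more than the scatter within each group would explain, which is what drives the p-value down.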
Checking the significant differences
ANOVA only tells us that at least one pair of breeds differs. Tukey's range test shows which pairs.
tukey_results = pairwise_tukeyhsd(dogs_wtp.weight, dogs_wtp['breed'], 0.05)
print(tukey_results)
Visualize with boxplot
sns.boxplot(data=dogs_wtp, x='breed', y='weight')
plt.show()
Conclusion:
P-values less than 0.05 are significant. In the Tukey results, the reject column reads:
- True = reject the null hypothesis in favor of the alternative (significant)
- False = fail to reject the null hypothesis (not significant)
Results by pair:
- Pitbull and terrier: TRUE, significantly different. Pitbulls are heavier than terriers on average.
- Pitbull and whippet: FALSE, not significantly different. Pitbulls and whippets have similar average weights.
- Terrier and whippet: TRUE, significantly different. Terriers are lighter than whippets on average.
Further analysis
What if we remove or change the value of the single outlier in the terrier weights? Does the outlier have a significant effect on our results?
Let's see what the outcome is.
Two-Sample T-Test (terriers vs. whippets)
tstat, pval =ttest_ind(terrier_weight, whippet_weight)
print('Pvalue: ' + str('{:.20f}'.format(pval)))
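As a sketch of what `ttest_ind` computes by default (Student's t-test with a pooled variance), the block below reproduces the t statistic by hand. The two samples are made-up weights for illustration only. It also shows Welch's variant via `equal_var=False`, which does not assume equal variances in the two groups.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up weights in pounds, for illustration only
sample_a = np.array([28.0, 31.0, 30.0, 27.0, 33.0])
sample_b = np.array([35.0, 38.0, 36.0, 40.0, 37.0])

n_a, n_b = len(sample_a), len(sample_b)
# Pooled variance: weighted average of the two sample variances (ddof=1)
pooled_var = ((n_a - 1) * sample_a.var(ddof=1)
              + (n_b - 1) * sample_b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (sample_a.mean() - sample_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_student, p_student = ttest_ind(sample_a, sample_b)
# Welch's variant does not assume equal variances
t_welch, p_welch = ttest_ind(sample_a, sample_b, equal_var=False)
print(t_manual, t_student, p_student, p_welch)
```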
Looking at value_counts, there is a single observation with a weight under 10. A terrier weighing only 2 lbs is very unlikely, so I will replace this value with the terrier mean.
terrier_weight.value_counts(ascending=True).head(3)
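# Replace terrier weights of 3 lbs or less (the 2 lb outlier) with the mean terrier weight.
# Note: the mean is computed before the replacement, so it still includes the outlier.
terrier_weight = terrier_weight.where(terrier_weight > 3, terrier_weight.mean())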
# Verify using graph
# Revisiting our terrier weights distribution
plt.hist(terrier_weight, label='Terrier', density=True, alpha=.5, color='orange')
plt.legend()
plt.title('Terrier Weight')
tstat, pval =ttest_ind(terrier_weight, whippet_weight)
print('Pvalue: ' + str('{:.20f}'.format(pval)))
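The replacement step above fills the outlier with a mean that still includes the outlier itself. A possible variant, sketched here on made-up weights, computes the mean from the plausible values only. The names `weights`, `is_outlier` and `clean_mean`, and the 3 lb threshold, are assumptions for illustration.

```python
import pandas as pd

# Made-up terrier weights in pounds, with one implausible 2 lb entry
weights = pd.Series([28, 30, 2, 31, 29, 32], name='weight')

is_outlier = weights <= 3
# Mean of the plausible weights only, so the outlier does not pull it down
clean_mean = weights[~is_outlier].mean()
# Keep values where the condition holds, fill the rest with the clean mean
weights_fixed = weights.where(~is_outlier, clean_mean)

print(clean_mean)
print(weights_fixed.tolist())
```

With one outlier in a reasonably large sample the two approaches give nearly the same fill value, which fits the observation that the p-value barely changes.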
Further analysis conclusion:
Although we replaced the terrier outlier, it does not seem to have a meaningful impact on the p-value. I stand by the initial result that terriers and whippets are significantly different in weight.
III. Dog Colors
Poodle and Shihtzu Colors
FetchMaker wants to know if poodles and shihtzus come in different colors.
# Setting Variables
# Subset to just poodles and shihtzus
dogs_ps = dogs[dogs.breed.isin(['poodle', 'shihtzu'])]
dogs_poodle = dogs_ps[dogs_ps['breed']=='poodle']
dogs_shihtzu = dogs_ps[dogs_ps['breed']=='shihtzu']
dogs_ps.head(3)
Chi-Square Test
Used to test for an association between two categorical variables.
Contingency_table = pd.crosstab(dogs_ps.breed, dogs_ps.color)
Contingency_table
# Setting variables
# I copied the counts from Contingency_table by hand, which was easier than extracting them in code.
color = ['black', 'brown', 'gold', 'grey', 'white']
poodle = [17,13,8,52,10]
shihtzu = [10,36,6,41,7]
x = np.arange(len(color)) # the color locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots(figsize=(7, 5))
rects1 = ax.bar(x - width/2, poodle, width, label='Poodle')
rects2 = ax.bar(x + width/2, shihtzu, width, label='Shihtzu')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Color Counts')
ax.set_title('Poodle and Shihtzu Colors')
ax.set_xticks(x)
ax.set_xticklabels(color)
ax.legend()
def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3), # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)
fig.tight_layout()
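The color counts used for the bar chart were typed in by hand. As a sketch, they can also be read straight from a crosstab, which avoids copy errors if the data changes. The tiny `toy_dogs` frame below is made up for illustration and only mirrors the `breed` and `color` column names.

```python
import pandas as pd

# Tiny made-up data set with the same column names as the dog data
toy_dogs = pd.DataFrame({
    'breed': ['poodle', 'poodle', 'poodle', 'shihtzu', 'shihtzu', 'shihtzu'],
    'color': ['black', 'grey', 'grey', 'brown', 'brown', 'grey'],
})

toy_table = pd.crosstab(toy_dogs.breed, toy_dogs.color)

# Column labels become the color names; each row holds one breed's counts
color_names = toy_table.columns.tolist()
poodle_counts = toy_table.loc['poodle'].tolist()
shihtzu_counts = toy_table.loc['shihtzu'].tolist()
print(color_names, poodle_counts, shihtzu_counts)
```

Applied to the notebook, the same three lines on `Contingency_table` would produce the `color`, `poodle` and `shihtzu` lists.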
chi2, pval, dof, expected = chi2_contingency(Contingency_table)
print('Pvalue: ' + str('{:.10f}'.format(pval)))
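For the counts above, the chi-square statistic is about 14.7 with 4 degrees of freedom, which gives a p-value of about 0.005. That is below 0.05, so we reject the null hypothesis of no association between breed and color. Poodles and shihtzus come in the same set of colors, but the proportions differ significantly between the two breeds (for example, brown is much more common among shihtzus).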
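The chi-square test above can be checked by hand: compute the counts expected if breed and color were independent, then sum the squared deviations scaled by those expected counts. The sketch below does this for the poodle and shihtzu counts listed earlier and compares the result with `chi2_contingency`. Variable names such as `expected_manual` are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the poodle and shihtzu lists above
# (columns: black, brown, gold, grey, white)
observed = np.array([[17, 13, 8, 52, 10],
                     [10, 36, 6, 41, 7]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
# Expected count per cell if breed and color were independent
expected_manual = row_totals * col_totals / observed.sum()
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_scipy, pval_scipy, dof, expected_scipy = chi2_contingency(observed)
print(chi2_manual, chi2_scipy, dof, pval_scipy)
```

The largest contribution to the statistic comes from the brown column, where the observed counts (13 and 36) are furthest from the expected 24.5 each.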