FetchMaker¶

FetchMaker is the hottest new tech startup. Its mission is to match prospective dog owners with their perfect pet.

FetchMaker estimates (based on historical data for all dogs) that 8% of dogs in their system are rescues.

Part I.¶

Rescued Whippets

Test whether the share of rescues among whippets (a dog breed) is consistent with the 8% rescue rate estimated for all dogs.

Part II¶

Dog Sizes

Mid-Sized Dog Weights. Is there a significant difference in the average weights of three mid-sized breeds (whippets, terriers, and pitbulls)?

Part III¶

Dog Colors

Poodle and Shihtzu Colors. Do poodles and shihtzus come in different colors?

Details:

weight: an integer representing how heavy a dog is in pounds
tail_length: a float representing tail length in inches
age: in years
color: a string such as “brown” or “grey”
is_rescue: a binary flag, 1 if the dog is a rescue and 0 otherwise

import numpy as np
import pandas as pd
from scipy.stats import binom_test, ttest_ind, f_oneway, chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.multicomp import pairwise_tukeyhsd

dogs = pd.read_csv('dog_data.csv')
dogs.head(2)

I. Rescued whippets. ¶

Based on historical data for all dogs, 8% of dogs are rescues.

How many whippets are there in total?
How many whippets are rescued?

print('Total dogs :' + str(len(dogs)))
print('Total rescued dogs :' + str(np.sum(dogs['is_rescue']==1)))

# Setting Variables
whippets = dogs[dogs['breed'] == 'whippet']
rescue_whippet = np.sum(whippets['is_rescue'] == 1)
print('Total whippets: ' + str(len(whippets)))
print('Total rescue whippets: ' + str(rescue_whippet))

Total dogs :800
Total rescued dogs :73
Total whippets: 100
Total rescue whippets: 6

Hypothesis Test Binomial¶

Used for binary categorical data, to compare a sample frequency to an expected population-level probability.

In the data, 6% of whippets (6 of 100) are rescues. Let’s test whether that is consistent with an 8% rescue rate among whippets.

Null: 8% of whippets are rescues (the observed 6% differs from 8% only by chance).
Alternative: more or less than 8% of whippets are rescues.

Using binom_test function¶

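# binom_test computes an exact two-sided p-value: the probability, if the true rescue rate is 8%,
# of a rescue count at least as unlikely as the one observed among the whippets.
# (SciPy 1.12 and later removed binom_test; use binomtest(...).pvalue there.)
pval = binom_test(rescue_whippet, len(whippets), .08)
print(pval)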

0.5811780106238105

The p-value (about 0.58) is well above 0.05, so we fail to reject the null hypothesis: the observed rescue rate among whippets (6%) is not significantly different from 8%.
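To see what the exact test computes, the two-sided p-value can be reproduced with the standard library alone. This is a minimal sketch, not SciPy's actual implementation: the helper names binom_pmf and two_sided_binom_pvalue are made up for illustration, and it assumes the two-sided p-value is defined as the total probability of all outcomes that are no more likely than the observed one.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_binom_pvalue(observed, n, p):
    """Sum the probabilities of every outcome that is no more likely than the observed one."""
    p_observed = binom_pmf(observed, n, p)
    # Small relative tolerance so floating-point ties count as equally likely
    threshold = p_observed * (1 + 1e-7)
    return sum(binom_pmf(k, n, p) for k in range(n + 1) if binom_pmf(k, n, p) <= threshold)

# 6 rescued whippets out of 100, tested against an 8% rescue rate
pval_exact = two_sided_binom_pvalue(6, 100, 0.08)
print(round(pval_exact, 4))  # approximately 0.5812
```

The result agrees with the binom_test output above, which suggests the library call is an exact calculation rather than a simulation.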

Manual analysis¶

Creating a simulation to test our hypothesis.

# Simulate a sample of 100 whippets in which each dog has an 8% chance of being a rescue ('y'),
# count the rescues, and repeat 300 times. The counts approximate the null distribution.
null_outcomes = []

# np.random.choice draws 'y' (rescue) or 'n' (non-rescue) with probabilities 0.08 and 0.92;
# size=100 matches the total number of whippets
for i in range(300):
    simulated_whippets = np.random.choice(['y', 'n'], size=100, p=[0.08, 1-.08])
    num_rescued = np.sum(simulated_whippets == 'y')
    null_outcomes.append(num_rescued)

# Showing the first 10 results
null_outcomes[0:10]

[11, 11, 8, 9, 8, 13, 7, 6, 8, 4]
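The same kind of simulation can also be written as a single vectorized draw. This is a sketch under a few assumptions: it uses NumPy's Generator API, the seed 42 and the 10,000 repetitions are arbitrary choices for illustration, and null_counts is a new name so that it does not overwrite null_outcomes.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

# Each draw is the number of rescues in a simulated sample of 100 whippets
# when every whippet has an 8% chance of being a rescue
null_counts = rng.binomial(n=100, p=0.08, size=10_000)

print(null_counts[:10])
print(null_counts.mean())  # should be close to the expected count of 8
```

More repetitions give a smoother null distribution, so percentiles and p-values estimated from it vary less between runs.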

plt.figure(figsize=(7, 5.5))
plt.hist(null_outcomes)
plt.axvline(rescue_whippet, color='r', linestyle='--', label='observed rescue whippets')
plt.axvline(len(whippets) * .08, color='g', linestyle='--', label='expected rescue whippets')

# Mark the bounds of the middle 95% of the simulated null distribution
plt.axvline(np.percentile(null_outcomes, 2.5), color='purple', linestyle='--', label='2.5th percentile')
plt.axvline(np.percentile(null_outcomes, 97.5), color='purple', linestyle='-', label='97.5th percentile')

plt.legend(fontsize='small')
plt.show()

Confidence Interval (95%)¶

Under the null hypothesis, the simulated number of rescued whippets falls between 3 and 13 in 95% of samples.

# The 2.5th and 97.5th percentiles cut 2.5% from each tail of the null distribution, leaving the middle 95%
np.percentile(null_outcomes, [2.5,97.5])

array([ 3., 13.])

If 8% of whippets are rescues, 95% of simulated samples of 100 whippets contain between 3 and 13 rescues. The observed count of 6 falls inside this interval, so at the 5% significance level we fail to reject the null hypothesis: the observed number of rescued whippets is not significantly different from what an 8% rescue rate would produce.
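Because simulated percentiles change from run to run, it can help to compare them with the exact binomial distribution. The sketch below is an illustration, not part of the original analysis; the helper binom_quantile is a hypothetical name, and it returns the smallest count whose cumulative probability reaches the requested level.

```python
from math import comb

def binom_quantile(q, n, p):
    """Smallest k such that P(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p)**(n - k)
        if cumulative >= q:
            return k
    return n

lower = binom_quantile(0.025, 100, 0.08)
upper = binom_quantile(0.975, 100, 0.08)
print(lower, upper)  # 3 14
print(lower <= 6 <= upper)  # True: the observed count lies inside the range
```

The exact bounds sit within one count of the simulated percentiles, which is the kind of gap to expect from only 300 simulated trials.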

P-value (two sided)¶

# Convert to an array so the comparisons below work element-wise
null_outcomes = np.array(null_outcomes)
# The expected count is 8 and the observed count is 6, a difference of 2.
# The equally extreme value on the right side of the null distribution is 8 + 2 = 10.
p_value_twoside = np.sum((null_outcomes <= 6) | (null_outcomes >= 10)) / len(null_outcomes)
p_value_twoside

0.5466666666666666

The simulated p-value (about 0.55) is close to the one from the binom_test function (about 0.58). Both are well above 0.05, so the data are consistent with 8% of whippets being rescues.
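The two-sided calculation can be wrapped in a small function so that the cutoffs are derived from the observed and expected counts instead of being typed in by hand. This is a sketch: simulated_two_sided_pvalue is a hypothetical helper name, and toy_outcomes is a made-up list used only to show the mechanics.

```python
import numpy as np

def simulated_two_sided_pvalue(null_outcomes, observed, expected):
    """Share of simulated counts at least as far from `expected` as `observed` is."""
    null_outcomes = np.asarray(null_outcomes)
    distance = abs(observed - expected)
    extreme = (null_outcomes <= expected - distance) | (null_outcomes >= expected + distance)
    return extreme.mean()

# Tiny made-up list of simulated counts, only to show the mechanics
toy_outcomes = [4, 6, 8, 10, 12, 8, 7, 9]
print(simulated_two_sided_pvalue(toy_outcomes, observed=6, expected=8))  # 0.5
```

With the notebook's own variables, the call would presumably be simulated_two_sided_pvalue(null_outcomes, rescue_whippet, len(whippets) * 0.08).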

Visualizing our Null Outcomes over 300 loop trials¶

plt.figure(figsize=(15, 7))
plt.plot(range(300), null_outcomes, label='simulated rescued whippets per trial')
plt.axhline(len(whippets) * 0.08, color='r', linestyle='--', label='8% (expected)')
plt.axhline(rescue_whippet, color='purple', linestyle='--', label='observed rescued whippets')

plt.xlabel('Trial')
plt.ylabel('Number of rescued whippets')

plt.legend()


II. Dog Sizes ¶

Mid-Sized Dog Weights (whippets, terriers, and pitbulls)

Is there a significant difference in the average weights of these three dog breeds?

# Subset to just whippets, terriers, and pitbulls
dogs_wtp = dogs[dogs.breed.isin(['whippet', 'terrier', 'pitbull'])]
dogs_wtp.head(2)

Checking the weights distribution¶

The weights of all three breeds look roughly normally distributed. There is a single outlier in the terrier breed, but the distribution is not skewed heavily enough to distort our analysis.

# Setting variables
pitbull_weight = dogs_wtp.weight[dogs_wtp['breed'] == 'pitbull']
terrier_weight = dogs_wtp.weight[dogs_wtp['breed']  == 'terrier']
whippet_weight = dogs_wtp.weight[dogs_wtp['breed']  == 'whippet']

plt.figure(figsize=(6,5))
plt.hist(pitbull_weight, label='Pitbull', density=True, alpha=.5 )
plt.hist(terrier_weight, label='Terrier', density=True, alpha=.5)
plt.hist(whippet_weight, label='Whippet', density=True, alpha=.5)
plt.legend()


Hypothesis Test: ANOVA and Tukey’s Test¶

Used to test for an association between a quantitative variable and a non-binary categorical variable. ANOVA tests the null hypothesis that all groups have the same population mean.

Null: pitbulls, terriers, and whippets all have the same average weight.
Alternative: at least one of the three breeds has a different average weight.

fstat, pval = f_oneway(pitbull_weight,
                       terrier_weight,
                       whippet_weight)
print('Pvalue: ' + str('{:.20f}'.format(pval)))

Pvalue: 0.00000000000000003276

Checking the significant differences.¶

tukey_results = pairwise_tukeyhsd(dogs_wtp.weight, dogs_wtp['breed'], alpha=0.05)
print(tukey_results)

  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
=======================================================
 group1  group2 meandiff p-adj   lower    upper  reject
-------------------------------------------------------
pitbull terrier   -13.24    0.0 -16.7278 -9.7522   True
pitbull whippet    -3.34 0.0638  -6.8278  0.1478  False
terrier whippet      9.9    0.0   6.4122 13.3878   True
-------------------------------------------------------

Visualize with boxplot¶

sns.boxplot(data=dogs_wtp, x='breed', y='weight')
plt.show()

Conclusion:¶

P-values less than 0.05 are significant.

True = Reject the null hypothesis in favor of the alternative (significant)

False = Fail to reject the null hypothesis (not significant)

Pitbull and terrier: TRUE, significantly different. Terriers are on average 13.24 lbs lighter than pitbulls.
Pitbull and whippet: FALSE, not significantly different (p-adj = 0.0638). Pitbulls and whippets have similar average weights.
Terrier and whippet: TRUE, significantly different. Terriers are on average 9.9 lbs lighter than whippets.

Further analysis¶

What if we replace the single outlier weight in the terrier breed? Does the outlier have a significant effect on our results?

Let’s see what the outcome is.

Two-Sample T-Test¶

tstat, pval = ttest_ind(terrier_weight, whippet_weight)
print('Pvalue: ' + str('{:.20f}'.format(pval)))

Pvalue: 0.00000000116992867788

Looking at value_counts, there is a single observation with a weight under 10 lbs. A terrier that weighs 2 lbs is very unlikely, so I will replace this value with the terrier mean.

terrier_weight.value_counts(ascending=True).head(3)

45    1
2     1
41    1
Name: weight, dtype: int64

# Replace terrier weights of 3 lbs or less (here, the single 2 lb value) with the mean terrier weight
terrier_weight = terrier_weight.where(terrier_weight > 3, terrier_weight.mean())
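One detail of this replacement is that terrier_weight.mean() is computed while the outlier is still in the data, so the outlier pulls the replacement value down slightly. A possible variant computes the mean from the plausible values only. This is a sketch on made-up numbers: toy_weights, clean_mean, and toy_fixed are illustrative names, and the 3 lb cutoff simply mirrors the threshold used above.

```python
import pandas as pd

# Made-up weights with one implausible value (2 lbs); not the FetchMaker data
toy_weights = pd.Series([2.0, 30.0, 33.0])

# Mean of the plausible values only, so the outlier does not pull it down
clean_mean = toy_weights[toy_weights > 3].mean()

# Keep values above 3 lbs and replace everything else with the clean mean
toy_fixed = toy_weights.where(toy_weights > 3, clean_mean)
print(toy_fixed.tolist())  # [31.5, 30.0, 33.0]
```

With a single outlier among many observations the two approaches give nearly the same replacement value, so the choice is unlikely to change the test result.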

# Verify using graph
# Revisiting our terrier weights distribution
plt.hist(terrier_weight, label='Terrier', density=True, alpha=.5, color='orange')
plt.legend()
plt.title('Terrier Weight')


tstat, pval = ttest_ind(terrier_weight, whippet_weight)
print('Pvalue: ' + str('{:.20f}'.format(pval)))

Pvalue: 0.00000000173914212588

Further analysis conclusion:¶

Even after replacing the terrier outlier, the p-value barely changes. I stand by the initial result that terriers and whippets are significantly different in terms of weight.

III. Dog Colors ¶

Poodle and Shihtzu Colors

FetchMaker wants to know if poodles and shihtzus come in different colors.

# Setting Variables

# Subset to just poodles and shihtzus
dogs_ps = dogs[dogs.breed.isin(['poodle', 'shihtzu'])]

dogs_poodle = dogs_ps[dogs_ps['breed']=='poodle']
dogs_shihtzu = dogs_ps[dogs_ps['breed']=='shihtzu']

dogs_ps.head(3)

Chi Square test¶

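Used to test for an association between two categorical variables (here, breed and color).

Null: there is no association between breed and color.
Alternative: there is an association between breed and color.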

Contingency_table = pd.crosstab(dogs_ps.breed, dogs_ps.color)
Contingency_table

# Setting variables
# I copied these counts from Contingency_table, which is easier than extracting them in code.
color = ['black', 'brown', 'gold', 'grey', 'white']
poodle = [17,13,8,52,10]  
shihtzu = [10,36,6,41,7]
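For reference, the same lists can be pulled out of a crosstab in code, which avoids copy errors if the data changes. This is a sketch on a tiny made-up frame: toy_dogs and toy_table are illustrative stand-ins for dogs_ps and Contingency_table.

```python
import pandas as pd

# Tiny made-up sample standing in for dogs_ps (not the FetchMaker data)
toy_dogs = pd.DataFrame({
    'breed': ['poodle', 'poodle', 'shihtzu', 'shihtzu', 'shihtzu'],
    'color': ['black', 'grey', 'black', 'black', 'grey'],
})
toy_table = pd.crosstab(toy_dogs['breed'], toy_dogs['color'])

# Column labels and per-breed counts, in the same column order as the table
color = toy_table.columns.tolist()
poodle = toy_table.loc['poodle'].tolist()
shihtzu = toy_table.loc['shihtzu'].tolist()
print(color, poodle, shihtzu)  # ['black', 'grey'] [1, 1] [2, 1]
```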

# Grouped bar chart of color counts per breed
x = np.arange(len(color))  # the color locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(7, 5))
rects1 = ax.bar(x - width/2, poodle, width, label='Poodle')
rects2 = ax.bar(x + width/2, shihtzu, width, label='Shihtzu')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Color Counts')
ax.set_title('Poodle and Shihtzu Colors')
ax.set_xticks(x)
ax.set_xticklabels(color)
ax.legend()

def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
fig.tight_layout()

chi2, pval, dof, expected = chi2_contingency(Contingency_table)
print('Pvalue: ' + str('{:.10f}'.format(pval)))

Pvalue: 0.0053024083

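The p-value (about 0.005) is below 0.05, so we reject the null hypothesis: there is a significant association between breed and color, meaning poodles and shihtzus differ in their color distributions.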
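The chi-square statistic can be checked by hand from the counts in the contingency table. This is a minimal sketch using only the standard library, not a substitute for chi2_contingency: the p-value line relies on a closed-form expression that is valid only for exactly 4 degrees of freedom, which happens to be the case for this 2 x 5 table.

```python
from math import exp

# Observed counts from the contingency table (black, brown, gold, grey, white)
poodle = [17, 13, 8, 52, 10]
shihtzu = [10, 36, 6, 41, 7]
observed = [poodle, shihtzu]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (count - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)  # (2 - 1) * (5 - 1) = 4

# Closed-form upper-tail probability of a chi-square distribution with exactly 4 degrees of freedom
p_value = exp(-chi2_stat / 2) * (1 + chi2_stat / 2)
print(round(chi2_stat, 2), dof, round(p_value, 4))  # 14.73 4 0.0053
```

Most of the statistic comes from the brown column (13 poodles against 36 shihtzus), so that color is the main driver of the difference.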

Output of dogs.head(2):

     is_rescue  weight  tail_length  age  color  likes_children  is_hypoallergenic  name     breed
0    0          6       2.25         2    black  1               0                  Huey     chihuahua
1    0          4       5.36         4    black  0               0                  Cherish  chihuahua

Output of dogs_wtp.head(2):

     is_rescue  weight  tail_length  age  color  likes_children  is_hypoallergenic  name     breed
200  0          71      5.74         4    black  0               0                  Charlot  pitbull
201  0          26      11.56        3    black  0               0                  Jud      pitbull
