Hypothesis Testing2

Familiar: A Study in Data Analysis

Familiar (Blood Transfusion Company)

Welcome to Familiar, a startup in the new market of blood transfusion!

Blood transfusion is the process of transferring blood products into a person’s circulation intravenously. Transfusions are used for various medical conditions to replace lost components of the blood.

Part I

Familiar's best package

  • The first thing we want to know is whether Familiar’s most basic package, the Vein Pack, actually has a significant impact on the subscribers. It would be a marketing goldmine if we can show that subscribers to the Vein Pack live longer than other people.

Part II

Life span

  • Compare the lifespan data between the different packages (Vein and Artery).

Part III

Side Effect

  • Analyze the side effects of the different packages (Vein and Artery).

Loading Data

In [41]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns

Data Inspection

In [42]:
lifespans = pd.read_csv('familiar_lifespan.csv')
print(lifespans.shape)
lifespans.sample(5)
(40, 2)
Out[42]:
      pack   lifespan
5   artery  74.117420
34    vein  76.532101
11    vein  77.484756
32  artery  73.044766
1   artery  76.404504
In [43]:
# checking outliers
plt.figure(figsize=(4,3))
sns.boxplot(x='pack', y='lifespan', data=lifespans)
Out[43]:
<AxesSubplot:xlabel='pack', ylabel='lifespan'>
In [48]:
# Check the value counts for the outliers
# Only one observation, under artery, has a lifespan of about 68; I will replace it with the artery mean.
lifespans.value_counts().sort_values(ascending=False).head(3)
Out[48]:
pack    lifespan 
artery  68.314898    1
        73.044766    1
        74.639757    1
dtype: int64
In [49]:
# Calculating group means
lifespans.groupby('pack').mean()
Out[49]:
         lifespan
pack
artery  74.873662
vein    76.169013
In [50]:
# Replace lifespans of 69 or below (only the single artery outlier) with the artery mean
lifespans['lifespan'] = lifespans['lifespan'].where(lifespans['lifespan'] > 69.0, 74.873662)
In [52]:
# Checking our NEW dataframe
plt.figure(figsize=(4,3))
sns.boxplot(x='pack', y='lifespan', data=lifespans)
Out[52]:
<AxesSubplot:xlabel='pack', ylabel='lifespan'>
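
The outlier step above uses a hand-picked cutoff (69) and a hard-coded mean. As a rough sketch of how the same idea could be automated, the block below flags outliers per pack with the 1.5 × IQR rule (the rule seaborn's boxplot whiskers use by default) and replaces them with the mean of their own pack. The `demo` DataFrame and the `lifespan_clean` column are made up for illustration; they are not the values from familiar_lifespan.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for familiar_lifespan.csv (not the real values)
demo = pd.DataFrame({
    'pack': ['artery'] * 5 + ['vein'] * 5,
    'lifespan': [68.3, 74.1, 75.0, 76.4, 73.0, 76.5, 77.5, 75.2, 76.0, 77.1],
})

grouped = demo.groupby('pack')['lifespan']

# Per-pack quartiles, broadcast back so every row carries its own pack's value
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# 1.5 * IQR rule: anything beyond the fences is flagged as an outlier
is_outlier = (demo['lifespan'] < q1 - 1.5 * iqr) | (demo['lifespan'] > q3 + 1.5 * iqr)

# Replace each flagged value with the mean of its own pack
# (the mean still includes the outlier itself, as in the notebook cell above)
group_mean = grouped.transform('mean')
demo['lifespan_clean'] = demo['lifespan'].where(~is_outlier, group_mean)

print(demo[is_outlier])
```

Computing the fences and the replacement value from the data avoids retyping constants if the file changes, but it is only one possible choice; dropping the row or leaving it in would be equally defensible for a single observation.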

I. Familiar’s best package

We’d like to find out if the average lifespan of subscribers to Familiar’s best seller, the Vein Pack, is significantly different from the average life expectancy of 73 years.

Hypothesis Testing (One sample T-test)

Comparing a sample average to a hypothetical population average

  • Null: The average lifespan of a Vein Pack subscriber is 73 years.
  • Alternative: The average lifespan of a Vein Pack subscriber is NOT 73 years.
In [58]:
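# Lifespans of Vein Pack subscribers only
vein_pack_lifespans = lifespans.lifespan[lifespans.pack == 'vein']
tstat, pval = ttest_1samp(vein_pack_lifespans, 73)
print('Pvalue: ' + str('{:.10f}'.format(pval)))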
Pvalue: 0.0000005972
In [81]:
plt.figure(figsize=(5,3.5))
sns.histplot(vein_pack_lifespans, kde= True)
# plt.hist(np.array(vein_pack_lifespans.lifespan))
plt.axvline(73, color = 'g', label='Expected Mean', linestyle ='--')
plt.axvline(vein_pack_lifespans.mean(), color = 'r', label='Observed Mean',linestyle ='--')
plt.legend(loc=0)
plt.show()

Conclusion:

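Reject the null hypothesis. The p-value is far below 0.05 and the observed mean (about 76.2 years) lies above 73, so Vein Pack subscribers live longer on average than the 73-year life expectancy.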
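
To make the one-sample t-test less of a black box, here is a small sketch that computes the t statistic and p-value by hand and compares them with `ttest_1samp`. The sample is synthetic (drawn with an assumed mean of 76 and standard deviation of 2.5), so the printed numbers will not match the notebook output above. It also shows a one-sided p-value, which is the quantity that corresponds to a "lives longer" claim.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Synthetic stand-in for the Vein Pack lifespans (assumed parameters, not real data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=76.0, scale=2.5, size=20)
mu0 = 73  # hypothesized population mean

n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)  # sample std uses n - 1
t_manual = (sample.mean() - mu0) / standard_error

# Two-sided p-value: probability of a |t| at least this large under the null
p_two_sided = 2 * t.sf(abs(t_manual), df=n - 1)

# One-sided p-value for the alternative "mean is greater than mu0"
p_greater = t.sf(t_manual, df=n - 1)

tstat, pval = ttest_1samp(sample, mu0)
print(t_manual, tstat)    # the two t statistics agree
print(p_two_sided, pval)  # so do the two-sided p-values
print(p_greater)
```

When the sample mean is above the hypothesized value, the one-sided p-value is half the two-sided one, so a two-sided rejection with a higher observed mean also supports the directional conclusion.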

II. Life span

Pumping Life Into The Company

We’d like to find out if the average lifespan of a Vein Pack subscriber is significantly different from the average life expectancy for the Artery Pack.

In order to differentiate Familiar’s different product lines, we’d like to compare this lifespan data between our different packages. Our next step up from the Vein Pack is the Artery Pack.

Hypothesis Testing

Two sample T-test

Tests for an association between a binary (two-level) categorical variable and a quantitative variable.

  • Null: The average lifespan of a Vein Pack subscriber is equal to the average lifespan of an Artery Pack subscriber.
  • Alternative: The average lifespan of a Vein Pack subscriber is NOT equal to the average lifespan of an Artery Pack subscriber.
In [83]:
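# Lifespans of Artery Pack subscribers only
artery_pack_lifespans = lifespans.lifespan[lifespans.pack == 'artery']
# Check whether the standard deviations are roughly equal:
# a ratio between 0.9 and 1.1 would be comfortably close to 1.
# Here the ratio is about 1.22, so the equal-variance assumption only holds approximately.
ratio = np.std(vein_pack_lifespans) / np.std(artery_pack_lifespans)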
ratio
Out[83]:
1.2193072088987222
In [84]:
tstat, pval = ttest_ind(vein_pack_lifespans, artery_pack_lifespans)
print('P-value: ' + str(pval))
P-value: 0.09164417205559142
In [93]:
plt.figure(figsize=(6,4))
plt.hist(vein_pack_lifespans,   alpha=.5, label='Vein Package', density=True)
plt.hist(artery_pack_lifespans, alpha=.5, label='Artery Package', density=True)

plt.title('Vein vs. Artery lifespan', fontsize=20)
plt.xlabel('Lifespan in years', fontsize=15)
# density=True normalizes the histograms, so the y-axis shows density, not counts
plt.ylabel('Density', fontsize=15)

plt.axvline(np.mean(vein_pack_lifespans), color = 'b', label='Vein Package Mean', linestyle ='--')
plt.axvline(np.mean(artery_pack_lifespans), color = 'orange', label='Artery Package Mean', linestyle ='--')

plt.legend(fontsize='x-small')
plt.show()

Conclusion:

The p-value of 0.0916 is larger than 0.05, so I fail to reject the null hypothesis. The average lifespan of Vein Pack subscribers is not significantly different from that of Artery Pack subscribers, although the Vein Pack sample has a slightly higher mean lifespan.
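
Because the standard deviation ratio above falls outside the 0.9 to 1.1 range, one reasonable sanity check is Welch's t-test, which does not assume equal variances. This is a sketch on synthetic data (the means, spreads, and group sizes are assumptions, not the real lifespans), using the `equal_var=False` option of `ttest_ind`.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two groups (assumed parameters, not real data)
rng = np.random.default_rng(1)
vein_demo = rng.normal(loc=76.0, scale=2.4, size=20)
artery_demo = rng.normal(loc=75.0, scale=2.0, size=20)

# Same ratio check as in the notebook cell above
ratio = np.std(vein_demo) / np.std(artery_demo)

# Student's t-test pools the two variances (the ttest_ind default)
t_student, p_student = ttest_ind(vein_demo, artery_demo)

# Welch's t-test keeps the variances separate and adjusts the degrees of freedom
t_welch, p_welch = ttest_ind(vein_demo, artery_demo, equal_var=False)

print(ratio)
print(p_student, p_welch)
```

With equal group sizes the two tests produce the same t statistic and differ only in degrees of freedom, so the Welch p-value is never smaller than the pooled one. If both lead to the same decision, the conclusion does not hinge on the equal-variance assumption.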

III. Side Effects:

A Familiar Problem

Familiar wants to be able to advise potential subscribers about possible side effects of these packs and whether they differ for the Vein vs. the Artery pack.

In [94]:
iron = pd.read_csv('familiar_iron.csv')
In [95]:
# Data Checking
iron.iron.unique()
Out[95]:
array(['low', 'normal', 'high'], dtype=object)
In [96]:
# Data Checking
iron.dtypes
Out[96]:
pack    object
iron    object
dtype: object
In [97]:
# Convert the iron variable to an ordered categorical type
iron.iron = pd.Categorical(iron.iron,['low','normal','high'], ordered=True)
iron.dtypes
Out[97]:
pack      object
iron    category
dtype: object
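
For readers unfamiliar with ordered categoricals, this short sketch shows what the conversion buys. The `levels` series is a made-up example, not the contents of familiar_iron.csv.

```python
import pandas as pd

# Hypothetical iron levels in arbitrary order
levels = pd.Series(pd.Categorical(
    ['low', 'high', 'normal', 'low'],
    categories=['low', 'normal', 'high'],
    ordered=True,
))

# Sorting follows the declared order, not alphabetical order
sorted_levels = levels.sort_values().tolist()

# Comparisons against a category are allowed because the type is ordered
above_low = (levels > 'low').tolist()

# min() and max() are defined for ordered categoricals
highest = levels.max()

print(sorted_levels)
print(above_low)
print(highest)
```

The chi-square test itself treats the categories as unordered, so the main practical effect here is that tables and plots list the levels as low, normal, high rather than alphabetically.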

Hypothesis Testing

Chi square test

Two Categorical Variables

  • Null: There is NO association between the pack (Vein vs. Artery) and iron level.
  • Alternative: There is an association between the pack (Vein vs. Artery) and iron level.

Checking the association between the pack that a subscriber gets (Vein vs. Artery) and their iron level.

In [98]:
iron.head(2)
Out[98]:
     pack    iron
0    vein     low
1  artery  normal
In [99]:
Contingency_table = pd.crosstab(iron.pack, iron.iron)
Contingency_table
Out[99]:
iron    low  normal  high
pack
artery   29      29    87
vein    140      40    20
In [100]:
chi2, pval, dof, expected = chi2_contingency(Contingency_table)
print('P-value: ' + str('{:.30f}'.format(pval)))
P-value: 0.000000000000000000000000935975

Conclusion

The p-value is very low, so I reject the null hypothesis. There is a significant association between pack and iron level: the distribution of iron levels differs between Vein Pack and Artery Pack subscribers.
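
To see where that tiny p-value comes from, the sketch below rebuilds the expected counts under independence from the contingency table shown earlier and compares them with the observed counts. The counts are copied from that table; the variable names (`observed`, `expected_manual`, `difference`) are only illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the contingency table (rows: artery, vein; columns: low, normal, high)
observed = np.array([[29, 29, 87],
                     [140, 40, 20]])

chi2, pval, dof, expected = chi2_contingency(observed)

# Expected count for each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected_manual = row_totals * col_totals / n

# Positive values mean more subscribers than independence would predict
difference = observed - expected

print(np.round(expected, 1))
print(np.round(difference, 1))
print(dof, pval)
```

In this table the Artery Pack has many more "high" iron subscribers than independence would predict and the Vein Pack has many more "low" ones, while the "normal" column matches its expected counts. Those two cells are what drive the result.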
