Two sample comparisons in Python
Required packages
The required packages for this section are pandas, scipy and statsmodels. These can be installed with the following command in the command window (Windows) / terminal (Mac).
pip install pandas statsmodels scipy
Loading the data
We will again use the demo lipidomics dataset:
Load the dataset into a Pandas DataFrame named df as described in the basic plotting section:
import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx")
df.set_index("Sample Name", inplace=True)
Test for normality
Statsmodels contains a Lilliefors’ test for normality testing, which is a Kolmogorov-Smirnov test with estimated parameters. The function can be used as follows, for example on species PC 34:1 for N and T labelled samples:
from statsmodels.stats.diagnostic import kstest_normal
lipid = "PC 34:1"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]
print(kstest_normal(group1, "norm"))
print(kstest_normal(group2, "norm"))
The function returns 2 values: Kolmogorov-Smirnov test statistic and a pvalue; if the pvalue is lower than some threshold, e.g. 0.05, then we can reject the Null hypothesis that the sample comes from a normal distribution.
group1 result: (0.08189149328478457, 0.08119506735189941)
group2 result: (0.08643554971567619, 0.078397863558115)
Since the p-values is >0.05 for both groups, the distributions do not deviate significantly from the normal distribution and we assume normality.
T-Test
We'll perform a T-test between the T and N labelled samples for species PC 34:1. The ttest_ind function from the statsmodels library takes as the first 2 arguments arrays containing the values of the 2 groups, and additionally can be configured for one- or two-sided testing with the alternative parameter. The usevar parameter can be set to "pooled" for equal variance or "unequal" for unequal variance.
from statsmodels.stats.weightstats import ttest_ind
lipid = "PC 34:1"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]
ttest_ind(group1, group2, alternative='two-sided', usevar='pooled')
The function returns the following result: (test statistic, pvalue of the t-test, degrees of freedom used in the t-test):
(2.0056017821278194, 0.04621995094509444, 204.0)
Since p<0.05, there is a significant difference between the T and N groups for PC 34:1.
T-Test with correction for multiple testing
To correct for multiple testing, we'll start by calculating the t-test individually for all lipid species by putting the previous code in a for loop, and we'll store the results in a Series object:
from statsmodels.stats.multitest import multipletests
pvalues = pd.Series()
for lipid in df.columns[1:]:
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]
_, pvalue, _ = ttest_ind(group1, group2, alternative='two-sided', usevar='pooled')
pvalues[lipid] = pvalue
Next we'll use the multipletests function from statsmodels to adjust the pvalues for multiple testing, and we'll print out all the species with a pvalue < 0.05. The multipletests functions return 4 different result variables, and we are only interested in the corrected p-values, which is returned in the second position. In the other positions we'll type an underscore to ignore these returned results (which is in the first position a boolean that states if the hypothesis can be rejected for the given alpha, and the variables in position 3 and 4 contain a corrected alpha according to Sidak and Bonferroni respectively):
_, corr_pvalues, _, _ = multipletests(
pvalues,
alpha=0.05,
method='holm-sidak'
)
corr_pvalues = pd.Series(corr_pvalues, index=pvalues.index)
print(corr_pvalues[corr_pvalues < 0.05])
result:
CE 16:0 3.009264e-03
CE 18:3 8.066735e-06
CE 18:2 1.338064e-05
CE 20:4 1.812175e-02
CE 20:3 2.942459e-02
TG 50:4 1.275879e-02
TG 52:6 1.766055e-04
TG 52:5 2.988285e-02
TG 52:2 1.376055e-02
TG 54:8 5.144980e-04
TG 54:7 1.413293e-02
TG 56:2 2.978565e-02
DG 32:2 4.213467e-02
DG 34:3 1.789583e-02
DG 38:7 1.112511e-03
Cer 34:1;O2 4.749118e-04
Cer 36:1;O2 3.444237e-05
Cer 39:1;O2 8.102510e-06
Cer 41:1;O2 1.021419e-10
Cer 42:1;O2 2.487048e-11
PC 32:2 9.507684e-07
PC O-34:3/P-34:2 4.444021e-13
PC O-34:2/P-34:1 3.984211e-12
PC 34:2 1.713111e-09
PC O-36:5/P-36:4 5.144980e-04
PC O-36:4/P-36:3 7.080177e-05
PC O-36:3/P-36:2 2.098197e-12
PC 36:5 8.592539e-06
PC 36:3 3.099436e-03
PC 36:2 7.159109e-07
PC O-38:6/P-38:5 1.072691e-06
PC 37:2 1.808452e-12
PC 38:5 2.103493e-05
LPC 14:0 1.575674e-08
LPC 16:0 1.815249e-06
LPC 18:2 2.097741e-16
LPC 18:1 1.181612e-02
LPC 18:0 1.552810e-11
LPC 20:4 6.959002e-03
SM 32:1;O2 5.421950e-07
SM 38:2;O2 1.098073e-02
SM 38:1;O2 5.158079e-08
SM 39:1;O2 4.126438e-18
SM 40:2;O2 1.713111e-09
SM 40:1;O2 6.855879e-11
SM 41:2;O2 1.290760e-10
SM 41:1;O2 1.080575e-19
SM 42:1;O2 1.130474e-15
Different methods can be selected for the correction, which are listed in the statsmodels documentation.
Mann-Whitney
In case the data is not normally distributed, the non parametric Mann-Whitney test is used instead of the T-test.
from scipy.stats import mannwhitneyu
lipid = "PC 36:5"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]
mannwhitneyu(group1, group2)
result:
MannwhitneyuResult(statistic=2467.0, pvalue=4.0874221736427684e-11)
And to get the Mann-Whitney test on all species with correction for multiple testing:
from statsmodels.stats.multitest import multipletests
from scipy.stats import mannwhitneyu
pvalues = pd.Series()
for lipid in df.columns[1:]:
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]
_, pvalue = mannwhitneyu(group1, group2)
pvalues[lipid] = pvalue
_, corr_pvalues, _, _ = multipletests(
pvalues,
alpha=0.05,
method='holm-sidak'
)
corr_pvalues = pd.Series(corr_pvalues, index=pvalues.index)
print(corr_pvalues[corr_pvalues < 0.05])
Last updated