This section requires the pandas, scipy and statsmodels packages. Install them by running the following command in the command window (Windows) or terminal (Mac):
pip install pandas statsmodels scipy
Loading the data
We will again use the demo lipidomics dataset.
Load the dataset into a pandas DataFrame named df, as described in the basic plotting section:
import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx")
df.set_index("Sample Name", inplace=True)
Test for normality
The statsmodels library provides Lilliefors’ test for normality, which is a Kolmogorov-Smirnov test in which the parameters of the reference distribution are estimated from the sample. As an example, we apply it to species PC 34:1, separately for the N and T labelled samples.
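A minimal sketch of such a call is given here, using `lilliefors` from `statsmodels.stats.diagnostic`. The column name "Label" and the synthetic values are assumptions made so the snippet runs without the Excel file; with the real data, skip the DataFrame construction and use the df loaded earlier.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.diagnostic import lilliefors

# Synthetic stand-in for the lipidomics table, so the sketch runs without the
# Excel file. ASSUMPTION: the real df has a "Label" column with "N"/"T" values
# and one numeric column per lipid species.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Label": ["N"] * 50 + ["T"] * 50,
    "PC 34:1": np.concatenate([rng.normal(10, 2, 50), rng.normal(11, 2, 50)]),
})

normality = {}
for label in ["N", "T"]:
    # Select the PC 34:1 values of one group as a NumPy array
    values = df.loc[df["Label"] == label, "PC 34:1"].to_numpy()
    # Lilliefors' test against a normal distribution
    ks_stat, p_value = lilliefors(values, dist="norm")
    normality[label] = (ks_stat, p_value)
    print(f"{label}: KS statistic = {ks_stat:.4f}, p-value = {p_value:.4f}")
```

Because the values are randomly generated, the printed numbers will not match those obtained on the real dataset.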
The function returns two values: the Kolmogorov-Smirnov test statistic and a p-value. If the p-value is below a chosen threshold, e.g. 0.05, we reject the null hypothesis that the sample comes from a normal distribution.
Since the p-value is > 0.05 for both groups, we cannot reject the null hypothesis: neither distribution deviates significantly from the normal distribution, so we proceed under the assumption of normality.
T-Test
We'll perform a t-test between the T and N labelled samples for species PC 34:1. The ttest_ind function from the statsmodels library takes arrays containing the values of the two groups as its first two arguments. The alternative parameter selects one- or two-sided testing ("two-sided", "larger" or "smaller"), and the usevar parameter can be set to "pooled" for equal variances or "unequal" for unequal variances.
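The call described above could look roughly like the following sketch. As before, the "Label" column name and the synthetic data are assumptions that keep the snippet self-contained, so the numbers it prints will differ from those obtained on the real dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ttest_ind

# Synthetic stand-in data. ASSUMPTION: the real df has a "Label" column
# with "N"/"T" values and a numeric "PC 34:1" column.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Label": ["N"] * 50 + ["T"] * 50,
    "PC 34:1": np.concatenate([rng.normal(10, 2, 50), rng.normal(11, 2, 50)]),
})

group_t = df.loc[df["Label"] == "T", "PC 34:1"].to_numpy()
group_n = df.loc[df["Label"] == "N", "PC 34:1"].to_numpy()

# Two-sided t-test assuming equal variances in both groups
t_stat, p_value, dof = ttest_ind(
    group_t, group_n, alternative="two-sided", usevar="pooled"
)
print(t_stat, p_value, dof)
```

With usevar="pooled", the degrees of freedom equal the total number of samples minus two.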
The function returns a tuple of the form (test statistic, p-value of the t-test, degrees of freedom used in the t-test):
(2.0056017821278194, 0.04621995094509444, 204.0)
Since p < 0.05, the difference between the T and N groups for PC 34:1 is statistically significant at the 0.05 level.
T-Test with correction for multiple testing
To correct for multiple testing, we first run the t-test for every lipid species individually by putting the previous code in a for loop, and we store the resulting p-values in a Series object.
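One way to write this loop is sketched here. The three species columns, the "Label" column name and the synthetic values are hypothetical stand-ins; with the real dataset, the list of species would be whichever columns of df hold lipid measurements.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ttest_ind

# Synthetic stand-in data with three hypothetical species columns.
# ASSUMPTION: every column except "Label" is a lipid species.
rng = np.random.default_rng(2)
n = 30
df = pd.DataFrame({
    "Label": ["N"] * n + ["T"] * n,
    "PC 34:1": np.concatenate([rng.normal(10, 1, n), rng.normal(15, 1, n)]),
    "PC 36:2": np.concatenate([rng.normal(8, 1, n), rng.normal(8, 1, n)]),
    "PE 38:4": np.concatenate([rng.normal(5, 1, n), rng.normal(5.5, 1, n)]),
})

is_t = df["Label"] == "T"
is_n = df["Label"] == "N"
species = [col for col in df.columns if col != "Label"]

# Run one t-test per species and keep only the p-value
results = {}
for name in species:
    _, p_value, _ = ttest_ind(
        df.loc[is_t, name].to_numpy(),
        df.loc[is_n, name].to_numpy(),
        usevar="pooled",
    )
    results[name] = p_value

# Series indexed by species name
p_values = pd.Series(results, name="p-value")
print(p_values)
```

Collecting the results in a dictionary first and converting it to a Series at the end keeps the species names as the index, which makes later filtering by p-value straightforward.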
Next we'll use the multipletests function from statsmodels to adjust the p-values for multiple testing, and we'll print all the species with a corrected p-value < 0.05. The multipletests function returns four values, of which we only need the corrected p-values, returned in the second position. In the other positions we'll type an underscore to ignore the returned results: the first is a boolean array that states whether each hypothesis can be rejected for the given alpha, and the third and fourth contain a corrected alpha according to Sidak and Bonferroni, respectively.
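The correction step might be sketched as follows. The p-values here are made up for illustration, and the choice of method="fdr_bh" (Benjamini-Hochberg) is an assumption, since the text does not state which correction method was used.

```python
import pandas as pd
from statsmodels.stats.multitest import multipletests

# Made-up p-values for four hypothetical species (illustration only)
p_values = pd.Series(
    {"PC 34:1": 0.001, "PC 36:2": 0.04, "PE 38:4": 0.2, "SM 34:1": 0.6}
)

# Keep only the corrected p-values (second return value); the underscores
# discard the reject flags and the Sidak/Bonferroni corrected alphas.
# ASSUMPTION: Benjamini-Hochberg FDR is used as the correction method.
_, corrected, _, _ = multipletests(
    p_values.to_numpy(), alpha=0.05, method="fdr_bh"
)

# Re-attach the species names and keep the significant ones
p_corrected = pd.Series(corrected, index=p_values.index)
significant = p_corrected[p_corrected < 0.05]
print(significant)
```

In this toy example, PC 36:2 has an uncorrected p-value below 0.05 but no longer passes the threshold after correction, which is exactly the effect the adjustment is meant to have.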