Multi-sample comparisons in Python
The required packages for this section are pandas, scipy and statsmodels; the post hoc tests further down also use scikit-posthocs. These can be installed with the following command in the command prompt (Windows) or terminal (Mac).
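The install command itself is not reproduced on this page. With pip, a typical invocation would be `pip install pandas scipy statsmodels scikit-posthocs` (an assumption about the setup, not necessarily the original command). The short sketch below checks from within Python which of these packages are already installed; the helper name `installed_versions` is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI.
# scikit-posthocs is only needed for the post hoc tests.
REQUIRED = ["pandas", "scipy", "statsmodels", "scikit-posthocs"]


def installed_versions(packages=REQUIRED):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```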
We will again use the demo lipidomics dataset:
Load the dataset into a Pandas DataFrame named df as described in the basic plotting section:
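The loading code from the basic plotting section is not repeated here. As a rough sketch of the table layout the examples in this section assume, the snippet below reads a tiny stand-in CSV from a string. The column names `Sample Name` and `Label` and all values are assumptions made up for illustration; they are not taken from the demo dataset.

```python
import io

import pandas as pd

# Made-up stand-in rows, NOT real measurements from the demo dataset.
# Assumed layout: one row per sample, a group label column, one column per lipid species.
csv_text = """Sample Name,Label,PC 34:1,PC 36:5
s1,N,10.2,0.8
s2,N,9.7,1.1
s3,PAN,12.4,1.9
s4,PAN,11.8,2.3
s5,T,14.1,3.0
s6,T,15.3,2.6
"""

# With a real file this would be pd.read_csv(<path to the dataset>) instead.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.groupby("Label").size())  # number of samples per group
```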
Statsmodels provides the Lilliefors test for normality, which is a Kolmogorov-Smirnov test in which the distribution parameters are estimated from the sample. The function can be used as follows, for example on species PC 34:1 for the N and T labelled samples:
The function returns two values: the Kolmogorov-Smirnov test statistic and a p-value. If the p-value is lower than a chosen threshold, e.g. 0.05, we can reject the null hypothesis that the sample comes from a normal distribution.
Since the p-value is > 0.05 for all groups, the distributions do not deviate significantly from the normal distribution and we assume normality.
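A minimal sketch of such a call is shown here. Because the demo dataset is not bundled with this page, the snippet generates synthetic values, so the numbers it prints will not match the tutorial's results. The `Label` column name is an assumption.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.diagnostic import lilliefors

# Synthetic stand-in data (assumption): 30 samples per group, normally distributed.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Label": np.repeat(["N", "T"], 30),
    "PC 34:1": np.concatenate([rng.normal(10, 2, 30), rng.normal(14, 2, 30)]),
})

results = {}
for label in ["N", "T"]:
    values = df.loc[df["Label"] == label, "PC 34:1"].to_numpy()
    # Returns the Kolmogorov-Smirnov statistic and the p-value.
    ks_stat, p_value = lilliefors(values, dist="norm")
    results[label] = (ks_stat, p_value)
    print(f"{label}: KS statistic = {ks_stat:.3f}, p-value = {p_value:.3f}")
```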
We'll perform an ANOVA test for the T, PAN and N labelled samples for species PC 34:1. We'll define our model with the ols function (ordinary least squares) from statsmodels, and then pass this model to the anova_lm function from statsmodels. Since the formula passed to ols cannot contain spaces or special characters in column names, we'll first replace these characters in our species names with underscores:
The function returns the following result:
The one-way ANOVA test returns p < .05, which indicates that H0 (all group means are equal) can be rejected. Accordingly, the mean of at least one group differs significantly from the others.
Next, we'll use Levene's homogeneity test to determine whether the different groups can be assumed to have similar variances:
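A sketch of Levene's test with `scipy.stats.levene` is given here, using synthetic stand-in groups with deliberately different spreads. The `Label` column name is an assumption, and `center="median"` is scipy's default (the Brown-Forsythe variant); the original may have used a different setting.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene

# Synthetic stand-in data (assumption): groups with clearly different variances.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Label": np.repeat(["N", "PAN", "T"], 30),
    "PC 34:1": np.concatenate([
        rng.normal(10, 1, 30),
        rng.normal(12, 2, 30),
        rng.normal(14, 4, 30),
    ]),
})

# One array of values per group.
groups = [df.loc[df["Label"] == label, "PC 34:1"].to_numpy()
          for label in ["N", "PAN", "T"]]

# H0: all groups have equal variances.
stat, p_value = levene(*groups, center="median")
print(f"Levene statistic = {stat:.3f}, p-value = {p_value:.4f}")
```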
This function returns the following result:
Since Levene's homogeneity test resulted in p < .05, H0 (equal variances) can be rejected. Accordingly, the variance of at least one group differs significantly from the others, so we proceed with a Tamhane T2 post hoc test, which does not assume equal variances. Had Levene's test given p > .05, H0 could not have been rejected and we would have had no evidence of unequal variances. In that case we would have used a post hoc Tukey test. We illustrate both cases below (keeping in mind that the Tamhane T2 test is the more appropriate test in this case).
We can use the Tamhane T2 test from scikit-posthocs:
Which gives us the result:
We see that PAN vs N and PAN vs T have a p > 0.05, while T vs N does not.
For the case where the variances can be assumed equal, we would use the Tukey test instead:
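For the equal-variance case, a Tukey HSD comparison could look like the sketch below. The page does not say which library was originally used for this step; `pairwise_tukeyhsd` from statsmodels is used here as one readily available option. The data are synthetic and the `Label` column name is an assumption.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in data (assumption): similar variances in all groups.
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "Label": np.repeat(["N", "PAN", "T"], 30),
    "PC 34:1": np.concatenate([
        rng.normal(10, 2, 30),
        rng.normal(12, 2, 30),
        rng.normal(14, 2, 30),
    ]),
})

# All pairwise group comparisons at a family-wise error rate of 0.05.
tukey = pairwise_tukeyhsd(endog=df["PC 34:1"], groups=df["Label"], alpha=0.05)
print(tukey.summary())
```

The summary lists one row per pair of groups, with the adjusted p-value and whether H0 is rejected for that pair.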
Which gives us the result:
We see that PAN vs N and PAN vs T have a p > 0.05, while T vs N does not.
If the data are not normally distributed, the non-parametric Kruskal-Wallis test is used instead of the ANOVA test. Let's perform another normality test, this time for species PC 36:5:
Which returns the results:
Since the p-value is < 0.05 for two groups, these two distributions deviate significantly from the normal distribution and we cannot assume normality. We will proceed with the non-parametric Kruskal-Wallis test (we reuse the group1-3 variables calculated above for the normality test):
Which returns the results:
The Kruskal-Wallis test resulted in p < .05, which indicates that H0 (all groups come from the same distribution) can be rejected. Accordingly, the distribution of at least one group differs significantly from the others. This means we can perform a Dunn post hoc test.
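The two steps, normality check and Kruskal-Wallis test, can be sketched together as follows. The data are synthetic, skewed stand-ins, and the assignment of `group1`, `group2` and `group3` to the N, PAN and T labels is an assumption, as is the `Label` column name.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal
from statsmodels.stats.diagnostic import lilliefors

# Synthetic stand-in data (assumption): log-normal, i.e. skewed, values.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "Label": np.repeat(["N", "PAN", "T"], 30),
    "PC 36:5": np.concatenate([
        rng.lognormal(0.0, 0.8, 30),
        rng.lognormal(0.5, 0.8, 30),
        rng.lognormal(1.0, 0.8, 30),
    ]),
})

# Assumed mapping of group1-3 to the labels.
group1 = df.loc[df["Label"] == "N", "PC 36:5"].to_numpy()
group2 = df.loc[df["Label"] == "PAN", "PC 36:5"].to_numpy()
group3 = df.loc[df["Label"] == "T", "PC 36:5"].to_numpy()

# Normality check per group (Lilliefors test).
for name, group in [("N", group1), ("PAN", group2), ("T", group3)]:
    ks_stat, p_norm = lilliefors(group, dist="norm")
    print(f"{name}: KS statistic = {ks_stat:.3f}, p-value = {p_norm:.3f}")

# Kruskal-Wallis H test on the same three groups.
h_stat, p_value = kruskal(group1, group2, group3)
print(f"Kruskal-Wallis H = {h_stat:.3f}, p-value = {p_value:.4f}")
```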
Which gives us the result:
We see that PAN vs N and PAN vs T have a p > 0.05, while T vs N does not.