> For the complete documentation index, see [llms.txt](https://laboratory-of-lipid-metabolism-a.gitbook.io/omics-data-visualization-in-r-and-python/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://laboratory-of-lipid-metabolism-a.gitbook.io/omics-data-visualization-in-r-and-python/metabolites-and-lipids-univariate-statistical-analysis-in-python/two-sample-comparisons-in-python.md).

# Two sample comparisons in Python

## Required packages

The required packages for this section are pandas, scipy and statsmodels. These can be installed with the following command in the command window (Windows) / terminal (Mac).

```
pip install pandas statsmodels scipy
```

## Loading the data

We will again use the demo lipidomics dataset:

{% file src="/files/O8GcvqGYOzpXcfU7NayN" %}

Load the dataset into a Pandas DataFrame named df as described in the basic plotting section:

```python
import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx")
df.set_index("Sample Name", inplace=True)
```

## Test for normality

Statsmodels contains a Lilliefors’ test for normality testing, which is a Kolmogorov-Smirnov test with estimated parameters. The function can be used as follows, for example on species PC 34:1 for N and T labelled samples:

```python
from statsmodels.stats.diagnostic import kstest_normal

lipid = "PC 34:1"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]

print(kstest_normal(group1, "norm"))
print(kstest_normal(group2, "norm"))
```

The function returns 2 values: Kolmogorov-Smirnov test statistic and a pvalue; if the pvalue is lower than some threshold, e.g. 0.05, then we can reject the Null hypothesis that the sample comes from a normal distribution.

```
group1 result: (0.08189149328478457, 0.08119506735189941)
group2 result: (0.08643554971567619, 0.078397863558115)
```

Since the p-values is >0.05 for both groups, the distributions do not deviate significantly from the normal distribution and we assume normality.

## T-Test

We'll perform a T-test between the T and N labelled samples for species PC 34:1. The ttest\_ind function from the statsmodels library takes as the first 2 arguments arrays containing the values of the 2 groups, and additionally can be configured for one- or two-sided testing with the *alternative* parameter. The *usevar* parameter can be set to "pooled" for equal variance or "unequal" for unequal variance.

```python
from statsmodels.stats.weightstats import ttest_ind

lipid = "PC 34:1"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]
ttest_ind(group1, group2, alternative='two-sided', usevar='pooled') 
```

The function returns the following result: (test statistic, pvalue of the t-test, degrees of freedom used in the t-test):

```
(2.0056017821278194, 0.04621995094509444, 204.0)
```

Since p<0.05, there is a significant difference between the T and N groups for PC 34:1.

## T-Test with correction for multiple testing

To correct for multiple testing, we'll start by calculating the t-test individually for all lipid species by putting the previous code in a for loop, and we'll store the results in a Series object:

```python
from statsmodels.stats.multitest import multipletests

pvalues = pd.Series()
for lipid in df.columns[1:]:
    group1 = df[df.Label == "T"][lipid]
    group2 = df[df.Label == "N"][lipid]
    _, pvalue, _ = ttest_ind(group1, group2, alternative='two-sided', usevar='pooled')
    pvalues[lipid] = pvalue
```

Next we'll use the *multipletests* function from *statsmodels* to adjust the pvalues for multiple testing, and we'll print out all the species with a pvalue < 0.05. The multipletests functions return 4 different result variables, and we are only interested in the corrected p-values, which is returned in the second position. In the other positions we'll type an underscore to ignore these returned results (which is in the first position a boolean that states if the hypothesis can be rejected for the given alpha, and the variables in position 3 and 4 contain a corrected alpha according to Sidak and Bonferroni respectively):

```python
_, corr_pvalues, _, _ = multipletests(
    pvalues, 
    alpha=0.05, 
    method='holm-sidak'
    )
corr_pvalues = pd.Series(corr_pvalues, index=pvalues.index)
print(corr_pvalues[corr_pvalues < 0.05])
```

result:

```
CE 16:0             3.009264e-03
CE 18:3             8.066735e-06
CE 18:2             1.338064e-05
CE 20:4             1.812175e-02
CE 20:3             2.942459e-02
TG 50:4             1.275879e-02
TG 52:6             1.766055e-04
TG 52:5             2.988285e-02
TG 52:2             1.376055e-02
TG 54:8             5.144980e-04
TG 54:7             1.413293e-02
TG 56:2             2.978565e-02
DG 32:2             4.213467e-02
DG 34:3             1.789583e-02
DG 38:7             1.112511e-03
Cer 34:1;O2         4.749118e-04
Cer 36:1;O2         3.444237e-05
Cer 39:1;O2         8.102510e-06
Cer 41:1;O2         1.021419e-10
Cer 42:1;O2         2.487048e-11
PC 32:2             9.507684e-07
PC O-34:3/P-34:2    4.444021e-13
PC O-34:2/P-34:1    3.984211e-12
PC 34:2             1.713111e-09
PC O-36:5/P-36:4    5.144980e-04
PC O-36:4/P-36:3    7.080177e-05
PC O-36:3/P-36:2    2.098197e-12
PC 36:5             8.592539e-06
PC 36:3             3.099436e-03
PC 36:2             7.159109e-07
PC O-38:6/P-38:5    1.072691e-06
PC 37:2             1.808452e-12
PC 38:5             2.103493e-05
LPC 14:0            1.575674e-08
LPC 16:0            1.815249e-06
LPC 18:2            2.097741e-16
LPC 18:1            1.181612e-02
LPC 18:0            1.552810e-11
LPC 20:4            6.959002e-03
SM 32:1;O2          5.421950e-07
SM 38:2;O2          1.098073e-02
SM 38:1;O2          5.158079e-08
SM 39:1;O2          4.126438e-18
SM 40:2;O2          1.713111e-09
SM 40:1;O2          6.855879e-11
SM 41:2;O2          1.290760e-10
SM 41:1;O2          1.080575e-19
SM 42:1;O2          1.130474e-15
```

Different methods can be selected for the correction, which are listed in the [statsmodels documentation](https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html).&#x20;

## Mann-Whitney

In case the data is not normally distributed, the non parametric Mann-Whitney test is used instead of the T-test.

<pre class="language-python"><code class="lang-python">from scipy.stats import mannwhitneyu

<strong>lipid = "PC 36:5"
</strong>group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]
mannwhitneyu(group1, group2)
</code></pre>

result:

```
MannwhitneyuResult(statistic=2467.0, pvalue=4.0874221736427684e-11)
```

And to get the Mann-Whitney test on all species with correction for multiple testing:

```python
from statsmodels.stats.multitest import multipletests
from scipy.stats import mannwhitneyu

pvalues = pd.Series()
for lipid in df.columns[1:]:
    group1 = df[df.Label == "T"][lipid]
    group2 = df[df.Label == "N"][lipid]
    _, pvalue = mannwhitneyu(group1, group2)
    pvalues[lipid] = pvalue
    
_, corr_pvalues, _, _ = multipletests(
    pvalues, 
    alpha=0.05, 
    method='holm-sidak'
    )
corr_pvalues = pd.Series(corr_pvalues, index=pvalues.index)
print(corr_pvalues[corr_pvalues < 0.05])
```
