đź’Ş
Omics data visualization in R and Python
  • Introduction
    • From Authors
    • Virtual environments - let's begin
    • Getting started with Python
    • Getting started with R
    • Example data sets
  • PERFORMING FUNDAMENTAL OPERATIONS ON OMICs DATA USING R
    • Fundamental data structures
    • Loading data into R
    • Preferred formats in metabolomics and lipidomics analysis
    • Preprocess data type using Tidyverse package
    • Useful R tricks and features in OMICs mining
      • Application of pipe (%>%) functions
      • Changing data frames format with pivot_longer()
      • Data wrangling syntaxes useful in OMICs mining
      • Writing functions in R
      • The 'for' loop in R (advanced)
  • PERFORMING FUNDAMENTAL OPERATIONS ON OMICs DATA USING PYTHON
    • Fundamental data structures
    • Loading data into Python
  • Missing values handling in R
    • Missing values – Introduction
    • Detecting missing values (DataExplorer R package)
    • Filtering out columns containing mostly NAs
    • Data imputation by different available R libraries
      • Basic data imputation in R with dplyr and tidyr (tidyverse)
      • Data imputation using recipes library (tidymodels)
      • Replacing NAs via k-nearest neighbor (kNN) model (VIM library)
      • Replacing NAs via random forest (RF) model (randomForest library)
  • Missing values handling in Python
    • Detecting missing values
    • Filtering out columns containing mostly NAs
    • Data imputation
  • Data transformation, scaling, and normalization in R
    • Data normalization in R - fundamentals
    • Data normalization to the internal standards (advanced)
    • Batch effect corrections in R (advanced)
    • Data transformation and scaling - introduction
    • Data transformation and scaling using different available R packages
      • Data transformation and scaling using mutate()
      • Data transformation and scaling using recipes R package
      • Data Normalization – bestNormalize R package
  • Data transformation, scaling, and normalization in Python
    • Data Transformation and scaling in Python
  • Metabolites and lipids descriptive statistical analysis in R
    • Computing descriptive statistics in R
    • Using gtsummary to create publication-ready tables
    • Basic plotting in R
      • Bar charts
      • Box plots
      • Histograms
      • Density plots
      • Scatter plots
      • Dot plots with ggplot2 and tidyplots
      • Correlation heat maps
    • Customizing ggpubr and ggplot2 charts in R
    • Creating interactive plots with ggplotly
    • GGally for quick overviews
  • Metabolites and lipids descriptive statistics analysis in Python
    • Basic plotting
    • Scatter plots and linear regression
    • Correlation analysis
  • Metabolites and lipids univariate statistics in R
    • Two sample comparisons in R
    • Multi sample comparisons in R
    • Adjustments of p-values for multiple comparisons
    • Effect size computation and interpretation
    • Graphical representation of univariate statistics
      • Results of tests as annotations in the charts
      • Volcano plots
      • Lipid maps and acyl-chain plots
  • Metabolites and lipids univariate statistical analysis in Python
    • Two sample comparisons in Python
    • Multi-sample comparisons in Python
    • Statistical annotations on plots
  • Metabolites and lipids multivariate statistical analysis in R
    • Principal Component Analysis (PCA)
    • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    • Uniform Manifold Approximation and Projection (UMAP)
    • Partial Least Squares (PLS)
    • Orthogonal Partial Least Squares (OPLS)
    • Hierarchical Clustering (HC)
      • Dendrograms
      • Heat maps with clustering
      • Interactive heat maps
  • Metabolites and lipids multivariate statistical analysis in Python
    • Principal Component Analysis
    • t-Distributed Stochastic Neighbor Embedding
    • Uniform Manifold Approximation and Projection
    • PLS Discriminant Analysis
    • Clustered heatmaps
  • OMICS IN MACHINE LEARNING APPROACHES IN R AND PYTHON
    • Application of selected models to OMICs data
    • OMICs machine learning – Examples
  • References
    • Library versions
Powered by GitBook
On this page
  • Required packages
  • Loading the data
  • Test for normality
  • T-Test
  • T-Test with correction for multiple testing
  • Mann-Whitney
  1. Metabolites and lipids univariate statistical analysis in Python

Two sample comparisons in Python

PreviousLipid maps and acyl-chain plotsNextMulti-sample comparisons in Python

Last updated 7 months ago

Required packages

The required packages for this section are pandas, scipy and statsmodels. These can be installed with the following command in the command window (Windows) / terminal (Mac).

pip install pandas statsmodels scipy

Loading the data

We will again use the demo lipidomics dataset:

Load the dataset into a Pandas DataFrame named df as described in the basic plotting section:

import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx")
df.set_index("Sample Name", inplace=True)

Test for normality

Statsmodels contains a Lilliefors’ test for normality testing, which is a Kolmogorov-Smirnov test with estimated parameters. The function can be used as follows, for example on species PC 34:1 for N and T labelled samples:

from statsmodels.stats.diagnostic import kstest_normal

lipid = "PC 34:1"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]

print(kstest_normal(group1, "norm"))
print(kstest_normal(group2, "norm"))

The function returns 2 values: Kolmogorov-Smirnov test statistic and a pvalue; if the pvalue is lower than some threshold, e.g. 0.05, then we can reject the Null hypothesis that the sample comes from a normal distribution.

group1 result: (0.08189149328478457, 0.08119506735189941)
group2 result: (0.08643554971567619, 0.078397863558115)

Since the p-values is >0.05 for both groups, the distributions do not deviate significantly from the normal distribution and we assume normality.

T-Test

We'll perform a T-test between the T and N labelled samples for species PC 34:1. The ttest_ind function from the statsmodels library takes as the first 2 arguments arrays containing the values of the 2 groups, and additionally can be configured for one- or two-sided testing with the alternative parameter. The usevar parameter can be set to "pooled" for equal variance or "unequal" for unequal variance.

from statsmodels.stats.weightstats import ttest_ind

lipid = "PC 34:1"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]
ttest_ind(group1, group2, alternative='two-sided', usevar='pooled') 

The function returns the following result: (test statistic, pvalue of the t-test, degrees of freedom used in the t-test):

(2.0056017821278194, 0.04621995094509444, 204.0)

Since p<0.05, there is a significant difference between the T and N groups for PC 34:1.

T-Test with correction for multiple testing

To correct for multiple testing, we'll start by calculating the t-test individually for all lipid species by putting the previous code in a for loop, and we'll store the results in a Series object:

from statsmodels.stats.multitest import multipletests

pvalues = pd.Series()
for lipid in df.columns[1:]:
    group1 = df[df.Label == "T"][lipid]
    group2 = df[df.Label == "N"][lipid]
    _, pvalue, _ = ttest_ind(group1, group2, alternative='two-sided', usevar='pooled')
    pvalues[lipid] = pvalue

Next we'll use the multipletests function from statsmodels to adjust the pvalues for multiple testing, and we'll print out all the species with a pvalue < 0.05. The multipletests functions return 4 different result variables, and we are only interested in the corrected p-values, which is returned in the second position. In the other positions we'll type an underscore to ignore these returned results (which is in the first position a boolean that states if the hypothesis can be rejected for the given alpha, and the variables in position 3 and 4 contain a corrected alpha according to Sidak and Bonferroni respectively):

_, corr_pvalues, _, _ = multipletests(
    pvalues, 
    alpha=0.05, 
    method='holm-sidak'
    )
corr_pvalues = pd.Series(corr_pvalues, index=pvalues.index)
print(corr_pvalues[corr_pvalues < 0.05])

result:

CE 16:0             3.009264e-03
CE 18:3             8.066735e-06
CE 18:2             1.338064e-05
CE 20:4             1.812175e-02
CE 20:3             2.942459e-02
TG 50:4             1.275879e-02
TG 52:6             1.766055e-04
TG 52:5             2.988285e-02
TG 52:2             1.376055e-02
TG 54:8             5.144980e-04
TG 54:7             1.413293e-02
TG 56:2             2.978565e-02
DG 32:2             4.213467e-02
DG 34:3             1.789583e-02
DG 38:7             1.112511e-03
Cer 34:1;O2         4.749118e-04
Cer 36:1;O2         3.444237e-05
Cer 39:1;O2         8.102510e-06
Cer 41:1;O2         1.021419e-10
Cer 42:1;O2         2.487048e-11
PC 32:2             9.507684e-07
PC O-34:3/P-34:2    4.444021e-13
PC O-34:2/P-34:1    3.984211e-12
PC 34:2             1.713111e-09
PC O-36:5/P-36:4    5.144980e-04
PC O-36:4/P-36:3    7.080177e-05
PC O-36:3/P-36:2    2.098197e-12
PC 36:5             8.592539e-06
PC 36:3             3.099436e-03
PC 36:2             7.159109e-07
PC O-38:6/P-38:5    1.072691e-06
PC 37:2             1.808452e-12
PC 38:5             2.103493e-05
LPC 14:0            1.575674e-08
LPC 16:0            1.815249e-06
LPC 18:2            2.097741e-16
LPC 18:1            1.181612e-02
LPC 18:0            1.552810e-11
LPC 20:4            6.959002e-03
SM 32:1;O2          5.421950e-07
SM 38:2;O2          1.098073e-02
SM 38:1;O2          5.158079e-08
SM 39:1;O2          4.126438e-18
SM 40:2;O2          1.713111e-09
SM 40:1;O2          6.855879e-11
SM 41:2;O2          1.290760e-10
SM 41:1;O2          1.080575e-19
SM 42:1;O2          1.130474e-15

Mann-Whitney

In case the data is not normally distributed, the non parametric Mann-Whitney test is used instead of the T-test.

from scipy.stats import mannwhitneyu

lipid = "PC 36:5"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "N"][lipid]
mannwhitneyu(group1, group2)

result:

MannwhitneyuResult(statistic=2467.0, pvalue=4.0874221736427684e-11)

And to get the Mann-Whitney test on all species with correction for multiple testing:

from statsmodels.stats.multitest import multipletests
from scipy.stats import mannwhitneyu

pvalues = pd.Series()
for lipid in df.columns[1:]:
    group1 = df[df.Label == "T"][lipid]
    group2 = df[df.Label == "N"][lipid]
    _, pvalue = mannwhitneyu(group1, group2)
    pvalues[lipid] = pvalue
    
_, corr_pvalues, _, _ = multipletests(
    pvalues, 
    alpha=0.05, 
    method='holm-sidak'
    )
corr_pvalues = pd.Series(corr_pvalues, index=pvalues.index)
print(corr_pvalues[corr_pvalues < 0.05])

Different methods can be selected for the correction, which are listed in the .

statsmodels documentation
348KB
Lipidomics_dataset.xlsx