đź’Ş
Omics data visualization in R and Python
  • Introduction
    • From Authors
    • Virtual environments - let's begin
    • Getting started with Python
    • Getting started with R
    • Example data sets
  • PERFORMING FUNDAMENTAL OPERATIONS ON OMICs DATA USING R
    • Fundamental data structures
    • Loading data into R
    • Preferred formats in metabolomics and lipidomics analysis
    • Preprocess data type using Tidyverse package
    • Useful R tricks and features in OMICs mining
      • Application of pipe (%>%) functions
      • Changing data frames format with pivot_longer()
      • Data wrangling syntaxes useful in OMICs mining
      • Writing functions in R
      • The 'for' loop in R (advanced)
  • PERFORMING FUNDAMENTAL OPERATIONS ON OMICs DATA USING PYTHON
    • Fundamental data structures
    • Loading data into Python
  • Missing values handling in R
    • Missing values – Introduction
    • Detecting missing values (DataExplorer R package)
    • Filtering out columns containing mostly NAs
    • Data imputation by different available R libraries
      • Basic data imputation in R with dplyr and tidyr (tidyverse)
      • Data imputation using recipes library (tidymodels)
      • Replacing NAs via k-nearest neighbor (kNN) model (VIM library)
      • Replacing NAs via random forest (RF) model (randomForest library)
  • Missing values handling in Python
    • Detecting missing values
    • Filtering out columns containing mostly NAs
    • Data imputation
  • Data transformation, scaling, and normalization in R
    • Data normalization in R - fundamentals
    • Data normalization to the internal standards (advanced)
    • Batch effect corrections in R (advanced)
    • Data transformation and scaling - introduction
    • Data transformation and scaling using different available R packages
      • Data transformation and scaling using mutate()
      • Data transformation and scaling using recipes R package
      • Data Normalization – bestNormalize R package
  • Data transformation, scaling, and normalization in Python
    • Data Transformation and scaling in Python
  • Metabolites and lipids descriptive statistical analysis in R
    • Computing descriptive statistics in R
    • Using gtsummary to create publication-ready tables
    • Basic plotting in R
      • Bar charts
      • Box plots
      • Histograms
      • Density plots
      • Scatter plots
      • Dot plots with ggplot2 and tidyplots
      • Correlation heat maps
    • Customizing ggpubr and ggplot2 charts in R
    • Creating interactive plots with ggplotly
    • GGally for quick overviews
  • Metabolites and lipids descriptive statistics analysis in Python
    • Basic plotting
    • Scatter plots and linear regression
    • Correlation analysis
  • Metabolites and lipids univariate statistics in R
    • Two sample comparisons in R
    • Multi sample comparisons in R
    • Adjustments of p-values for multiple comparisons
    • Effect size computation and interpretation
    • Graphical representation of univariate statistics
      • Results of tests as annotations in the charts
      • Volcano plots
      • Lipid maps and acyl-chain plots
  • Metabolites and lipids univariate statistical analysis in Python
    • Two sample comparisons in Python
    • Multi-sample comparisons in Python
    • Statistical annotations on plots
  • Metabolites and lipids multivariate statistical analysis in R
    • Principal Component Analysis (PCA)
    • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    • Uniform Manifold Approximation and Projection (UMAP)
    • Partial Least Squares (PLS)
    • Orthogonal Partial Least Squares (OPLS)
    • Hierarchical Clustering (HC)
      • Dendrograms
      • Heat maps with clustering
      • Interactive heat maps
  • Metabolites and lipids multivariate statistical analysis in Python
    • Principal Component Analysis
    • t-Distributed Stochastic Neighbor Embedding
    • Uniform Manifold Approximation and Projection
    • PLS Discriminant Analysis
    • Clustered heatmaps
  • OMICS IN MACHINE LEARNING APPROACHES IN R AND PYTHON
    • Application of selected models to OMICs data
    • OMICs machine learning – Examples
  • References
    • Library versions
Powered by GitBook
On this page
  • Required packages
  • Loading the data
  • Test for normality
  • ANOVA test
  • Levene's Homogenity Test
  • Unequal Variance: Tamhane T2 test
  • Equal Variance: Post Hoc Tukey Test
  • Kruskal-Wallis
  1. Metabolites and lipids univariate statistical analysis in Python

Multi-sample comparisons in Python

PreviousTwo sample comparisons in PythonNextStatistical annotations on plots

Last updated 7 months ago

Required packages

The required packages for this section are pandas, scipy and statsmodels. These can be installed with the following command in the command window (Windows) / terminal (Mac).

pip install pandas statsmodels scipy scikit-posthocs

Loading the data

We will again use the demo lipidomics dataset:

Load the dataset into a Pandas DataFrame named df as described in the basic plotting section:

import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx")
df.set_index("Sample Name", inplace=True)

Test for normality

Statsmodels contains a Lilliefors’ test for normality testing, which is a Kolmogorov-Smirnov test with estimated parameters. The function can be used as follows, for example on species PC 34:1 for N and T labelled samples:

from statsmodels.stats.diagnostic import kstest_normal

lipid = "PC 34:1"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "PAN"][lipid]
group3 = df[df.Label == "N"][lipid]

print(kstest_normal(group1, "norm"))
print(kstest_normal(group2, "norm"))
print(kstest_normal(group3, "norm"))

The function returns 2 values: Kolmogorov-Smirnov test statistic and a pvalue; if the pvalue is lower than some threshold, e.g. 0.05, then we can reject the Null hypothesis that the sample comes from a normal distribution.

group T result: (0.08189149328478457, 0.08119506735189941)
group PAN result: (0.11146474201970102, 0.702986809787123)
group N result: (0.08643554971567619, 0.078397863558115)

Since the p-values is >0.05 for all groups, the distributions do not deviate significantly from the normal distribution and we assume normality.

ANOVA test

We'll perform an ANOVA test for the T, PAN and N labelled samples for species PC 34:1. We'll define our model with the ols function (ordinary least squares) from statsmodels, and then we'll pass this model to anova_lm function from statsmodels. Since the ols function doesn't accept spaces or special characters in our species names, we'll replace these with underscores first:

from statsmodels.formula.api import ols
from statsmodels.api import stats

df2 = df.copy()
df2.columns = df2.columns.str.rstrip()
df2.columns = df2.columns.str.replace(' ', '_')
df2.columns = df2.columns.str.replace(':', '_')

model = ols("PC_34_1 ~ Label", data=df2).fit()
ow_anova_table = stats.anova_lm(model, typ=2)
print(ow_anova_table) 

The function returns the following result:

                sum_sq     df          F    PR(>F)
Label     3.197941e+05    2.0  10.250407  0.000055
Residual  3.494196e+06  224.0        NaN       NaN

The One Way ANOVA Test returns a p < .05 which indicates that H0 can be rejected. Accordingly there is a significant difference between the mean of at least one group compared to other groups.

Levene's Homogenity Test

Next, we'll use Levene's Homogenity Test to determine wether the different groups can be assumed to have similar variances:

from scipy.stats import levene

levene(df.loc[df.Label == "T", "PC 34:1"],
       df.loc[df.Label == "PAN", "PC 34:1"],
       df.loc[df.Label == "N", "PC 34:1"])

This function returns the following result:

LeveneResult(statistic=6.072591927982112, pvalue=0.0027022136179025353)

Since Levene's Homogenity Test resulted in a p < .05, this indicates that H0 can be rejected. Accordingly there is a significant difference between the variance of at least one group compared to other groups. This means we can proceed with a Tamhane T2 test. Had the levene's test had a p > .05, H0 could not have been rejected and we could not have assumed a difference between the variances of the groups. In that case we would have used a post hoc Tukey test. We illustrate both cases bellow (keeping in mind the Tamhane T2 test would be the more appropriate test in this case)

Unequal Variance: Tamhane T2 test

We can use the Tamhane T2 test from scikit-posthocs:

import scikit_posthocs as sp

tamhane_table = sp.posthoc_tamhane(df, val_col="PC 34:1", group_col="Label")
print(tamhane_table) 

Which gives us the result:

            N       PAN         T
N    1.000000  0.001194  0.121347
PAN  0.001194  1.000000  0.019051
T    0.121347  0.019051  1.000000

We see that PAN vs N and PAN vs T have a p > 0.05, while T vs N does not.

Equal Variance: Post Hoc Tukey Test

We can use the Tamhane T2 test from scikit-posthocs:

import scikit_posthocs as sp

tukey_table = sp.posthoc_tukey(df, val_col="PC 34:1", group_col="Label")
print(tukey_table) 

Which gives us the result:

            N       PAN         T
N    1.000000  0.001000  0.121107
PAN  0.001000  1.000000  0.002523
T    0.121107  0.002523  1.000000

We see that PAN vs N and PAN vs T have a p > 0.05, while T vs N does not.

Kruskal-Wallis

In case the data is not normally distributed, the non parametric Kruskal-Wallis test is used instead of the ANOVA test. Let's perform another normality test, this time for species PC 36:5:

from statsmodels.stats.diagnostic import kstest_normal

lipid = "PC 36:5"
group1 = df[df.Label == "T"][lipid]
group2 = df[df.Label == "PAN"][lipid]
group3 = df[df.Label == "N"][lipid]

print(kstest_normal(group1, "norm"))
print(kstest_normal(group2, "norm"))
print(kstest_normal(group3, "norm"))

Which returns the results:

(0.12467614039830022, 0.0009999999999998899)
(0.09264557901905313, 0.9091272492376122)
(0.09253781472904482, 0.0448171307073585)

Since the p-values is <0.05 for 2 groups, these 2 distributions deviate significantly from the normal distribution and we can not assume normality. We will proceed with the non parametric Kruskal-Wallis test (we reuse the group1-3 variables calculated above for the normality test):

from scipy.stats import kruskal

kruskal(group1, group2, group3)

Which return the results:

KruskalResult(statistic=54.7072751290757, pvalue=1.319672436175926e-12)

The Kruskal-Wallis test resulted in a p < .05 which indicates that H0 can be rejected. Accordingly there is a significant difference between the distribution of at least one group compared to other groups. This means we can perform a Dunn posthoc test.

import scikit_posthocs as sp
dunn_table = sp.posthoc_dunn(df, val_col="PC 34:1", group_col="Label")
print(dunn_table)

Which gives us the result:

            N       PAN         T
N    1.000000  0.000104  0.113663
PAN  0.000104  1.000000  0.002757
T    0.113663  0.002757  1.000000

We see that PAN vs N and PAN vs T have a p > 0.05, while T vs N does not.

348KB
Lipidomics_dataset.xlsx