Adjustments of p-values for multiple comparisons

In manuscripts presenting the statistical analysis of lipidomics or metabolomics data, you will often find the expression "p-value adjusted for multiple comparisons". Examples of such manuscripts are presented below:

  • M. Osetrova et al. Lipidome atlas of the adult human brain. DOI: https://doi.org/10.1038/s41467-024-48734-y

  • J. Idkowiak et al. Robust and high-throughput lipidomic quantitation of human blood samples using flow injection analysis with tandem mass spectrometry for clinical use. DOI: https://doi.org/10.1007/s00216-022-04490-w

  • A. Kvasnička et al. Alterations in lipidome profiles distinguish early-onset hyperuricemia, gout, and the effect of urate-lowering treatment. DOI: https://doi.org/10.1186/s13075-023-03204-6

  • D. Hornburg et al. Dynamic lipidome alterations associated with human health, disease and ageing. DOI: https://doi.org/10.1038/s42255-023-00880-1

  • R. Jirásko et al. Altered Plasma, Urine, and Tissue Profiles of Sulfatides and Sphingomyelins in Patients with Renal Cell Carcinoma. DOI: https://doi.org/10.3390/cancers14194622

  • D. Wolrab et al. Lipidomic profiling of human serum enables detection of pancreatic cancer. DOI: https://doi.org/10.1038/s41467-021-27765-9

Here, we will briefly explain what this expression means.

When looking for differences between two biological groups, we usually perform hypothesis testing on many features (lipids, metabolites). In the exemplary data set used throughout this GitBook, concentrations were measured for 127 lipids, so a two-group comparison requires 127 statistical tests. Repeating a test, e.g., a t-test, so many times increases the chance of obtaining false positive results: the test suggests a difference in means between two groups (p-value < 0.05), while in reality the null hypothesis of no difference should be retained. This situation is referred to as a 'type I error'. Frequently, such results lie at the edge of acceptance, e.g., with p-values between 0.03 and 0.05, just below the threshold for rejecting the null hypothesis. This is why we "adjust p-values for multiple comparisons" - to avoid reporting results that are potentially false positives. The adjustment is a mathematical correction that increases the p-values, so that results at the edge of acceptance (p-values close to 0.05) end up above the assumed threshold and the null hypothesis is retained. A number of methods for adjusting p-values exist. Here, we will show you two of them: the Bonferroni correction, considered a very conservative method, and the very popular Benjamini-Hochberg (BH) method, also referred to as the 'false discovery rate' ('fdr') correction.
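To get a feeling for how these two corrections behave, the short sketch below applies base R's p.adjust() function to a small, purely illustrative vector of raw p-values (the numbers are made up and are not taken from our data set):

# Illustration only: made-up raw p-values from eight hypothetical tests
raw.p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)

# Bonferroni: each p-value is multiplied by the number of tests (capped at 1)
p.adjust(raw.p, method = "bonferroni")

# Benjamini-Hochberg: controls the false discovery rate, less conservative
p.adjust(raw.p, method = "BH")

# How many results stay below 0.05 with each method?
sum(p.adjust(raw.p, method = "bonferroni") < 0.05)
sum(p.adjust(raw.p, method = "BH") < 0.05)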

At this step, we strongly encourage you to watch the YouTube video prepared by Josh Starmer, the creator of StatQuest - a YouTube channel, originating from the University of North Carolina at Chapel Hill, that explains frequently used statistical methods and tests in simple terms. In the video, Josh explains the concept of the false discovery rate and how the Benjamini-Hochberg method works. You will also see that the actual math behind the BH method is simple.

Several methods for p-value correction are implemented in R. We will correct the p-values using the adjust_pvalue() function from the rstatix package, which can be applied right after t_test() in the pipeline. First, we will use the Bonferroni correction. Take a look at this code:

# Adjusting p-values for multiple comparisons.
# Loading the required libraries:
library(rstatix)
library(tidyverse)

# Reading the documentation:
?adjust_pvalue

# Example - t-test:
t.test.bonferroni.signif <- 
  data %>%
  select(-`Sample Name`) %>%
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(Lipids) %>%
  t_test(Concentrations ~ Label,
         ref.group = "N",
         p.adjust.method = 'none') %>%
  adjust_pvalue(method = "bonferroni") %>%
  add_significance(p.col = "p.adj") %>%
  filter(p.adj < 0.05) %>%
  arrange(p.adj)

Here, we select from the wide 'data' tibble all numeric variables and the factor variable stored in the Label column. We create a long tibble with the columns Label, Lipids, and Concentrations. Next, we group the values by lipid species and perform t-tests on all numeric variables, comparing the mean plasma concentrations of lipids in patients with pancreatitis (PAN) or pancreatic cancer (PDAC) to the reference group - healthy controls (N). The argument p.adjust.method is set to 'none', meaning the p-values are adjusted in the next step by adjust_pvalue(), whose method argument is set to 'bonferroni'. We additionally add to the final tibble a column with asterisk annotations representing the adjusted p-values (add_significance()), filter the results to keep only those that remain significant after the correction, and arrange them by the p.adj column. This way, we retain 41 significant outcomes.
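If you would like to inspect how the correction changed individual results, you can peek at the relevant columns of the resulting tibble (the column names below follow the rstatix output produced by the pipeline above):

# Inspecting raw vs. adjusted p-values for the retained results:
t.test.bonferroni.signif %>%
  select(Lipids, group1, group2, p, p.adj, p.adj.signif) %>%
  head(10)

# Number of results that remain significant after the correction:
nrow(t.test.bonferroni.signif)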

The math behind this correction is also very simple:

p.adj = p × n

The corrected p-value (p.adj) is simply the raw p-value (p) from the t-test multiplied by the total number of comparisons, i.e., the number of t-tests performed (n). In our case, 127 lipids are compared for N vs T and N vs PAN, giving 127 × 2 = 254 comparisons. Now you can see how strict the Bonferroni correction is.
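For example, a raw p-value of 0.0003 becomes 0.0003 × 254 ≈ 0.076 after the Bonferroni correction and is therefore no longer significant at the 0.05 level (adjusted p-values are capped at 1). A minimal check in R, using illustrative numbers only:

# Bonferroni by hand vs. p.adjust() - illustrative raw p-value:
p.raw <- 0.0003
n <- 254                     # 127 lipids x 2 group contrasts
min(p.raw * n, 1)            # manual Bonferroni: 0.0762, no longer < 0.05
p.adjust(rep(p.raw, n), method = "bonferroni")[1]   # the same result via p.adjust()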

We can quickly check how many of the raw p-values that were below the 0.05 threshold become insignificant after the correction using this code:

# Checking the Bonferroni correction influence:
t.test.bonferroni.no.signif <- 
  data %>%
  select(`Label`,
         where(is.numeric)) %>%
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(Lipids) %>%
  t_test(Concentrations ~ Label,
         ref.group = "N",
         p.adjust.method = "none") %>%
  adjust_pvalue(method = "bonferroni") %>%
  filter(p < 0.05 & p.adj >= 0.05) %>%
  arrange(p.adj)

This way, we find that 87 t-test results were filtered out by the Bonferroni correction, with raw p-values ranging from 0.000251 to 0.049. Finally, we can check the number of insignificant outcomes:

# Checking the Bonferroni correction influence:
t.test.no.signif <- 
  data %>%
  select(`Label`,
         where(is.numeric)) %>%
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(Lipids) %>%
  t_test(Concentrations ~ Label,
         ref.group = "N",
         p.adjust.method = "none") %>%
  filter(p >= 0.05) %>%
  arrange(p.adj)

We obtained 126 insignificant outcomes, which together with the 128 raw significant results makes 254 tests. Only 32% (41 of 128) of the raw significant results were retained after the Bonferroni correction.
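Instead of comparing the sizes of several tibbles, these counts can also be obtained in a single pass. The sketch below reuses the same pipeline but keeps all results (no filtering) and then summarizes them; it assumes the same 'data' tibble and rstatix column names as above:

# Summarizing the effect of the Bonferroni correction in one table:
t.test.bonferroni.all <- 
  data %>%
  select(-`Sample Name`) %>%
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(Lipids) %>%
  t_test(Concentrations ~ Label,
         ref.group = "N",
         p.adjust.method = "none") %>%
  adjust_pvalue(method = "bonferroni")

t.test.bonferroni.all %>%
  ungroup() %>%
  summarise(total.tests        = n(),
            raw.significant    = sum(p < 0.05),
            adj.significant    = sum(p.adj < 0.05),
            lost.by.correction = sum(p < 0.05 & p.adj >= 0.05))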

Now, let's perform similar computations using the Benjamini-Hochberg method:

# Using the Benjamini-Hochberg approach for adjusting p-values:
t.test.BH.signif <- 
  data %>%
  select(-`Sample Name`) %>%
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(Lipids) %>%
  t_test(Concentrations ~ Label,
         ref.group = "N",
         p.adjust.method = 'none') %>%
  adjust_pvalue(method = "BH") %>%
  filter(p.adj < 0.05) %>%
  arrange(p.adj)

# Checking the Benjamini-Hochberg approach influence:
t.test.BH.no.signif <- 
  data %>%
  select(`Label`,
         where(is.numeric)) %>%
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(Lipids) %>%
  t_test(Concentrations ~ Label,
         ref.group = "N",
         p.adjust.method = "none") %>%
  adjust_pvalue(method = "BH") %>%
  filter(p < 0.05 & p.adj >= 0.05) %>%
  arrange(p.adj)

After the BH correction, we obtained 97 significant differences, while 31 raw significant outcomes were filtered out; together with the 126 insignificant outcomes, this again gives 254 comparisons. In this case, nearly 76% (97 of 128) of the raw significant results were retained. The BH correction removed positive t-test results with raw p-values ranging from 0.02 to 0.049.
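To convince yourself that the BH math is indeed simple, the sketch below reproduces the adjustment 'by hand' for a small, made-up set of p-values and compares it with p.adjust(); the values are illustrative only:

# Benjamini-Hochberg "by hand" for a small, made-up set of raw p-values:
raw.p <- sort(c(0.001, 0.008, 0.039, 0.041, 0.060))   # sorted, smallest first
n <- length(raw.p)
i <- seq_len(n)                                        # ranks 1..n

# Step 1: scale each p-value by n / rank:
scaled <- raw.p * n / i

# Step 2: enforce monotonicity from the largest rank downwards and cap at 1:
bh.manual <- pmin(1, rev(cummin(rev(scaled))))

# The built-in implementation gives the same result:
all.equal(bh.manual, p.adjust(raw.p, method = "BH"))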

Compared with the BH correction, the Bonferroni correction retained only the differences with the lowest p-values. However, there is a risk that such a conservative correction will discard many true differences, potentially leading to a dramatic increase in the number of type II errors - failures to detect actual differences.

Hence, in the next parts of this GitBook, we will continue applying the classic Benjamini-Hochberg correction to remove potential false positives.

[Embedded video] StatQuest: the false discovery rate and the Benjamini-Hochberg p-value correction.