Data normalization in R - fundamentals

A part of data transformation & normalization

Introduction

Normalization in metabolomics and lipidomics primarily aims to minimize the effects of variation caused by biological, technical, pre-analytical, and analytical factors. This variation can arise from differences in sample preparation, instrument performance, biofluid dilution, or other aspects unrelated to the actual (biological) differences under study. Multiple strategies have been developed to normalize samples and counteract technical errors. These can be classified into three categories: data-driven normalizations, internal standard (IS)-based normalizations, and quality control (QC) sample-based normalizations.

In lipidomics and metabolomics, the term "data normalization" is often understood narrowly: as converting signals into lipid/metabolite concentrations (analytical standard normalization), eliminating batch effects (batch effect normalization), or, more generally, "managing analytical variation", including adjusting the sample aliquots used for extraction and analysis (pre-acquisition sample amount normalization). In other omics sciences, however, the term covers a broader range of statistical normalizations, including post-acquisition normalizations, which can be applied when pre-acquisition normalizations fail, are not used, or are difficult to select, e.g., for samples such as saliva, breath, urine, or stool. Furthermore, normalization can address sources of variance beyond the analytical ones, e.g., pre-analytical variation (related to sample collection or storage) or unwanted biological variation.

Nonetheless, analytical chemists in lipidomics and metabolomics, in most cases, rely on:

  • Pre-acquisition normalization of sample aliquots, e.g., based on volume, mass, area, cell count, protein, DNA, or metabolite concentration,

  • Normalization against analytical standards spiked in during extraction (standard-based normalization), i.e., pre-acquisition normalization for analytical variation,

  • Batch effect normalization.

We also encourage you to check the following references for more information:

1) G. Olshansky et al. Challenges and opportunities for prevention and removal of unwanted variation in lipidomic studies. Progress in Lipid Research (2022). DOI: https://doi.org/10.1016/j.plipres.2022.101177

2) Y. Wu & L. Li. Sample normalization methods in quantitative metabolomics. Journal of Chromatography A (2016). DOI: https://doi.org/10.1016/j.chroma.2015.12.007

3) B. Low et al. Closing the Knowledge Gap of Post-Acquisition Sample Normalization in Untargeted Metabolomics. ACS Measurement Science Au (2024). DOI: https://doi.org/10.1021/acsmeasuresciau.4c00047

4) Lipidomics Standards Initiative (LSI) Consortium. Lipid Species Quantification. https://lipidomicstandards.org/lipid-species-quantification/

5) Lipidomics Standards Initiative (LSI) Consortium. Lipidomics needs more standardization. Nature Metabolism (2019). DOI: https://doi.org/10.1038/s42255-019-0094-z

6) B. Drotleff & M. Lämmerhofer. Guidelines for Selection of Internal Standard-Based Normalization Strategies in Untargeted Lipidomic Profiling by LC-HR-MS/MS. Analytical Chemistry (2019). DOI: https://doi.org/10.1021/acs.analchem.9b01505

7) M. Wang et al. Selection of internal standards for accurate quantification of complex lipid species in biological extracts by electrospray ionization mass spectrometry – What, how and why? Mass Spectrometry Reviews (2016). DOI: https://doi.org/10.1002/mas.21492

8) H. C. Köfeler et al. Recommendations for good practice in MS-based lipidomics. Journal of Lipid Research (2021). DOI: https://doi.org/10.1016/j.jlr.2021.100138

9) M. Holčapek et al. Lipidomic analysis. Analytical Chemistry (2018). DOI: https://doi.org/10.1021/acs.analchem.7b05395

Pre-analytical variation

Pre-analytical variation is a well-recognized concern in omics fields beyond lipidomics and metabolomics, where specific endogenous genes or proteins, known as "housekeeping genes/proteins," are used to address it. These genes or proteins are considered unrelated to the biological variation of interest; their presence or absence indicates potential issues in sample collection, handling, and storage, allowing sources of pre-analytical variation to be assessed. Although sample collection and storage strongly affect the final lipidomic or metabolomic results, no control lipid or metabolite features have been widely adopted so far, and different lipid/metabolite classes may be affected differently at these steps. Consequently, pre-analytical variation is rarely accounted for in lipidomics and metabolomics studies, in contrast to pre- and post-acquisition sample amount normalization, pre- and post-acquisition correction of analytical variation (for which the lipidomics and metabolomics communities provide clear guidelines and recommendations), and the removal of unwanted biological variation.

Biological variation

While it is impossible to account for all biological factors, we can reduce the impact of some of them. Normalization of biofluids such as urine usually involves adjusting concentrations to creatinine levels or osmolality; this step is crucial to account for the high concentration variability in urine caused by different hydration statuses and diurnal variation. In cell culture experiments, normalization to cell count or total protein content is used to account for differences in cell number. More generally, analyte concentrations are typically adjusted to volume in biofluids, to weight in tissues, and to cell count or total protein/DNA content in cells. Additionally, more sophisticated techniques, such as probabilistic quotient normalization (PQN), may be employed to correct precisely for sample dilution, improving the accuracy of analyte quantification (see the PQN reference listed at the end of this page).
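
For illustration, below is a minimal sketch of how such adjustments could look in R: a simple creatinine-based normalization of urinary intensities followed by a basic PQN implementation. The function name `pqn_normalize`, the mock data, and all object names are assumptions made for this example, not code from a specific package.

```r
# Minimal sketch (illustrative, not package code): creatinine normalization
# and probabilistic quotient normalization (PQN) on mock data.

# Mock intensity matrix: samples in rows, metabolites in columns
set.seed(1)
X <- matrix(rlnorm(5 * 4, meanlog = 10), nrow = 5,
            dimnames = list(paste0("sample_", 1:5), paste0("metab_", 1:4)))

# 1) Creatinine-based normalization of urine samples:
#    divide each sample (row) by its creatinine level (mock values)
creatinine <- c(1.2, 0.8, 1.5, 1.0, 0.9)
X_crea <- sweep(X, 1, creatinine, FUN = "/")

# 2) Probabilistic quotient normalization (PQN)
pqn_normalize <- function(X, reference = NULL) {
  # Reference spectrum: feature-wise median across samples
  # (a QC-derived reference could be supplied instead)
  if (is.null(reference)) reference <- apply(X, 2, median, na.rm = TRUE)
  # Sample-specific dilution factor: median of quotients to the reference
  dilution <- apply(X, 1, function(s) median(s / reference, na.rm = TRUE))
  # Divide each sample by its estimated dilution factor
  sweep(X, 1, dilution, FUN = "/")
}

X_pqn <- pqn_normalize(X_crea)
```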

Technical variation

The entire workflow, from sample collection and storage to sample preparation, can introduce factors contributing to unwanted variation. During sample preparation, internal standards are essential: they can effectively account for several potential issues, such as human error, pipetting reproducibility, and fluctuations in instrument performance. Their correct use is crucial, and we advise following the recommendations of the Metabolomics Quality Assurance and Quality Control Consortium (mQACC) and the Lipidomics Standards Initiative (LSI).
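
As a quick illustration of the underlying idea (IS-based normalization is covered in detail in the next chapter), the sketch below computes analyte-to-IS response ratios for a small mock data set; all column names and values are assumptions made for this example.

```r
# Minimal sketch of internal standard (IS) ratio normalization on mock data
library(dplyr)
library(tibble)

raw_data <- tibble(
  sample       = c("s1", "s1", "s2", "s2"),
  lipid        = c("PC 34:1", "PC 36:2", "PC 34:1", "PC 36:2"),
  intensity    = c(1.8e6, 9.5e5, 2.1e6, 1.1e6),
  is_intensity = c(4.0e5, 4.0e5, 5.2e5, 5.2e5)  # class-matched IS signal per sample
)

# Express every analyte as a response ratio to its class-matched internal standard
normalized <- raw_data %>%
  mutate(response_ratio = intensity / is_intensity)
```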

Analytical variation

Apart from the variation induced by sample preparation, there is also variation arising from the instrumental analysis itself. Various factors contribute, such as fluctuations in instrument performance (e.g., changes in detector response), data acquisition time, technical errors, and batch effects. To account for them, QC-based normalization approaches are frequently employed. A QC sample, typically a pooled sample prepared by combining aliquots of the individual samples within a batch, is injected repeatedly throughout the sequence, usually every 5th to 10th injection, with the frequency depending on the batch size. Several approaches to QC-based correction exist, such as locally estimated scatterplot smoothing (LOESS) or systematic error removal using random forest (SERRF).
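
A minimal sketch of the LOESS-based idea is shown below for a single feature, assuming a known injection order and regularly injected pooled QCs. The mock data, column names, and the `span` setting are illustrative assumptions; dedicated tools implement far more robust versions of this correction.

```r
# Minimal sketch of QC-based LOESS drift correction for a single feature
set.seed(42)
run <- data.frame(
  injection_order = 1:30,
  sample_type     = ifelse(1:30 %% 5 == 0, "QC", "Sample"),
  intensity       = 1e5 * (1 + 0.01 * (1:30)) * rlnorm(30, sdlog = 0.05)
)

# Fit LOESS to the QC injections as a function of injection order;
# surface = "direct" allows prediction outside the first/last QC injection
qc <- subset(run, sample_type == "QC")
drift_fit <- loess(intensity ~ injection_order, data = qc, span = 1,
                   control = loess.control(surface = "direct"))

# Predict the drift at every injection, divide it out,
# and rescale to the median QC intensity
predicted_drift <- predict(drift_fit, newdata = run)
run$intensity_corrected <- run$intensity / predicted_drift * median(qc$intensity)
```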

The aforementioned approaches can be used to correct signal drift within a single batch and between two or more batches. Some workflows apply total-ion-current filtering, QC-robust spline batch correction, and spectral cleaning to reduce analytical variation.
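
As a simple illustration of a between-batch adjustment, the sketch below rescales each batch of one feature to a common median. The data and names are mock assumptions, and QC-based spline or SERRF-type corrections are generally preferable in practice.

```r
# Minimal sketch of between-batch correction by median scaling (one feature)
library(dplyr)
library(tibble)

set.seed(7)
batch_data <- tibble(
  batch     = rep(c("batch_1", "batch_2"), each = 6),
  intensity = c(rlnorm(6, meanlog = 11.0, sdlog = 0.1),
                rlnorm(6, meanlog = 11.4, sdlog = 0.1))
)

overall_median <- median(batch_data$intensity)

# Rescale each batch so that its median matches the overall median
corrected <- batch_data %>%
  group_by(batch) %>%
  mutate(intensity_corrected = intensity / median(intensity) * overall_median) %>%
  ungroup()
```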

One of these approaches implemented in R, MRMkit, is an all-in-one tool for fully automated, reproducible peak integration (accounting for retention time offset patterns), data normalization, quality metrics reporting, visualizations for fast data quality evaluation, and batch correction.

Based on: G. Olshansky et al. Challenges and opportunities for prevention and removal of unwanted variation in lipidomic studies. Progress in Lipid Research (2022). DOI: https://doi.org/10.1016/j.plipres.2022.101177

Additional resources linked in this section:

  • Probabilistic Quotient Normalization as Robust Method to Account for Dilution of Complex Biological Mixtures. Application in 1H NMR Metabonomics (ACS Publications)
  • Performance Evaluation and Online Realization of Data-driven Normalization Methods Used in LC/MS based Untargeted Metabolomics Analysis (Nature)
  • mQACC – Metabolomics Quality Assurance and Quality Control Consortium
  • Lipidomics Standards Initiative (lipidomicstandards.org)
  • Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting (Taylor & Francis)
  • Systematic Error Removal Using Random Forest for Normalizing Large-Scale Untargeted Lipidomics Data (ACS Publications)
  • Characterising and correcting batch variation in an automated direct infusion mass spectrometry (DIMS) metabolomics workflow (SpringerLink)
  • MRMkit: Automated Data Processing for Large-Scale Targeted Metabolomics Analysis (ACS Publications)