Data normalization in R - fundamentals

A part of data transformation & normalization

Introduction

Normalization in metabolomics and lipidomics primarily aims to minimize the effects of variation caused by biological, technical, pre-analytical, and analytical factors. This variation can arise from differences in sample preparation, instrument performance, biofluid dilution, or other aspects unrelated to the actual (biological) differences under study. Multiple strategies have been developed to normalize samples and counteract technical errors. These can be classified into three categories: data-driven normalizations, internal standard (IS)-based normalizations, and quality control (QC) sample-based normalizations.

In lipidomics and metabolomics, the term "data normalization" is often understood narrowly: as converting signals into lipid/metabolite concentrations (analytical standard normalization), eliminating batch effects (batch effect normalization), or, more generally, "managing analytical variation", including adjusting the sample aliquots used for extraction and analysis (pre-acquisition sample amount normalization). In other omics sciences, however, the term covers a broader range of statistical normalizations, including post-acquisition normalizations, which can be applied when pre-acquisition normalizations fail, are not used, or are difficult to select, e.g., for samples such as saliva, breath, urine, or stool. Furthermore, normalization can address sources of variance beyond the analytical ones, e.g., pre-analytical variation (related to sample collection or storage) or unwanted biological variation.

Nonetheless, analytical chemists in lipidomics and metabolomics, in most cases, rely on:

  • Pre-acquisition normalization of sample aliquots, e.g., based on volume, mass, area, cell count, protein, DNA, or metabolite concentration,

  • Normalization against analytical standards spiked in during extraction (standard-based normalization), i.e., pre-acquisition normalization for analytical variation,

  • Batch effect normalization.

We also encourage you to check the following references for more information:

1) G. Olshansky et al. Challenges and opportunities for prevention and removal of unwanted variation in lipidomic studies. Progress in Lipid Research (2022). DOI: https://doi.org/10.1016/j.plipres.2022.101177

2) Y. Wu & L. Li. Sample normalization methods in quantitative metabolomics. Journal of Chromatography A (2016). DOI: https://doi.org/10.1016/j.chroma.2015.12.007

3) B. Low et al. Closing the Knowledge Gap of Post-Acquisition Sample Normalization in Untargeted Metabolomics. ACS Measurement Science Au (2024). DOI: https://doi.org/10.1021/acsmeasuresciau.4c00047

4) Lipidomics Standards Initiative (LSI) Consortium. Lipid Species Quantification. https://lipidomicstandards.org/lipid-species-quantification/

5) Lipidomics Standards Initiative (LSI) Consortium. Lipidomics needs more standardization. Nature Metabolism (2019). DOI: https://doi.org/10.1038/s42255-019-0094-z

6) B. Drotleff & M. Lämmerhofer. Guidelines for Selection of Internal Standard-Based Normalization Strategies in Untargeted Lipidomic Profiling by LC-HR-MS/MS. Analytical Chemistry (2019). DOI: https://doi.org/10.1021/acs.analchem.9b01505

7) M. Wang et al. Selection of internal standards for accurate quantification of complex lipid species in biological extracts by electrospray ionization mass spectrometry – What, how and why? Mass Spectrometry Reviews (2016). DOI: https://doi.org/10.1002/mas.21492

8) H. C. Köfeler et al. Recommendations for good practice in MS-based lipidomics. Journal of Lipid Research (2021). DOI: https://doi.org/10.1016/j.jlr.2021.100138

9) M. Holčapek et al. Lipidomic analysis. Analytical Chemistry (2018). DOI: https://doi.org/10.1021/acs.analchem.7b05395

Pre-analytical variation

Pre-analytical variation is a well-recognized concern in omics fields beyond lipidomics and metabolomics, where specific endogenous genes or proteins, known as "housekeeping genes/proteins," are used to address it. These genes or proteins are considered unrelated to the biological variation of interest; their presence or absence indicates potential issues in sample collection, handling, and storage, allowing sources of pre-analytical variation to be assessed. Although sample collection and storage strongly affect the final lipidomic or metabolomic results, no control lipid or metabolite features have been widely adopted so far, and different lipid/metabolite classes may be affected differently at these steps. Consequently, pre-analytical variation is rarely accounted for in lipidomics and metabolomics studies, in contrast to pre- and post-acquisition sample amount normalization, pre- and post-acquisition correction of analytical variation (for which the lipidomics and metabolomics communities provide clear guidelines and recommendations), and the removal of unwanted biological variation.

Biological variation

While it is impossible to account for all biological factors, we can reduce the impact of some of them. Normalization of biofluids such as urine usually involves adjusting concentrations to creatinine levels or osmolality; this step is crucial to account for the high concentration variability in urine caused by different hydration statuses and diurnal variation. In cell culture experiments, normalization to cell count or total protein content is used to account for differences in cell number. More generally, analyte concentrations are typically adjusted to volume in biofluids, to weight in tissues, and to cell count or total protein/DNA content in cells. Additionally, more sophisticated techniques, such as probabilistic quotient normalization (PQN), may be employed to correct precisely for sample dilution, improving the accuracy of analyte quantification (see the PQN reference listed at the end of this page).
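
For illustration, below is a minimal sketch of how such adjustments could look in R: a simple creatinine-based normalization of urinary intensities followed by a basic PQN implementation. The function name `pqn_normalize`, the mock data, and all object names are assumptions made for this example, not code from a specific package.

```r
# Minimal sketch (illustrative, not package code): creatinine normalization
# and probabilistic quotient normalization (PQN) on mock data.

# Mock intensity matrix: samples in rows, metabolites in columns
set.seed(1)
X <- matrix(rlnorm(5 * 4, meanlog = 10), nrow = 5,
            dimnames = list(paste0("sample_", 1:5), paste0("metab_", 1:4)))

# 1) Creatinine-based normalization of urine samples:
#    divide each sample (row) by its creatinine level (mock values)
creatinine <- c(1.2, 0.8, 1.5, 1.0, 0.9)
X_crea <- sweep(X, 1, creatinine, FUN = "/")

# 2) Probabilistic quotient normalization (PQN)
pqn_normalize <- function(X, reference = NULL) {
  # Reference spectrum: feature-wise median across samples
  # (a QC-derived reference could be supplied instead)
  if (is.null(reference)) reference <- apply(X, 2, median, na.rm = TRUE)
  # Sample-specific dilution factor: median of quotients to the reference
  dilution <- apply(X, 1, function(s) median(s / reference, na.rm = TRUE))
  # Divide each sample by its estimated dilution factor
  sweep(X, 1, dilution, FUN = "/")
}

X_pqn <- pqn_normalize(X_crea)
```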

Technical variation

The entire workflow, from sample collection and storage to sample preparation, can introduce factors contributing to unwanted variation. During sample preparation, internal standards are essential: they can effectively account for several potential issues, such as human error, pipetting reproducibility, and fluctuations in instrument performance. Their correct use is crucial, and we advise following the recommendations of the Metabolomics Quality Assurance and Quality Control Consortium (mQACC) and the Lipidomics Standards Initiative (LSI).
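
As a quick illustration of the underlying idea (IS-based normalization is covered in detail in the next chapter), the sketch below computes analyte-to-IS response ratios for a small mock data set; all column names and values are assumptions made for this example.

```r
# Minimal sketch of internal standard (IS) ratio normalization on mock data
library(dplyr)
library(tibble)

raw_data <- tibble(
  sample       = c("s1", "s1", "s2", "s2"),
  lipid        = c("PC 34:1", "PC 36:2", "PC 34:1", "PC 36:2"),
  intensity    = c(1.8e6, 9.5e5, 2.1e6, 1.1e6),
  is_intensity = c(4.0e5, 4.0e5, 5.2e5, 5.2e5)  # class-matched IS signal per sample
)

# Express every analyte as a response ratio to its class-matched internal standard
normalized <- raw_data %>%
  mutate(response_ratio = intensity / is_intensity)
```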

Analytical variation

Apart from the variation induced by sample preparation, there is also variation arising from the instrumental analysis itself. Various factors contribute, such as fluctuations in instrument performance (e.g., changes in detector response), data acquisition time, technical errors, and batch effects. To account for them, QC-based normalization approaches are frequently employed. A QC sample, typically a pooled sample prepared by combining aliquots of the individual samples within a batch, is injected repeatedly throughout the sequence, usually every 5th to 10th injection, with the frequency depending on the batch size. Several approaches to QC-based correction exist, such as locally estimated scatterplot smoothing (LOESS) or systematic error removal using random forest (SERRF).
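
A minimal sketch of the LOESS-based idea is shown below for a single feature, assuming a known injection order and regularly injected pooled QCs. The mock data, column names, and the `span` setting are illustrative assumptions; dedicated tools implement far more robust versions of this correction.

```r
# Minimal sketch of QC-based LOESS drift correction for a single feature
set.seed(42)
run <- data.frame(
  injection_order = 1:30,
  sample_type     = ifelse(1:30 %% 5 == 0, "QC", "Sample"),
  intensity       = 1e5 * (1 + 0.01 * (1:30)) * rlnorm(30, sdlog = 0.05)
)

# Fit LOESS to the QC injections as a function of injection order;
# surface = "direct" allows prediction outside the first/last QC injection
qc <- subset(run, sample_type == "QC")
drift_fit <- loess(intensity ~ injection_order, data = qc, span = 1,
                   control = loess.control(surface = "direct"))

# Predict the drift at every injection, divide it out,
# and rescale to the median QC intensity
predicted_drift <- predict(drift_fit, newdata = run)
run$intensity_corrected <- run$intensity / predicted_drift * median(qc$intensity)
```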

The aforementioned approaches can be used to correct signal drift within a single batch and between two or more batches. Some workflows apply total-ion-current filtering, QC-robust spline batch correction, and spectral cleaning to reduce analytical variation.
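
As a simple illustration of a between-batch adjustment, the sketch below rescales each batch of one feature to a common median. The data and names are mock assumptions, and QC-based spline or SERRF-type corrections are generally preferable in practice.

```r
# Minimal sketch of between-batch correction by median scaling (one feature)
library(dplyr)
library(tibble)

set.seed(7)
batch_data <- tibble(
  batch     = rep(c("batch_1", "batch_2"), each = 6),
  intensity = c(rlnorm(6, meanlog = 11.0, sdlog = 0.1),
                rlnorm(6, meanlog = 11.4, sdlog = 0.1))
)

overall_median <- median(batch_data$intensity)

# Rescale each batch so that its median matches the overall median
corrected <- batch_data %>%
  group_by(batch) %>%
  mutate(intensity_corrected = intensity / median(intensity) * overall_median) %>%
  ungroup()
```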

One of these approaches implemented in R, MRMkit, is an all-in-one tool for fully automated, reproducible peak integration (accounting for retention time offset patterns), data normalization, quality metrics reporting, visualizations for fast data quality evaluation, and batch correction.

Based on: G. Olshansky et al. Challenges and opportunities for prevention and removal of unwanted variation in lipidomic studies. Progress in Lipid Research (2022). DOI: https://doi.org/10.1016/j.plipres.2022.101177

Additional resources linked in this section:

  • Probabilistic Quotient Normalization as Robust Method to Account for Dilution of Complex Biological Mixtures. Application in 1H NMR Metabonomics (ACS Publications)
  • Performance Evaluation and Online Realization of Data-driven Normalization Methods Used in LC/MS based Untargeted Metabolomics Analysis (Nature)
  • mQACC – Metabolomics Quality Assurance and Quality Control Consortium
  • Lipidomics Standards Initiative (lipidomicstandards.org)
  • Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting (Taylor & Francis)
  • Systematic Error Removal Using Random Forest for Normalizing Large-Scale Untargeted Lipidomics Data (ACS Publications)
  • Characterising and correcting batch variation in an automated direct infusion mass spectrometry (DIMS) metabolomics workflow (SpringerLink)
  • MRMkit: Automated Data Processing for Large-Scale Targeted Metabolomics Analysis (ACS Publications)