
Data normalization to the internal standards (advanced)

A part of data transformation & normalization


Normalizing to internal standards (IS) in mass spectrometry is crucial for ensuring accurate and reliable quantification of analytes. Ideally, we would use an IS for every analyte we measure; however, this is not feasible in large screening studies such as lipidomics. Instead, we use (at least) one deuterated IS for every lipid class we measure. Free software solutions for deisotoping and calculating lipid concentrations exist, e.g., LipidQuant. You can read more about it in the following article:

LipidQuant 1.0: automated data processing in lipid class separation–mass spectrometry quantitative workflows (OUP Academic) – an article about LipidQuant, the software for the quantitation of lipids, by D. Wolrab et al.

Internal standards serve as reference points, helping to minimize technical variability, compensate for matrix effects, and correct for losses of analytes during sample preparation and fluctuations in the instrument response. These standards are added in the same amount to every sample at the beginning of the sample preparation process. You can find more about the importance of normalizing signals to internal standards, e.g., on the website of the Lipidomics Standards Initiative:

Lipid Species Quantification (lipidomicstandards.org) – the Lipidomics Standards Initiative page on the quantitation of lipid species.

Or in the following article:

Lipidomics needs more standardization (Nature Metabolism) – the Lipidomics Standards Initiative on the quantitation of lipid species, published in Nature Metabolism.

In the following section, we provide a simple script based on the tidyverse package in R to normalize your data to the IS.

Data preparation

First, we call the tidyverse collection, specify the path to the working directory, and load the example dataset:

# Load libraries
library(tidyverse)

# Define the name of the project and working directory
path <- "D:/Data analysis/"
setwd(path)

# Load data (the file should be in the working directory)
data <- read_csv("Normalization_IS_dataset.csv")

The input dataset should contain analytes/lipids in columns and samples in rows, with the first column containing sample identifiers, as shown in the picture below.
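As a rough sketch, the expected layout might look like the hypothetical example below (the sample identifiers, lipid names, and intensity values are purely illustrative; your own column names will differ):

# A minimal, hypothetical sketch of the expected input layout
example_input <- tribble(
  ~Sample,     ~`PC(34:1)`, ~`PC(36:2)`, ~`PC(15:0/18:1-d7)_IS`,
  "Sample_01",      152034,       98012,                  50211,
  "Sample_02",      148920,      101455,                  49876
)
example_input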

To work with the data further, we need to transform it into a tidy (long) table format. We will use the gather() function for this.

# Prepare tidy table
dataX <- gather(data, 2:ncol(data), key = 'Name', value = 'Intensity')
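gather() still works but has been superseded in tidyr; if you prefer the newer interface, pivot_longer() produces an equivalent tidy table (a sketch, assuming the sample identifiers are in the first column):

# Equivalent reshaping with pivot_longer()
dataX <- pivot_longer(data,
                      cols = -1,               # all columns except the first (Sample)
                      names_to = "Name",
                      values_to = "Intensity")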

We will need additional information for our calculations: which analytes are internal standards and what lipid classes they belong to.

It's important to note that in this example, we expect to use one internal standard for every lipid class. However, there may be situations where we use more than one IS per group. In such cases, we must specify subgroups within the class and associate specific standards accordingly.

In this dataset, the names of all internal standards contain the pattern "_IS". We strongly encourage using a specific annotation pattern for internal standards during analysis to help identify them in large datasets. Here, we use the grep() function to find the internal standards and create a new column named "IS", containing logical TRUE/FALSE information about the type of analyte.

# Find internal standards and flag them in a new logical column "IS"
dataX$IS <- NA
dataX[grep("_IS", dataX$Name), ]$IS <- TRUE
dataX$IS[is.na(dataX$IS)] <- FALSE
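The same logical flag can also be created in a single step with mutate() and str_detect() from stringr (a sketch, assuming the same "_IS" naming pattern):

# Alternative: flag internal standards in one step
dataX <- dataX %>%
  mutate(IS = str_detect(Name, "_IS"))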

The last part of the data preparation involves extracting information about the lipid class. This can be done in several ways, typically depending on the type of lipid annotation your lab uses. In this example, we use a simple approach: extracting everything before the "(" symbol. If you require a different approach, we recommend checking the chapter in this Gitbook (Metabolites and Lipids Univariate Statistics in R -> Graphical representation of univariate statistics -> Lipid maps and acyl-chain plots), where a detailed method for extracting annotations from lipid names is provided.

# Extract info about lipid classes
extract_before <- function(pattern, text) {
  result <- sub(paste0("(.*)", pattern, ".*"), "\\1", text)
  return(result)
}

pattern <- "\\("  # pattern to match - "(" in this case
dataX$LipClass <- extract_before(pattern, dataX$Name)

# Check that the classes were correctly extracted
unique(dataX$LipClass)
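If you prefer stringr, the same class label can be obtained with str_extract() (a sketch; for names containing a single "(" it gives the same result as extract_before()):

# Alternative: take everything before the first "(" with str_extract()
dataX <- dataX %>%
  mutate(LipClass = str_extract(Name, "^[^(]+"))

unique(dataX$LipClass)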

Calculate concentration

After preparing our table, we can proceed with the calculations. First, we write a function that divides the intensity of each analyte by the intensity of the corresponding IS. If the function does not find an internal standard in a group, it leaves the intensity unchanged.

# Prepare a function that performs the calculation for one group of data
normalize_intensity_IS <- function(data) {
  if (TRUE %in% data$IS) {
    data$Intensity_IS <- data$Intensity / data$Intensity[data$IS == TRUE]
  } else {
    data$Intensity_IS <- data$Intensity
  }
  return(data)
}

To perform the calculations, we split our data into a list of small data frames, grouped by Sample and lipid class (LipClass).

# Split data into a list of Sample/LipClass groups
grouped_data <- split(dataX, list(dataX$Sample, dataX$LipClass))

Finally, we apply the prepared function normalize_intensity_IS() to each element of grouped_data using lapply(). The results are then combined back into a single table: do.call() calls rbind() (which binds rows together) on the list data_normalized, effectively stacking all the group results into one data frame.

# Divide each analyte by its internal standard, then recombine the groups
data_normalized <- lapply(grouped_data, normalize_intensity_IS)
data_normalized <- do.call(rbind, data_normalized)
rownames(data_normalized) <- NULL
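The split/lapply/rbind steps above can also be written as a single dplyr pipeline with group_by() and mutate(). This is only a sketch of an equivalent approach; it assumes at most one internal standard per Sample/lipid-class group:

# Sketch of an equivalent grouped-mutate normalization
data_normalized_alt <- dataX %>%
  group_by(Sample, LipClass) %>%
  mutate(Intensity_IS = if (any(IS)) Intensity / Intensity[IS][1] else Intensity) %>%
  ungroup()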

At this point, we have normalized the signal intensities. We can quickly check that the function performed correctly by examining the Intensity_IS column in the data_normalized table: for every internal standard, Intensity_IS should equal exactly 1.
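A quick way to verify this programmatically (a short sketch):

# Sanity check: every internal standard should have Intensity_IS equal to 1
data_normalized %>%
  filter(IS == TRUE) %>%
  summarise(all_IS_equal_one = all(Intensity_IS == 1))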

To calculate the concentrations of analytes based on the IS concentration, we need to provide additional data in the form of a .csv table. This table stores the concentration of every IS used in the batch. The concentration values are analysis-specific and should be confirmed with an analytical chemist.

This table should contain the columns LipClass and IS_conc. In the next step, we merge the tables by the LipClass column to add the concentration information to every group. Then, we simply multiply the Intensity_IS column by the newly added IS_conc column.
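As a sketch, the concentration table could look like the hypothetical example below (the classes and values are purely illustrative; use the concentrations of your own IS mixture):

# Hypothetical layout of Concentration.csv (illustrative values only)
conc_example <- tribble(
  ~LipClass, ~IS_conc,
  "PC",           5.0,
  "PE",           2.5,
  "TG",          10.0
)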

# Load the concentration table into the environment
conc <- read_csv("Concentration.csv")

# Merge the concentration table with data_normalized
data_normalized <- merge(data_normalized,
                         conc[c("LipClass", "IS_conc")],
                         by = "LipClass")

# Multiply normalized signal intensities by the concentration of the IS
data_normalized$Concentration_nmol_ml <- data_normalized$Intensity_IS * data_normalized$IS_conc
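If you prefer to stay within the tidyverse, left_join() and mutate() can replace the merge() call and the base-R multiplication above (a sketch; note that left_join() keeps rows whose LipClass has no match in conc, whereas merge() drops them):

# Alternative to merge(): join the IS concentrations and compute concentrations
data_normalized <- data_normalized %>%
  left_join(conc[c("LipClass", "IS_conc")], by = "LipClass") %>%
  mutate(Concentration_nmol_ml = Intensity_IS * IS_conc)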

Outputs

To prepare the output table, we exclude the internal standards, select the columns we need, reshape the data back into the wide format, and save the result to a CSV file.

# Reshape data: wide format with lipids in columns, internal standards removed
data_final <- data_normalized %>%
  filter(IS == FALSE) %>%
  select(Sample, Name, Concentration_nmol_ml) %>%
  spread(key = Name, value = Concentration_nmol_ml)

# Save the reshaped data to a file
write.csv(data_final, file = "Data_IS_normalized.csv", row.names = FALSE)
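spread() has likewise been superseded in tidyr; an equivalent reshaping with pivot_wider() would look like this (a sketch):

# Equivalent reshaping with pivot_wider()
data_final <- data_normalized %>%
  filter(IS == FALSE) %>%
  select(Sample, Name, Concentration_nmol_ml) %>%
  pivot_wider(names_from = Name, values_from = Concentration_nmol_ml)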
Example files used in this section:

Normalization_IS_dataset.csv (1 MB) – the data set for normalization of intensities to internal standards.
demo_data_IS.csv (1 MB) – the example data set for normalization to internal standards.
Concentration.csv (372 B) – the example table containing the concentrations of the internal standards.
Figure captions:
Preview of the input data table.
After reshaping, the table should have three columns: Sample, Name, Intensity.
The list consists of data grouped by Sample and lipid class; a preview of one such group is shown in the figure.
Preview of the table containing the concentrations of the internal standards (this table has to be provided separately).
The output of this script is a data frame with the same layout as the input table.