
Data normalization to the internal standards (advanced)

A part of data transformation & normalization


Normalizing to internal standards (IS) in mass spectrometry is crucial for ensuring accurate and reliable quantification of analytes. Ideally, we would use an IS for every analyte we measure; however, this is not feasible in large screening studies such as lipidomics. Instead, we use (at least) one deuterated IS for every lipid class we measure. Free software solutions for deisotoping and calculating lipid concentrations exist, e.g., LipidQuant. You can read more about it in the following article:

LipidQuant 1.0: automated data processing in lipid class separation–mass spectrometry quantitative workflows (OUP Academic) – an article about LipidQuant, the software for the quantitation of lipids, by D. Wolrab et al.

Internal standards serve as reference points, helping to minimize technical variability, compensate for matrix effects, and correct for losses of analytes during sample preparation and fluctuations in the instrument response. These standards are added in the same amount to every sample at the beginning of the sample preparation process. You can find more about the importance of normalizing signals to internal standards, e.g., on the website of the Lipidomics Standards Initiative:

Lipid Species Quantification (lipidomicstandards.org) – the Lipidomics Standards Initiative page on the quantitation of lipid species.

Or in the following article:

Lipidomics needs more standardization (Nature Metabolism) – the Lipidomics Standards Initiative on the quantitation of lipid species, published in Nature Metabolism.

In the following section, we provide a simple script based on the tidyverse package in R to normalize your data to the IS.

Data preparation

First, we call the tidyverse collection, specify the path to the working directory, and load the example dataset:

# Load libraries
library(tidyverse)

# Define the name of the project and working directory
path <- "D:/Data analysis/"
setwd(path)

# Load data (the file should be in the working directory)
data <- read_csv("Normalization_IS_dataset.csv")

The input dataset should contain analytes/lipids in columns and samples in rows, with the first column containing sample identifiers, as shown in the picture below.
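As a rough sketch, the expected layout might look like the hypothetical example below (the sample identifiers, lipid names, and intensity values are purely illustrative; your own column names will differ):

# A minimal, hypothetical sketch of the expected input layout
example_input <- tribble(
  ~Sample,     ~`PC(34:1)`, ~`PC(36:2)`, ~`PC(15:0/18:1-d7)_IS`,
  "Sample_01",      152034,       98012,                  50211,
  "Sample_02",      148920,      101455,                  49876
)
example_input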

To work with the data further, we need to transform it into a tidy (long) table format. We will use the gather() function for this.

# Prepare tidy table
dataX <- gather(data, 2:ncol(data), key = 'Name', value = 'Intensity')
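gather() still works but has been superseded in tidyr; if you prefer the newer interface, pivot_longer() produces an equivalent tidy table (a sketch, assuming the sample identifiers are in the first column):

# Equivalent reshaping with pivot_longer()
dataX <- pivot_longer(data,
                      cols = -1,               # all columns except the first (Sample)
                      names_to = "Name",
                      values_to = "Intensity")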

We will need additional information for our calculations: which analytes are internal standards and what lipid classes they belong to.

It's important to note that in this example, we expect to use one internal standard for every lipid class. However, there may be situations where we use more than one IS per group. In such cases, we must specify subgroups within the class and associate specific standards accordingly.

In this dataset, the names of all internal standards contain the pattern "_IS". We strongly encourage using a specific annotation pattern for internal standards during analysis to help identify them in large datasets. Here, we use the grep() function to find the internal standards and create a new column named "IS", containing logical TRUE/FALSE information about the type of analyte.

# Find internal standards and flag them in a new logical column "IS"
dataX$IS <- NA
dataX[grep("_IS", dataX$Name), ]$IS <- TRUE
dataX$IS[is.na(dataX$IS)] <- FALSE
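The same logical flag can also be created in a single step with mutate() and str_detect() from stringr (a sketch, assuming the same "_IS" naming pattern):

# Alternative: flag internal standards in one step
dataX <- dataX %>%
  mutate(IS = str_detect(Name, "_IS"))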

The last part of the data preparation involves extracting information about the lipid class. This can be done in several ways, typically depending on the type of lipid annotation your lab uses. In this example, we use a simple approach: extracting everything before the "(" symbol. If you require a different approach, we recommend checking the chapter in this Gitbook (Metabolites and Lipids Univariate Statistics in R -> Graphical representation of univariate statistics -> Lipid maps and acyl-chain plots), where a detailed method for extracting annotations from lipid names is provided.

# Extract info about lipid classes
extract_before <- function(pattern, text) {
  result <- sub(paste0("(.*)", pattern, ".*"), "\\1", text)
  return(result)
}

pattern <- "\\("  # pattern to match - "(" in this case
dataX$LipClass <- extract_before(pattern, dataX$Name)

# Check that the classes were correctly extracted
unique(dataX$LipClass)
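If you prefer stringr, the same class label can be obtained with str_extract() (a sketch; for names containing a single "(" it gives the same result as extract_before()):

# Alternative: take everything before the first "(" with str_extract()
dataX <- dataX %>%
  mutate(LipClass = str_extract(Name, "^[^(]+"))

unique(dataX$LipClass)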

Calculate concentration

After preparing our table, we can proceed with the calculations. First, we write a function that divides the intensity of each analyte by the intensity of the corresponding IS. If the function does not find an internal standard in a group, it leaves the intensity unchanged.

# Prepare a function that performs the calculation for one group of data
normalize_intensity_IS <- function(data) {
  if (TRUE %in% data$IS) {
    data$Intensity_IS <- data$Intensity / data$Intensity[data$IS == TRUE]
  } else {
    data$Intensity_IS <- data$Intensity
  }
  return(data)
}

To perform the calculations, we split our data into a list of small data frames, grouped by Sample and lipid class (LipClass).

# Split data into a list of Sample/LipClass groups
grouped_data <- split(dataX, list(dataX$Sample, dataX$LipClass))

Finally, we apply the prepared function normalize_intensity_IS() to each element of grouped_data using lapply(). The results are then combined back into a single table: do.call() calls rbind() (which binds rows together) on the list data_normalized, effectively stacking all the group results into one data frame.

# Divide each analyte by its internal standard, then recombine the groups
data_normalized <- lapply(grouped_data, normalize_intensity_IS)
data_normalized <- do.call(rbind, data_normalized)
rownames(data_normalized) <- NULL
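The split/lapply/rbind steps above can also be written as a single dplyr pipeline with group_by() and mutate(). This is only a sketch of an equivalent approach; it assumes at most one internal standard per Sample/lipid-class group:

# Sketch of an equivalent grouped-mutate normalization
data_normalized_alt <- dataX %>%
  group_by(Sample, LipClass) %>%
  mutate(Intensity_IS = if (any(IS)) Intensity / Intensity[IS][1] else Intensity) %>%
  ungroup()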

At this point, we have normalized the signal intensities. We can quickly check that the function performed correctly by examining the Intensity_IS column in the data_normalized table: for every internal standard, Intensity_IS should equal exactly 1.
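A quick way to verify this programmatically (a short sketch):

# Sanity check: every internal standard should have Intensity_IS equal to 1
data_normalized %>%
  filter(IS == TRUE) %>%
  summarise(all_IS_equal_one = all(Intensity_IS == 1))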

To calculate the concentrations of analytes based on the IS concentration, we need to provide additional data in the form of a .csv table. This table stores the concentration of every IS used in the batch. The concentration values are analysis-specific and should be confirmed with an analytical chemist.

This table should contain the columns LipClass and IS_conc. In the next step, we merge the tables by the LipClass column to add the concentration information to every group. Then, we simply multiply the Intensity_IS column by the newly added IS_conc column.
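As a sketch, the concentration table could look like the hypothetical example below (the classes and values are purely illustrative; use the concentrations of your own IS mixture):

# Hypothetical layout of Concentration.csv (illustrative values only)
conc_example <- tribble(
  ~LipClass, ~IS_conc,
  "PC",           5.0,
  "PE",           2.5,
  "TG",          10.0
)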

# Load the concentration table into the environment
conc <- read_csv("Concentration.csv")

# Merge the concentration table with data_normalized
data_normalized <- merge(data_normalized,
                         conc[c("LipClass", "IS_conc")],
                         by = "LipClass")

# Multiply normalized signal intensities by the concentration of the IS
data_normalized$Concentration_nmol_ml <- data_normalized$Intensity_IS * data_normalized$IS_conc
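If you prefer to stay within the tidyverse, left_join() and mutate() can replace the merge() call and the base-R multiplication above (a sketch; note that left_join() keeps rows whose LipClass has no match in conc, whereas merge() drops them):

# Alternative to merge(): join the IS concentrations and compute concentrations
data_normalized <- data_normalized %>%
  left_join(conc[c("LipClass", "IS_conc")], by = "LipClass") %>%
  mutate(Concentration_nmol_ml = Intensity_IS * IS_conc)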

Outputs

To prepare the output table, we exclude the internal standards, select the columns we need, reshape the data back into the wide format, and save the result to a CSV file.

# Reshape data: wide format with lipids in columns, internal standards removed
data_final <- data_normalized %>%
  filter(IS == FALSE) %>%
  select(Sample, Name, Concentration_nmol_ml) %>%
  spread(key = Name, value = Concentration_nmol_ml)

# Save the reshaped data to a file
write.csv(data_final, file = "Data_IS_normalized.csv", row.names = FALSE)
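spread() has likewise been superseded in tidyr; an equivalent reshaping with pivot_wider() would look like this (a sketch):

# Equivalent reshaping with pivot_wider()
data_final <- data_normalized %>%
  filter(IS == FALSE) %>%
  select(Sample, Name, Concentration_nmol_ml) %>%
  pivot_wider(names_from = Name, values_from = Concentration_nmol_ml)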
Example files used in this section:

Normalization_IS_dataset.csv (1 MB) – the data set for normalization of intensities to internal standards.
demo_data_IS.csv (1 MB) – the example data set for normalization to internal standards.
Concentration.csv (372 B) – the example table containing the concentrations of the internal standards.
Figure captions:
Preview of the input data table.
After reshaping, the table should have three columns: Sample, Name, Intensity.
The list consists of data grouped by Sample and lipid class; a preview of one such group is shown in the figure.
Preview of the table containing the concentrations of the internal standards (this table has to be provided separately).
The output of this script is a data frame with the same layout as the input table.