Data normalization to the internal standards (advanced)
A part of data transformation & normalization
Normalizing to internal standards (IS) in mass spectrometry is crucial for accurate and reliable quantification of analytes. Ideally, we would use an IS for every analyte we measure; however, this is not feasible in large screening studies such as lipidomics. Instead, we use (at least) one deuterated IS for every lipid class we measure. Free software for deisotoping and calculating lipid concentrations exists, e.g., LipidQuant. You can read more about it in the following article:
Article about LipidQuant, the software for the quantitation of lipids, by D. Wolrab et al.
Internal standards serve as reference points, helping to minimize technical variability, compensate for matrix effects, and correct for losses of analytes during sample preparation and fluctuations in the instrument response. These standards are added in the same amount to every sample at the beginning of the sample preparation process. You can find more about the importance of normalizing signals to internal standards, e.g., on the website of Lipidomics Standards Initiative:
The data set for normalization of intensities to internal standards.
The input dataset should contain analytes/lipids in columns and samples in rows, with the first column containing sample identifiers, as shown in the picture below.
Preview of the input data table
To work with the data further, we need to transform it into a tidy (long) table. We will use the gather() function for this.
The table should now have three columns: Sample, Name, and Intensity.
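The reshaping step can be sketched on a small made-up table. The lipid names and intensities below are invented for illustration, and the first column is assumed to be named Sample. The sketch uses pivot_longer(), which the tidyr documentation recommends for new code in place of the superseded gather(); both produce the same three-column layout.

```r
library(tidyr)

# Hypothetical wide table: one row per sample, one column per analyte
wide <- data.frame(
  Sample = c("S1", "S2"),
  `PC(34:1)` = c(1000, 1200),
  `PC(15:0/18:1-d7)_IS` = c(500, 600),
  check.names = FALSE  # keep the lipid names exactly as written
)

# Every column except Sample is stacked into Name/Intensity pairs
long <- pivot_longer(wide, cols = -Sample,
                     names_to = "Name", values_to = "Intensity")
print(long)
```

Each sample now contributes one row per analyte, which is the form the later grouping steps rely on.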
We will need additional information for our calculations: which analytes are internal standards and what lipid classes they belong to.
It's important to note that in this example, we expect to use one internal standard for every lipid class. However, there may be situations where we use more than one IS per group. In such cases, we must specify subgroups within the class and associate specific standards accordingly.
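One possible way to handle several internal standards per class is sketched below. It assumes an extra, hypothetical column IS_group that links each analyte to one specific standard; the lipid names, intensities, and subgroup labels are invented for illustration and are not part of the example dataset.

```r
# Hypothetical long table for one sample with two IS in the TG class
d <- data.frame(
  Sample    = "S1",
  Name      = c("TG(48:1)", "TG(48:0-d5)_IS", "TG(54:3)", "TG(54:0-d5)_IS"),
  Intensity = c(200, 100, 900, 300),
  IS        = c(FALSE, TRUE, FALSE, TRUE),
  # Hypothetical subgroup label linking each analyte to one specific IS
  IS_group  = c("TG_short", "TG_short", "TG_long", "TG_long")
)

# Split by Sample and subgroup so that each piece contains exactly one IS
pieces <- split(d, list(d$Sample, d$IS_group), drop = TRUE)

normalized <- do.call(rbind, lapply(pieces, function(g) {
  stopifnot(sum(g$IS) == 1)  # guard: exactly one IS per subgroup
  g$Intensity_IS <- g$Intensity / g$Intensity[g$IS]
  g
}))
rownames(normalized) <- NULL
print(normalized)
```

The guard matters because dividing by more than one IS intensity at once would recycle the values and silently give wrong ratios.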
In this dataset, the names of all internal standards contain the pattern "_IS". We strongly encourage using a specific annotation pattern for internal standards during analysis to help identify them in large datasets. Here, we use the grep() function to find the internal standards and create a new column named "IS", containing logical TRUE/FALSE information about the type of analyte.
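The flagging step can be illustrated on a few made-up names (the names below are assumptions, not taken from the example dataset). The sketch also shows grepl(), a close relative of grep() that returns a logical vector directly, which may be convenient when filling a TRUE/FALSE column.

```r
# Hypothetical analyte names; "_IS" marks the internal standards
Name <- c("PC(34:1)", "PC(15:0/18:1-d7)_IS", "SM(34:1)", "SM(18:1-d9)_IS")

# grepl() returns TRUE/FALSE for every element, so it can fill a column directly
IS <- grepl("_IS", Name, fixed = TRUE)

# grep() returns the positions of the matching elements instead
positions <- grep("_IS", Name, fixed = TRUE)

print(data.frame(Name, IS))
print(positions)
```

With fixed = TRUE the pattern is matched literally rather than as a regular expression.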
The last part of the data preparation involves extracting information about the lipid class. This can be done in several ways, typically depending on the type of lipid annotation your lab uses. In this example, we use a simple approach and extract everything before the symbol "(". If you require a different approach, we recommend checking the chapter in this Gitbook (Metabolites and Lipids Univariate Statistics in R -> Graphical representation of univariate statistics -> Lipid maps and acyl chain plots), where a detailed method of extracting annotations from lipid names is provided.
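A minimal sketch of this extraction on a few made-up names follows. It uses a single sub() call that removes the first "(" and everything after it, which is one possible alternative to a helper function; the names, including the one with nested parentheses, are assumptions for illustration.

```r
# Hypothetical lipid names, including one with nested parentheses
Name <- c("PC(34:1)", "Cer(d18:1(4E)/16:0)", "PC(15:0/18:1-d7)_IS")

# Remove the first "(" and everything after it
LipClass <- sub("\\(.*", "", Name)
print(LipClass)
```

Note that the internal standard ends up in the same class as the analytes it is meant to normalize, which is what the later grouping by lipid class requires.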
Calculate concentration
After preparing our table, we can proceed with the calculations. First, we will prepare a function that divides the intensity of an analyte by the intensity of a corresponding IS. When the function does not find the internal standard, it leaves the intensity as it was.
We will split our data into a list of small groups to perform the calculations. The data are grouped by Sample and Lipid Class.
Each element of the list holds the data for one lipid class in one sample. A preview of one such group is shown in the figure.
Finally, we apply the prepared function normalize_intensity_IS() to every element of grouped_data with lapply(), which returns a list of normalized groups. We then use do.call() to call rbind() on that list, which binds the rows of all groups together into a single data frame, data_normalized.
At this point, we have normalized the signal intensities. We can quickly check if the function performed correctly by examining the Intensity_IS column in the data_normalized table, where the IS should always equal 1.
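The whole split, normalize, and combine sequence, together with this check, can be sketched on a small made-up table. The lipid names and intensities are invented for illustration; the function body follows the logic described in this chapter.

```r
# Hypothetical tidy table: two samples, one lipid class, one IS per class
dataX <- data.frame(
  Sample    = rep(c("S1", "S2"), each = 3),
  Name      = rep(c("PC(34:1)", "PC(36:2)", "PC(15:0/18:1-d7)_IS"), times = 2),
  Intensity = c(1000, 250, 500, 1200, 300, 600),
  IS        = rep(c(FALSE, FALSE, TRUE), times = 2),
  LipClass  = "PC"
)

# Divide by the IS intensity if the group has an IS, otherwise keep the intensity
normalize_intensity_IS <- function(data) {
  if (TRUE %in% data$IS) {
    data$Intensity_IS <- data$Intensity / data$Intensity[data$IS == TRUE]
  } else {
    data$Intensity_IS <- data$Intensity
  }
  return(data)
}

# One list element per Sample x Lipid Class combination
grouped_data <- split(dataX, list(dataX$Sample, dataX$LipClass))
data_normalized <- do.call(rbind, lapply(grouped_data, normalize_intensity_IS))
rownames(data_normalized) <- NULL

# Sanity check: every internal standard divided by itself must give 1
is_values <- data_normalized$Intensity_IS[data_normalized$IS]
print(is_values)
stopifnot(all(abs(is_values - 1) < 1e-12))
```

If the check fails, a likely cause is a group that contains more than one internal standard.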
To calculate the concentration of analytes based on the IS concentration, we need to input additional data in the form of a .csv table. This table stores the concentration of every IS used in the batch. The concentration values are analysis-specific and should be confirmed with an analytical chemist.
Preview of the table containing the concentration of the internal standards. This table has to be provided separately.
This table should contain the columns LipClass and IS_conc. In the next step, we will merge the tables by the column LipClass to add the concentration information to every group. Then, we will simply multiply the column Intensity_IS by the newly added column IS_conc.
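The merge and multiplication can be sketched with two small made-up tables. All values are invented, and the resulting concentration simply carries whatever unit IS_conc is given in. Because merge() keeps only rows whose LipClass appears in both tables by default, the sketch first lists classes that have no IS concentration; this check is a suggestion, not part of the original script.

```r
# Hypothetical normalized data (ratios to the IS) for three lipid classes
data_normalized <- data.frame(
  Sample       = "S1",
  Name         = c("PC(34:1)", "SM(34:1)", "LPC(16:0)"),
  LipClass     = c("PC", "SM", "LPC"),
  Intensity_IS = c(2.0, 0.5, 1.5)
)

# Hypothetical concentration table; LPC is deliberately missing
conc <- data.frame(
  LipClass = c("PC", "SM"),
  IS_conc  = c(10, 4)
)

# Classes without an IS concentration would be dropped by merge()
missing_classes <- setdiff(unique(data_normalized$LipClass), conc$LipClass)
print(missing_classes)

merged <- merge(data_normalized, conc, by = "LipClass")
merged$Concentration <- merged$Intensity_IS * merged$IS_conc
print(merged)
```

Here the LPC row disappears from the merged table, so an empty missing_classes result is worth confirming before trusting the output.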
The example data set containing concentration of internal standards.
Outputs
To prepare the output table, we exclude the internal standards, select the columns we need, reshape the data back into the wide format, and save the result to a CSV file.
The output of this script is a data frame with the same layout as the input table: samples in rows and analytes in columns.
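The reshaping back to the wide layout can be sketched on a small made-up table (names and concentrations are invented for illustration). The sketch uses pivot_wider(), which the tidyr documentation recommends for new code in place of the superseded spread().

```r
library(tidyr)

# Hypothetical long table after the concentration calculation (IS already removed)
long <- data.frame(
  Sample = c("S1", "S1", "S2", "S2"),
  Name   = c("PC(34:1)", "SM(34:1)", "PC(34:1)", "SM(34:1)"),
  Concentration_nmol_ml = c(20, 2, 24, 3)
)

# One row per sample, one column per analyte
wide <- pivot_wider(long, names_from = Name, values_from = Concentration_nmol_ml)
print(wide)
```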
# Load libraries
library(tidyverse)
# Define the name of the project and working directory
path <- "D:/Data analysis/"
setwd(path)
# Load data {should be in the working directory}
data <- read_csv("Normalization_IS_dataset.csv")
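# Transform the data into a tidy (long) table: Sample, Name, Intensity
# (assumes the first column of the input table is named Sample)
dataX <- data %>%
  gather(key = "Name", value = "Intensity", -Sample)
# Flag internal standards: TRUE if the name contains "_IS"
dataX$IS <- FALSE
dataX$IS[grep("_IS", dataX$Name)] <- TRUE
# Extract info about Lipid Classes
extract_before <- function(pattern, text) {
  # Greedy match: keeps everything before the last occurrence of the pattern
  result <- sub(paste0("(.*)", pattern, ".*"), "\\1", text)
  return(result)
}
pattern <- "\\(" # Pattern to match: "(" in this case
dataX$LipClass <- extract_before(pattern, dataX$Name)
# Second pass removes a further "(" if a name contains nested parentheses
dataX$LipClass <- extract_before(pattern, dataX$LipClass)
# Check if Classes are correctly extracted
unique(dataX$LipClass)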
# Prepare function that will perform the calculation
normalize_intensity_IS <- function(data) {
  if (TRUE %in% data$IS) {
    # Divide every intensity in the group by the intensity of its IS
    data$Intensity_IS <- data$Intensity / data$Intensity[data$IS == TRUE]
  } else {
    # No IS in the group: keep the original intensity
    data$Intensity_IS <- data$Intensity
  }
  return(data)
}
# Split data into grouped list
grouped_data <- split(dataX, list(dataX$Sample, dataX$LipClass))
# Divide Analyte by Internal standard
data_normalized <- lapply(grouped_data, normalize_intensity_IS)
data_normalized <- do.call(rbind, data_normalized)
rownames(data_normalized) <- NULL
# Load the concentration table to the environment
conc <- read_csv("Concentration.csv")
# Merge the concentration table with data_normalized
data_normalized <- merge(data_normalized,
                         conc[c("LipClass", "IS_conc")],
                         by = "LipClass")
# Multiply normalized signal intensities with the concentration of IS
data_normalized$Concentration_nmol_ml <- data_normalized$Intensity_IS * data_normalized$IS_conc
# Reshape Data
data_final <- data_normalized %>%
  filter(IS == FALSE) %>%
  select(Sample, Name, Concentration_nmol_ml) %>%
  spread(key = Name, value = Concentration_nmol_ml)
# Save reshaped data to a file
write.csv(data_final, file = "Data_IS_normalized.csv", row.names = FALSE)