Batch effect corrections in R (advanced)

Part of data preprocessing, alongside transformation & normalization

Introduction

In this section, we will discuss the use of QC-based algorithms for batch correction. By using QC sample data, we can model and correct for systematic measurement bias. When the measured response for a particular metabolite or lipid is plotted against injection order (excluding conditioning samples and blanks), time-related systematic variation is often visible. This systematic error can arise from non-enzymatic metabolite conversion (e.g. degradation, oxidation or hydrolysis) of samples waiting in the autosampler, from changes in chromatography (such as retention time or peak shape), or from interaction of sample components with the surfaces of the chromatography system and MS instrumentation (e.g. column, cones, ion skimmers, transfer capillaries), all of which can influence the measured response. We advise you to read the article by David Broadhurst et al. on guidelines and recommendations for the use of system suitability and quality control samples in metabolomics:
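To make this drift concrete, the short sketch below (synthetic data only, not the demo dataset used later in this section) simulates a slowly decaying metabolite response and plots it against injection order:

# Illustration on synthetic data: simulate a response that decays over
# the analytical sequence and plot it against injection order
library(tidyverse)

set.seed(1)
run <- tibble(
  order     = 1:40,
  intensity = 1e5 * exp(-0.01 * (1:40)) + rnorm(40, sd = 3e3)
)

ggplot(run, aes(x = order, y = intensity)) +
  geom_point(size = 2) +
  labs(x = "Injection order", y = "Measured response",
       title = "Simulated time-related drift")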

In the following sections, we will focus on two commonly used algorithms: locally estimated scatterplot smoothing (LOESS) and systematic error removal using random forest (SERRF).

Locally estimated scatterplot smoothing (LOESS)

Locally Estimated Scatterplot Smoothing (LOESS) is a widely used QC-based normalization algorithm in metabolomics and lipidomics, designed to correct for systematic biases and variability in high-throughput analytical data. Its strength in this context lies in its ability to adjust for non-linear trends across the dataset, ensuring that measurements remain consistent and reliable over the course of the experimental runs. The LOESS algorithm fits smooth curves to the relationship between the measured intensities (or other relevant metrics) of the quality control (QC) samples and their position within the analytical sequence (e.g., run order). This step is critical because it addresses the instrumental drift and batch effects that commonly occur in mass spectrometry-based analyses and that can introduce significant biases if left uncorrected. For more detailed information, you can access the original statistical article by William S. Cleveland and Susan J. Devlin below:
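As a minimal illustration of the principle (a sketch on synthetic data, using base R's loess() rather than the exact code used later in this guide), a curve is fitted through the QC intensities and every injection is divided by the fitted trend:

# Conceptual sketch: fit a LOESS curve through simulated QC intensities
# and divide every injection by the fitted trend (synthetic data only)
set.seed(2)
order  <- 1:50
signal <- 1e4 * (1 + 0.01 * order) + rnorm(50, sd = 300)  # drifting responses
qc_idx <- seq(1, 50, by = 5)                              # QC every 5th injection

fit <- loess(y ~ x, data = data.frame(x = qc_idx, y = signal[qc_idx]),
             span = 0.75, degree = 2,
             control = loess.control(surface = "direct"))  # allow extrapolation
trend     <- predict(fit, newdata = data.frame(x = order))
corrected <- signal / trend * median(signal)  # drift removed, scale preserved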

Additionally, here is an article by Mónica Calderón-Santiago et al. featuring MetaboQC, a tool that uses LOESS to correct experimental variability associated with the instrumental quantitative response:

Next, we will focus on the application of LOESS for data normalization in R.

A practical guide for LOESS normalization in R

The following lines of code will show you how to perform LOESS normalization of your data using R. The demo data represent a metabolomics/lipidomics experiment with 100 variables (Met1-Met100) and 32 observations (9 controls (Con), 14 patients (Pat) and 8 QC samples (QC)). Samples are sorted by analytical sequence, with a QC sample analyzed as every 5th injection.
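If you do not have the demo file at hand, the hedged sketch below fabricates a table with the same layout (a "Compound" column plus samples in analytical-sequence order with interleaved QCs). The values are random and the sample sequence only approximates the demo description, so your numbers will differ:

# Fabricate a table shaped like the demo file: compounds in rows,
# samples in columns ordered by the analytical sequence, with a QC
# injected before every four study samples (values are random)
library(tidyverse)

set.seed(3)
study   <- c(paste0("Con", 1:9), paste0("Pat", 1:14))
blocks  <- split(study, ceiling(seq_along(study) / 4))
samples <- unlist(lapply(seq_along(blocks),
                         function(b) c(paste0("QC", b), blocks[[b]])),
                  use.names = FALSE)

demo <- matrix(abs(rnorm(100 * length(samples), mean = 1e4, sd = 2e3)),
               nrow = 100,
               dimnames = list(paste0("Met", 1:100), samples))
demo %>%
  as.data.frame() %>%
  rownames_to_column("Compound") %>%
  write_csv("demo_data.csv")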

Below you can follow the code. First, we will load the "tidyverse" package and set the experiment name, the group names and the path to our data.

library(tidyverse)

# Experiment name, group names and path to the data
name <- "Experiment"
groupnames <- c("Con", "Pat", "QC")
path <- "/Users/dominikaolesova/R/LOESS script"   # adjust to your own location
setwd(path)

#Read and convert data
data <- read_csv("demo_data.csv", col_types = cols()) %>%
  column_to_rownames(var = "Compound")

Your data loaded into R should now look like this:

Loaded demo data for LOESS

Next, we need to prepare the data for the LOESS smoothing: create the QC data matrix, record the QC injection positions and specify the groups excluding QC.

# Select columns with QC samples and create a QC data matrix
dataQC <- data %>%
  select(contains("QC"))
QCi <- grep("QC", colnames(data))   # injection positions of the QC samples

# Specify groups excluding QC
groupnames <- groupnames[!groupnames %in% "QC"]
count <- length(groupnames)

Now, we can perform the LOESS smoothing. LOESS fits low-degree polynomials to local subsets of the data by weighted least squares, giving the most weight to points near each target point.

# LOESS smoothing: fit a curve through the QC intensities of each
# compound and divide the whole row by the fitted correction curve
r <- map_df(.x = 1:nrow(data), ~ {
  dat <- data[.x, ]
  smooth <- loess.smooth(x = QCi, y = as.numeric(dataQC[.x, ]),
                         span = 0.75, degree = 2, evaluation = ncol(data),
                         family = "gaussian")
  smo <- smooth$y          # correction curve evaluated across the run
  dat / smo                # normalized intensities for this compound
})
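Before moving on, it can be instructive to inspect what loess.smooth() returns for a single compound (using the objects created above). Note that the curve is evaluated on equally spaced points between the first and last QC position, which this script takes as an approximation of the full injection range:

# Inspect the fitted QC curve for the first compound: red points are the
# QC measurements, the blue line is the correction curve
smooth1 <- loess.smooth(x = QCi, y = as.numeric(dataQC[1, ]),
                        span = 0.75, degree = 2,
                        evaluation = ncol(data), family = "gaussian")
plot(QCi, as.numeric(dataQC[1, ]), pch = 19, col = "red",
     xlab = "Injection order", ylab = "Intensity", main = rownames(data)[1])
lines(smooth1$x, smooth1$y, col = "blue", lwd = 2)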

Next, the smoothing factors must be calculated and bound to the normalized data. The smoothing factor is the ratio of the maximum to the minimum of the fitted QC curve; a negative value means the curve changes sign along the run, indicating an unreliable fit.

# Prepare table with smoothing factors (max/min of the fitted curve)
rr <- map_df(1:nrow(data), ~ {
  smooth <- loess.smooth(QCi, as.numeric(dataQC[.x, ]),
                         span = 0.75, degree = 2, evaluation = ncol(data),
                         family = "gaussian")
  tibble(sm_fact = max(smooth$y) / min(smooth$y))
})

# Bind smoothing factors with normalized data; keep the compound names
data.loess.sf <- bind_cols(rr, r) %>% as.data.frame()
rownames(data.loess.sf) <- rownames(data)

# Exclude compounds with a negative smoothing factor (sign-changing fit)
data.final <- data.loess.sf %>%
  filter(sm_fact > 0) %>%
  select(-sm_fact)
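As an optional sanity check (not part of the original script), you can compare the relative standard deviation (RSD) of the QC samples before and after normalization; it should generally decrease for well-corrected compounds:

# QC relative standard deviation (RSD, %) per compound, before vs. after
rsd <- function(x) 100 * sd(x) / mean(x)

keep       <- rownames(data.final)
rsd_before <- apply(dataQC[keep, ], 1, rsd)
rsd_after  <- apply(data.final[, QCi], 1, rsd)
summary(rsd_before)
summary(rsd_after)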

Finally, you can create a .csv file with your final normalized data and generate plots.

# Write final table into the working directory (keeping compound names)
data.final %>%
  rownames_to_column("Compound") %>%
  write_csv(paste0(name, "_loess.csv"))

# Save plots into a /Results directory
results_dir <- file.path(path, "Results")
dir.create(results_dir, showWarnings = FALSE)
setwd(results_dir)

tps = c("Raw", "Smooth")

for (tp in tps) {
  if (tp == "Raw") {
    df <- data # for raw data
    data.x <- df %>% 
      t() %>% as.data.frame() %>%
      rownames_to_column("sample") %>%
      rownames_to_column("order") %>%
      gather("compound", "intensity", 3:(nrow(df)+2))
  } else {
    df <- data.final # for smoothed data
    data.x <- df %>% 
      t() %>% as.data.frame() %>%
      rownames_to_column("sample") %>%
      rownames_to_column("order") %>%
      gather("compound", "intensity", 3:(nrow(df)+2))
  }
  ord <- unique(data.x$sample)
  
  groups <- c("QC",groupnames)
  data.x$group <- NA
  for (j in groups){
    data.x$group[grep(j, data.x$sample)] <- j
  }
  
  # Plot each compound and save the figure as a PDF
  compounds <- rownames(df)
  for (i in compounds) {
    p <- data.x %>%
      filter(compound == i) %>%
      ggplot(aes(x = factor(sample, levels = ord), y = intensity,
                 color = factor(group, levels = groups))) +
      geom_point(size = 3) +
      scale_color_manual(values = c("red", "blue", "orange", "violet",
                                    "lightblue", "darkgreen", "pink")) +
      labs(x = "Samples", y = NULL, title = i) +
      guides(color = guide_legend(title = "Group")) +
      theme_linedraw() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

    # ggplot objects created inside a loop are not auto-printed, so pass
    # the plot to ggsave() explicitly
    ggsave(paste0(i, "_", tp, ".pdf"), plot = p)
  }
}

A "Results" directory will be created and for each variable (metabolite/lipid) a plot will be generated and saved in .pdf showing raw and normalized data shown as points (intensity, area or different value of the variable on the y-axis and order of analytical sequence with labels on x-axis) with an example shown below.

Data as values on the y-axis and samples labelled and ordered by the analytical sequence on the x-axis are shown for raw data (left) and normalized data by LOESS (right).

Systematic error removal using random forest (SERRF)

A newly developed QC-based normalization algorithm, called systematic error removal using random forest (SERRF), has been shown to effectively correct for batch effects and time-dependent drifts in large-scale metabolomics/lipidomics cohort studies. SERRF has a distinct advantage over other commonly used normalization methods in that it can utilize information from all correlating compounds when normalizing each individual lipid/metabolite. Unlike traditional QC-based approaches, SERRF recognizes that systematic errors are not only associated with batch effects and injection order, but also with the behavior of other compounds. By utilizing the RF algorithm, SERRF automatically selects correlated compounds to effectively normalize systematic errors identified by the QC samples for each individual compound. Detailed information about the SERRF algorithm can be found in the original publication by Sili Fan et al.:

SERRF is also available as a free web-based tool:

Below we will demonstrate the basics of SERRF using a demo dataset via the web-based environment. For detailed information regarding the R-code behind the SERRF normalization, you can access it here:
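To build some intuition before using the web tool, here is a heavily simplified conceptual sketch of the SERRF idea in R. It is not the authors' implementation: the published algorithm adds cross-validation, batch-wise handling and more careful predictor selection; this sketch only shows the core idea and requires the randomForest package:

# Conceptual sketch of SERRF (NOT the published implementation): for each
# compound, train a random forest on the QC samples to predict its
# intensity from correlated compounds, then divide out the predicted trend
library(randomForest)

serrf_sketch <- function(data, qc_idx, n_pred = 10) {
  data <- as.matrix(data)
  norm <- data
  for (i in seq_len(nrow(data))) {
    target <- data[i, ]
    others <- t(data[-i, , drop = FALSE])        # samples x other compounds
    cors   <- abs(cor(others, target))           # pick the most correlated
    pred   <- order(cors, decreasing = TRUE)[seq_len(min(n_pred, ncol(others)))]
    x      <- others[, pred, drop = FALSE]
    fit    <- randomForest(x = x[qc_idx, , drop = FALSE], y = target[qc_idx])
    trend  <- predict(fit, x)                    # predicted systematic error
    norm[i, ] <- target / trend * median(target) # divide out, keep the scale
  }
  norm
}

# Example call using the objects from the LOESS guide above (assumes
# `data` and `QCi` exist): data_serrf <- serrf_sketch(data, QCi)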

Guide for the SERRF normalization using the web-based environment

When you land on the main page of SERRF (https://slfan2013.github.io/SERRF-online/), you will find detailed information regarding the normalization capabilities of this algorithm. To access the tool itself, click the “Use SERRF!” link (highlighted by the red arrow).

The main webpage of the SERRF normalization tool.

You will be presented with a page where you can upload your dataset by clicking the "Choose File" button. During the creation of this tutorial, the main server was under maintenance, so a secondary server was used for the demonstration (accessible via a link on the website or here: https://slfan.shinyapps.io/ShinySERRF/).

Page for upload of the dataset to be normalized (main server).
Page for upload of the dataset to be normalized (alternative server).

The dataset you want to normalize needs to follow a standard format identical to the dataset used in the original publication (https://slfan2013.github.io/SERRF-online/SERRF%20example%20dataset.xlsx).

  1. The dataset must be in the first sheet of the uploaded file (the example dataset is an .xlsx workbook).

  2. The following labels must be included exactly as written (they are order sensitive and case sensitive); a sketch of assembling a table in this layout is shown below the list:

  • batch: can be defined as time intervals, machines, etc. Samples are often run on different machines, in different time periods or in different labs; these can all be treated as batches. Note that batch information is usually important for SERRF to perform well.

  • sampleType: either 'qc', 'sample' or 'validate'. The dataset must contain 'qc' and 'sample'; 'validate' is optional, for external quality-control validation samples. The minimum number of qc samples per batch is 5. Ideally, the qc/sample ratio exceeds 1:10, i.e. at least 1 qc per 10 samples.

  • time: the running sequence/injection order of the qc and samples. Must be a continuous number sequence.

  • label: the sample label corresponding to each column and the compound label corresponding to each row.

  • No: the compound number index (cells cannot be empty).

Example of the standard data table format for SERRF normalization (demonstrational file is available here: https://slfan2013.github.io/SERRF-online/SERRF%20example%20dataset.xlsx)
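As a hedged illustration of this layout (one plausible arrangement based on the list above; consult the example .xlsx for the authoritative cell-by-cell format before uploading real data), a minimal table can be assembled in R as follows:

# Toy illustration of the layout (sizes are far too small for real use:
# SERRF needs at least 5 QCs per batch, see above). Check the cell
# arrangement against the example .xlsx file before uploading real data.
n_smp <- 10; n_cmp <- 3
header <- rbind(
  c("",   "batch",      rep("A", 5), rep("B", 5)),          # batch per sample
  c("",   "sampleType", rep(c("qc", "sample"), 5)),         # qc / sample
  c("",   "time",       as.character(1:n_smp)),             # injection order
  c("No", "label",      paste0("Sample", 1:n_smp))          # sample labels
)
body <- cbind(as.character(1:n_cmp),                         # compound index
              paste0("Compound", 1:n_cmp),                   # compound labels
              matrix(round(runif(n_cmp * n_smp, 1e3, 1e5)), nrow = n_cmp))
write.table(rbind(header, body), "serrf_input.csv", sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)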

Once you upload your data in the correct format, you can click the start button and initiate the data normalization.

Uploaded data ready for normalization.

After successful data normalization, you will be notified and will be able to download the results.

Successfully normalized data will be available for download.

The results will include a report table showing the relative standard deviation for all analytes before and after applying SERRF, your normalized dataset and a graphical report.

Congratulations, you have successfully normalized your data using SERRF!
