
Partial Least Squares (PLS)


Partial least squares (PLS) regression is among the most frequently used statistical methods in metabolomics and lipidomics. In partial least squares, dimensionality reduction is performed first, and then, based on the set of newly obtained predictors, called PLS components, a linear regression equation is built to perform a defined task - classification or regression. The PLS components obtained through dimensionality reduction are linear combinations of our original variables, namely lipid or metabolite concentrations.
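
Formally, each PLS component $t$ is a weighted sum of the original variables $x_1, x_2, \dots, x_p$ (here - lipid or metabolite concentrations): $t = w_1 x_1 + w_2 x_2 + \dots + w_p x_p$, where the weights $w_j$ are chosen so that the components capture as much of the covariance between the predictors and the outcome as possible.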

The "predictor" is a very useful word when discussing PLS models. It refers to all variables used to predict an outcome. For instance, if we can use lipid concentrations to predict if someone has a condition, these become our predictors. We can also use PLS components for such predictions - then we will also call them predictors. The word "predictors" is used here quite frequently. Hence, the reminder.

Most likely, you have already come (or will soon come) across the abbreviation PLS-DA, which stands for partial least squares - discriminant analysis. It is a variant of partial least squares regression used for the classification of samples. PLS-DA can be trained to divide samples into biological groups based on the concentrations of metabolites or lipids measured in biological materials. Naturally, partial least squares can also be used for regression tasks, i.e., predicting a value, e.g., BMI, based on a set of measured lipid concentrations. The difference between PLS used for regression and for classification is that the former relates predictors to a continuous output (e.g., predicting BMI values), and the latter to a discrete output (e.g., predicting biological group labels).
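
This distinction maps directly onto code. Below is a minimal sketch using the plsmod/parsnip (tidymodels) interface introduced later in this chapter - only the mode differs between the two tasks (num_comp = 3 is purely illustrative):

# Calling libraries (plsmod provides pls() for the tidymodels ecosystem):
library(tidymodels)
library(plsmod)

# Classification (PLS-DA) - a discrete outcome, e.g., biological group labels:
pls_da <- pls(num_comp = 3) %>%
  set_engine("mixOmics") %>%
  set_mode("classification")

# Regression - a continuous outcome, e.g., BMI:
pls_reg <- pls(num_comp = 3) %>%
  set_engine("mixOmics") %>%
  set_mode("regression")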

As mentioned above, dimensionality reduction is performed in the first step of preparing the PLS-DA model. This allows complex data sets to be represented as two-dimensional, easily interpretable score plots. PLS also enables the selection of the most interesting variables.

We will use two (or rather three) excellent libraries here to work with PLS - caret, and tidymodels together with mixOmics. We will mostly focus on the PLS-DA variant, which is frequently applied in lipidomics and metabolomics. Let's begin.

Practical applications of PLS-DA (examples)

Before we begin preparing the PLS-DA model in R, take a look at the following real-life examples that utilize PLS in lipidomics and metabolomics:

  • Y. Huang et al. Lipid profiling identifies modifiable signatures of cardiometabolic risk in children and adolescents with obesity. DOI: https://doi.org/10.1038/s41591-024-03279-x - Fig. 3a & b (the authors of the study published in Nature Medicine use PLS-DA to analyze differences among three age groups within normal weight (a) and overweight/obesity categories (b); the ropls library was applied).

  • W.C. Wang et al. Metabolomics facilitates differential diagnosis in common inherited retinal degenerations by exploring their profiles of serum metabolites. DOI: https://doi.org/10.1038/s41467-024-47911-3 (the authors of the study published in Nature Communications apply PLS-DA to explore the differences in metabolomic profiles between IRDs and the control group).

  • J. von Gerichten et al. Single-Cell Untargeted Lipidomics Using Liquid Chromatography and Data-Dependent Acquisition after Live Cell Selection. DOI: https://doi.org/10.1021/acs.analchem.3c05677 (PLS-DA applied to investigate differences in lipidome coverage in cell-based studies (sample prep influence), including single-cell analysis).

  • Y. Wang & K.A. Lê Cao. PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data. DOI: https://doi.org/10.1093/bib/bbac622 (the authors of this study present a unique batch effect removal approach relying on PLS-DA, i.e., PLSDA-batch).

Training PLS-DA model using caret library

We will begin with the application of the caret library for training our PLS-DA. The caret package can be easily combined with other solutions, e.g., from tidymodels, making model building simple and understandable even for beginners. First, we must read our PDAC data set into R and adjust the column types. It is very important that the response variables (output variables) are factors, not characters. Below, we show which steps, and in what order, should be performed to prepare your data for building the PLS-DA model. These involve:

1) Splitting the whole data set into training and testing sets (sometimes called validation sets).

  • A training set is used to estimate model parameters, i.e., to build an equation for classifying samples into biological groups based on lipid or metabolite concentrations.

  • A testing set is used to check how the model is performing, i.e., imagine a testing set as a delivery of new data to the model to check if it does a good job classifying samples into biological groups.

2) Transforming, centering, and scaling the data set.

  • These operations aim to decrease the influence of outliers on model performance and prevent highly abundant lipids or metabolites from dominating the model.

  • We will log-transform, center, and autoscale the data. The final data for creating models will have a mean value of 0 and a standard deviation of 1.

  • Here, the function prep() in the four-step recipes procedure for data transformation/scaling is crucial. As you can see, we compute all the parameters needed for the transformations using the training set only: in the prep() function, the training argument uses the 'train' data set. This is because we want to avoid using the testing set for any procedure within model training. Parameters computed for data transformation/scaling with the testing set included could influence the final model and distort its validation - we could get a wrong idea about the performance of our model. This is why, in machine learning, data transformation/scaling is performed after the data split and relies only on statistical parameters computed from the 'train' data set.

Take a look at this code block:

# Training PLS-DA model.
# Installing libraries (if you do not have them yet):
install.packages('caret')
install.packages('tidymodels')

# Calling libraries:
library(tidymodels)
library(caret)

# Read the data in R. Adjusting column types:
data <- readxl::read_xlsx(file.choose())
data$`Sample Name` <- as.character(data$`Sample Name`)
data$Label <- as.factor(data$Label)

# Recheck the structure of data:
glimpse(data)

# Here, we won't need the column `Sample Names` - we remove it:
data.no.ID <- 
  data %>% 
  select(-`Sample Name`)
  
# We remove the patients with PAN and focus on the differences between N & T:
data.no.ID.N.T <- 
  data.no.ID %>% 
  filter(Label != "PAN") %>% 
  droplevels()
  
# Our data are tidy, and we can proceed to the data split.
# We put aside 30% of our data set that will be used to validate the PLS-DA model.
# This set must not be used to train the PLS model or normalize data.
# To make this fully reproducible, we set a specific seed value:
set.seed(1111)

# Now, we split our data set. 
# We use initial_split(), training(), and testing() functions from rsample (tidymodels):
data_split <- initial_split(data.no.ID.N.T, prop = .7, strata = "Label")

# We indicate: 
# 1) the data set we want to split,
# 2) the proportion - 0.7 - training vs 0.3 testing set,
# 3) and a variable for stratified sampling - our biological group column `Label`.

# Now, we can extract our training set:
train <- training(data_split)
 
# And testing set, respectively:
test  <- testing(data_split)

# You can recheck the data structure after the split:
str(train)
str(test)

# Now, it is time to normalize the data set.
# First, we compute normalization parameters using train data set and normalize it.
# Next, we normalize our test data set using the parameters from train data set.
# Here, we will rely on recipes we showed you before.
# Return to the Data Transformation, Scaling, and Normalization chapter for more details.

# Data transformation
# Create a recipe
rec <- recipe(Label ~ ., data = train)
rec

# Prepare the recipe:
recipe <- 
  rec %>% 
  step_log(all_numeric()) %>% 
  step_normalize(all_numeric())
recipe

# Compute parameters needed for normalization:
recipe.prepared <- prep(recipe, training = train)
recipe.prepared

# Bake the recipe for train and test data sets:
train.transformed <- bake(recipe.prepared, new_data = train)
train.transformed

test.transformed <- bake(recipe.prepared, new_data = test)
test.transformed

# Check the levels of your factor column - Label:
levels(train.transformed$Label)
levels(test.transformed$Label)

# We want our condition - T to be the main factor. 
# If that's not the case, relevel the Label column!
train.transformed <- mutate(train.transformed, Label = forcats::fct_relevel(Label, "T"))
levels(train.transformed$Label) 

test.transformed <- mutate(test.transformed, Label = forcats::fct_relevel(Label, "T"))
levels(test.transformed$Label)

# Quick glimpse at the transformed data set distributions:
GGally::ggpairs(train.transformed)
GGally::ggpairs(test.transformed)

# Now, our data are ready to build the PLS-DA model.

We can move to model training. Here, we will need two functions from the caret package:

  • trainControl() - specifies how resampling is performed and how performance is estimated across resamples,

  • train() - will be used to find the best fit PLS-DA model to the lipidomics data.

In trainControl(), we can specify the method for testing the performance of models ('repeatedcv') and the settings of this method (arguments: number, repeats, etc.).

One of the most frequently used approaches for estimating model performance is k-fold cross-validation. In this approach, the training data set is split into k blocks (folds); k-1 of them are used for model training, while one is held out for testing the performance. The process is repeated until every block has been held out once, and the final performance is averaged across the k outcomes. In the extreme case, when a single observation is held out to test the performance, we talk about Leave-One-Out Cross-Validation (LOOCV). Another variant of this method is repeated k-fold cross-validation. Here, the data set is split n times into k blocks in different ways, e.g., 5-fold cross-validation with five repetitions results in five different ways of splitting the data into five blocks. We will use repeated k-fold cross-validation to analyze the performance of our PLS-DA. Take a look at the StatQuest video by Josh Starmer about k-fold cross-validation to understand this approach better; a short sketch of what such resamples look like follows below.
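
A minimal sketch of repeated k-fold cross-validation resamples, using the rsample package (tidymodels); it assumes the 'train' tibble created in the code block above, and the fold/repeat counts are illustrative:

# Repeated k-fold cross-validation resamples with rsample:
library(rsample)
set.seed(1111)
cv <- vfold_cv(train, v = 7, repeats = 10)   # 7 folds, repeated 10 times
cv                                           # 70 resamples in total

# Each resample keeps ~6/7 of the rows for fitting and ~1/7 for assessment:
analysis(cv$splits[[1]])     # the k-1 training blocks of the first resample
assessment(cv$splits[[1]])   # the held-out block of the first resample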

As we face a two-class problem (healthy individuals vs. patients with PDAC), we will use the area under the receiver operating characteristic (ROC) curve as the metric of PLS-DA performance. ROC curves present the model's performance across the entire range of thresholds. The higher the area under the ROC curve (the closer to 1), the better the performance of our PLS-DA; conversely, the closer to 0.5, the worse the performance. More about ROC curves will be presented below, when we analyze the performance of PLS-DA on the lipidomics testing set.

Let's change all the text from above into code and find the best fit PLS-DA model for our lipidomics data set:

# Training, tuning, and selecting the best PLS-DA model with caret.
# First, we set the basic parameters for evaluating every PLS-DA.
# Calling the caret library:
library(caret)

# Reading the documentation:
?trainControl()
?train()

# Selecting the parameters for evaluation of every model:
control <- trainControl(method = "repeatedcv",    # We use repeated k-fold cross-validation.
                        number = 7,             # As the number of folds, we set 7 (7 blocks ~ 20 samples/block).
                        repeats = 10,           # We want to split our data into 7 blocks in 10 different ways.
                        classProbs = T,         # We want to compute class probabilities.
                        summaryFunction = twoClassSummary)  # We set the function to estimate performance across resamples.
                        
# We need to create X and Y for training our model. 
# Look for explanations in comments to train() function.
X <- 
  train.transformed %>% 
  select(-Label)
  
Y <- train.transformed$Label

# Fitting our model over different tuning parameters:
PLSDA <- train(x = X,            # Our predictors - lipids. Can be matrix or data frame. Samples must be in rows, variables in columns with column names!
               y = Y,            # Outcome variable: N or T. Must be numeric or factor vector.
               method = 'pls',   # We want PLS algorithm.
               tuneLength = 10,   # We will tune the PLS across 1-10 components included in the final equation.
               trControl = control,   # Model performance based on trainControl() - from above.
               metric = 'ROC')       # The model with the highest area under the ROC curve will be selected.

Now, your best-fit model should be available. Let's check the final settings of our PLS equation. Run this simple line of code:

# Checking the PLS-DA tuning. 
# ROC depending on the number of components in the equation.
plot(PLSDA, main = 'PLS-DA model tuning', xlab = 'Number of components')

The plot (PLS-DA model tuning - the highest area under the ROC curve was found for the equation containing three PLS components):

# Exact values can be accessed through:
PLSDA$results
# Best tune = number of components in the final PLS-DA equation:
PLSDA$bestTune

The output:

> PLSDA$bestTune
  ncomp
3     3

If you want to check the data frame with the performance metrics for each resample, run:

# Data frame with performance metric for each resample:
PLSDA$resample

Let's now access the final model and extract interesting pieces of information, such as scores. Using scores, we can create a low-dimensional representation of our training data set. We expect a separation of biological groups in this low-dimensional representation, suggesting differences in lipid levels between our two biological groups. If the differences are pronounced enough, we would expect our PLS-DA to correctly classify samples from the testing set into healthy individuals and patients with PDAC.

Such an analysis provides initial evidence that the set of predictors responsible for separating controls from cancer patients could be further tested as candidate cancer biomarkers. However, finding predictors that differ between two conditions does not automatically make them biomarkers. Alterations in plasma/serum lipid levels are currently being thoroughly tested for their potential application in supporting cancer diagnostics. It is a long process, which only begins with a discovery made using methods like PLS-DA.

Below, we will generate the low-dimensional representation of our train data set, a loadings plot with the weights contributing to the first two PLS components (as we hope to see a separation of N from T in the first two components), and plots presenting the most valuable lipid species from the PLS-DA point of view.

Below, you will find the code for creating the scores plot:

# Extracting PLS scores:
PLSDA$finalModel$scores

# We turn them into a data frame so the scores plot can be prepared:
scores <- as.matrix.data.frame(PLSDA$finalModel$scores)
scores <- as.data.frame(scores)

# We will add the column with biological groups - so we can color N and T:
scores$Label <- train.transformed$Label

# We will also compute the variance explained by the first two components.
# To access the variance explained by first two components run:
PLSDA$finalModel$Xvar

After running this line, we obtain:

> PLSDA$finalModel$Xvar
  Comp 1   Comp 2   Comp 3 
4375.906 3586.850 2185.271 

Now, we need the total variance explained:

# Extracting total variance explained:
PLSDA$finalModel$Xtotvar

And we obtain:

> PLSDA$finalModel$Xtotvar
[1] 18034

Now, let's compute the percentage of variance explained by PLS components 1 and 2, so we can add it to the scores plot:

# Variance explained by Component 1 = 4376, Component 2 = 3587, Total variance = 18034
# Relative variance explained by Component 1 
(4376/18034)*100 # in [%]
# Relative variance explained by Component 2
(3587/18034)*100 # in [%]

The scores plot code using ggplot2:

# Scores plot from ggplot2:
ggplot(scores, aes(x = V1, y= V2, color = Label)) + 
    geom_point(size = 3)+
  scale_color_manual(values = c('red2', 'royalblue'))+
  theme_bw()+
  theme(axis.text.x = element_text(size = 12, 
                                   color = 'black'),
        axis.text.y = element_text(size = 12, 
                                   color = 'black'),
        legend.text = element_text(size = 12,
                                   color = 'black'),
        legend.title = element_text(size = 12, 
                                    color = 'black'))+
  labs(x = "Component 1 (var explained = 24.3%)",
       y = "Component 2 (var explained = 19.9%)"
       )+
  ggtitle('PLS-DA scores plot')
  

# It is a simple geom_point-based plot. There is nothing new in this code for you!

The plot (PLS-DA scores plot showing the separation of pancreatic cancer patients from healthy volunteers based on alterations in the serum lipidome):

We can similarly extract the loadings and plot them with ggplot2:

# Loadings plot
# Extract loadings from the caret model
loadings <- as.matrix.data.frame(PLSDA$finalModel$loadings)
loadings <- as.data.frame(loadings)

# Adding lipid names to the loading plot
loadings$Lipids <- colnames(train.transformed %>% select(-Label))

# The loadings plot code:
loadings.weights.plot <- ggplot(loadings, aes(x = V1, y= V2)) + 
  geom_point(size = 3)+
  ggrepel::geom_text_repel(aes(label = Lipids), 
                           size = 3.5, 
                           min.segment.length = unit(0.3, "lines"), 
                           max.overlaps = getOption("ggrepel.max.overlaps", 
                                                    default = 10)) +
  theme_bw()+
  theme(axis.text.x = element_text(size = 12, 
                                   color = 'black'),
        axis.text.y = element_text(size = 12, 
                                   color = 'black'),
        legend.text = element_text(size = 12,
                                   color = 'black'),
        legend.title = element_text(size = 12, 
                                    color = 'black'))+
  labs(x = "Loading weights for Component 1",
       y = "Loadings weights for Component 2"
  )+
  ggtitle('PLS-DA loading weights')

loadings.weights.plot

# Here, there is one new element.
# We want to annotate points corresponding to lipids.
# To achieve this, we will need geom_text_repel() layer from the ggrepel library.
# Install this library - if you do not have it yet!
# Using geom_text_repel layer:
# 1) we add new aesthetics - label corresponding to lipid names,
# 2) we can specify size, presence of connectors, and maximum.overlaps.

Our scores plot and loadings plot together. Right chart - the PLS-DA scores plot showing the separation of pancreatic cancer patients from healthy volunteers; left chart - the PLS loadings (weights), presenting the lipid species with the largest contributions to the PLS components (bottom left corner - long-chain SM species SM 39:1;O2, 41:1;O2, 42:1;O2, LPC species, PC O-/P- species, and PC species; top right corner - TG species, MG species, and Cer species):

We can check the lipid species that PLS-DA relies on when distinguishing between the serum of healthy volunteers and patients with pancreatic cancer. Usually, we observe a difference in the levels of these lipids between healthy individuals and patients; hence, these lipids may have interesting biological roles in pancreatic cancer. Such lipid species can be presented using variable importance plots (VIP). We will show you three ways to obtain such plots:

# VIP plot
## Option 1: Simple code

## Simple bar chart:
ggplot(varImp(PLSDA), top = 35) +
  geom_col(fill = 'red2') +
  theme_bw() +
  ggtitle('VIP plot for PLS-DA Controls vs PDAC')

## Simple lollipop chart:
plot(varImp(PLSDA), top = 30, main = 'VIP plot for PLS-DA Controls vs PDAC')

The simple VIP plots from Option 1 - bar chart (left) and lollipop chart (right). The higher the importance, the more significant the lipid for the PLS-DA components; LPC, Cer, PC, SM, and PC O-/P- species are among the most significant lipids:

We can also prepare an elegant ggpubr lollipop chart:

# VIP plot
# Option 2: ggpubr package.
# Call library:
library(ggpubr)

# Prepare 30 top VIP scores scaled as 100.
# Separate lipid importance from the PLSDA model.
var.imp <- 
  varImp(PLSDA)$importance %>% 
  as.data.frame()
  
# Add column with lipid names:
var.imp$Lipids <- rownames(var.imp)

# Arrange the importance in descending order:
var.imp <- 
  var.imp %>% 
  arrange(desc(Overall))
  
# Slice head to var.imp tibble with 30 most important lipids for PLS-DA components:
var.imp <- 
  var.imp %>% 
  slice_head(n = 30)
  
# Create ggpubr chart:
ggdotchart(var.imp,                            # Tibble with importance of lipids,
           x = "Lipids",                       # Lipid names,
           y = "Overall",                      # Value of importance,
           sorting = "descending",             # Descending sorting of importance,
           color = "Overall",                  # Continuous color scaling using importance,
           add = "segments",                   # Add grey lines to dots (lollipop chart),         
           rotate = TRUE,                      # Rotate plot (switch x and y),          
           dot.size = 4,                       # Size of dots,                    
           ggtheme = theme_pubr(),             # Plot theme,
           title = 'VIP plot for PLS-DA Controls vs PDAC \n (via ggpubr)',     # Plot title. Using \n you can start a new line.
           ylab = "Importance",       # y-axis title,
           xlab = "Lipid species",    # x-axis title,
           legend = "right",          # legend on the right side,
           legend.title = "Importance") +       # Title of the legend,
  scale_color_gradient(low = "blue", high = "red2")       # Continuous scaling of color with low values in blue and high in red2.

The final plot - a lollipop chart presenting the importance of variables for our PLS-DA, with continuous color scaling from red2 (high importance) to blue (low importance):

Finally, we will use our model for predictions, i.e., we use PLS-DA to classify samples into biological groups based on lipid concentrations measured in serum samples. In effect, we hope to see the correct classification, meaning that the model will recognize all healthy as healthy and all patients with PDAC - as patients with PDAC. This would suggest that the condition (pancreatic cancer) causes specific alterations in serum lipid profile, which are significant enough to differentiate patients with PDAC from healthy volunteers.

However, there are two important questions. Based on what factor do we decide on the classification, and what is its cut-off value? And how do we assess the correctness of the classification?

In the first case, we will ask the predict() function to create an output containing "probability-like" values ranging from 0 to 1. The default cut-off is 0.5. The class with a larger probability-like value is the class predicted by the model.
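
As a minimal illustration of this cut-off logic (the prob_T values below are hypothetical - in practice, they come from predict(), as shown later in this section):

# Converting probability-like outputs into class labels at a chosen cut-off:
prob_T <- c(0.91, 0.12, 0.63, 0.48)    # illustrative probabilities of class T
threshold <- 0.5                       # the default cut-off
predicted <- factor(ifelse(prob_T >= threshold, "T", "N"), levels = c("N", "T"))
predicted                              # T N T N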

To help us assess the correctness of the classification, we need to introduce additional terms: accuracy, sensitivity, and specificity.

Accuracy informs you how often the model is correct in its predictions. To compute accuracy, we can use the following formula:

$$ACC = \frac{\text{Number of true predictions}}{\text{Number of all predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$
  • TP - true positives, e.g., all correctly predicted patients with PDAC (T),

  • TN - true negatives, e.g., all correctly predicted healthy individuals (N),

  • FP - false positives, e.g., all incorrectly predicted patients with PDAC - in reality, they are healthy individuals (N),

  • FN - false negatives, e.g., all incorrectly predicted healthy individuals - in reality, they are patients with PDAC.

Sensitivity tells us how well our PLS-DA detects patients with PDAC (true positives = TP). It is calculated as:

$$Sensitivity = \frac{TP}{TP + FN}$$

Specificity in this case will inform us about the correct detection of healthy individuals (true negatives = TN):

$$Specificity = \frac{TN}{TN + FP}$$
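
As a worked example, here is how these three metrics follow from the confusion-matrix counts we will obtain later in this section (TP = 27, TN = 29, FP = 1, FN = 6):

# Computing accuracy, sensitivity, and specificity from confusion-matrix counts:
TP <- 27; TN <- 29; FP <- 1; FN <- 6
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 56/63 ~ 0.889
sensitivity <- TP / (TP + FN)                   # 27/33 ~ 0.818
specificity <- TN / (TN + FP)                   # 29/30 ~ 0.967
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)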

Usually, after the model performs the prediction of class labels, the results are presented in the so-called 'confusion matrix'. The figure below presents how the confusion matrix should be interpreted:

We again encourage you to watch the StatQuest by Josh Starmer about the interpretation of confusion matrices:

Let's use our PLS-DA for predictions and create a confusion matrix using the caret function confusionMatrix().

# Creating a confusion matrix through caret package.
# Step 1 - model matrix for your test set.
# Here, we will design a specific matrix that can be used for predictions.
x.test <- model.matrix(Label~.-1, test.transformed)

# Also, prepare such a matrix for the train set:
x.train <- model.matrix(Label~.-1, train.transformed)

# model.matrix() adds column 'intercept'.
# The -1 behind "." is used to remove it.

# Let's make our predictions and create a confusion matrix in one line:
confusionMatrix(predict(PLSDA$finalModel,x.test), as.factor(test.transformed$Label))

# In the first part: predict(PLSDA$finalModel, x.test) - we use PLS-DA for predicting Label.
# In the second part: as.factor(test.transformed$Label) - we deliver true biological groups. 

We obtain the following output - the confusion matrix generated by the caret function confusionMatrix():

Based on this result, we see that of all 33 PDAC patients in the testing set, 27 were correctly assigned by the model as PDAC patients (TP), and 6 were incorrectly classified as healthy individuals (FN). Hence, our PLS-DA model correctly detected ~82% of the PDAC cases in the test set (sensitivity). Of all 30 healthy individuals included, 29 were correctly classified as N (TN) and 1 incorrectly as T (FP), resulting in a specificity of ~97% (we correctly detected ~97% of the controls in our test set). Overall, our PLS-DA model correctly classified ~89% of all samples in the test set (56 samples) and ~11% incorrectly (7 samples).

The remaining statistics are:

  • Pos Pred Value is the ratio of all correctly assigned T to all assigned T, namely = 27/28.

  • Neg Pred Value is the ratio of all correctly assigned N to all assigned N, namely = 29/35.

  • Prevalence indicates what percentage of all individuals in the test set are patients with PDAC, namely = 33/63.

  • Detection rate is the ratio of all TP detected by PLS-DA (all correctly assigned PDAC cases by PLS-DA) to the number of all individuals in the test set, namely = 27/63.

  • Detection prevalence is the ratio of all positive cases detected by PLS-DA (TP + FP) to the number of all individuals in the test set, namely = 28/63.

  • Balanced accuracy is simply the arithmetic mean of the sensitivity and specificity values, i.e., (0.8182 + 0.9667)/2.

Our confusion matrix also presents the 95% confidence interval (CI) for accuracy, ranging between 0.7844 and 0.9541 (a reasonable range considering the 63 samples in the test set). We see that the no-information rate is 0.5238, and according to the statistical test, our accuracy is significantly higher than the no-information rate.

If you would like to avoid presenting confusion matrices in your manuscript, the accuracy, sensitivity, and specificity can instead be turned into bar charts, e.g., using the ggpubr or ggplot2 libraries, as in the sketch below.
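
A possible ggplot2 sketch of such a bar chart (the values come from the confusion matrix above):

# Bar chart of PLS-DA performance metrics on the test set:
metrics <- data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity"),
  Value  = c(0.889, 0.818, 0.967)
)

ggplot(metrics, aes(x = Metric, y = Value, fill = Metric)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("grey40", "red2", "royalblue")) +
  coord_cartesian(ylim = c(0, 1)) +
  theme_bw() +
  labs(title = "PLS-DA performance (test set)", x = NULL, y = "Value")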

Fundamentals of Receiver Operating Characteristic (ROC) curves. Application of ROCs to evaluate PLS-DA performance.

Remember that the confusion matrix shows results at a specific threshold (cut-off) of 0.5 probability. However, in some cases, it may be beneficial to select a different cut-off: if we want to perform a stricter classification, we can increase the threshold, while if we consider 0.5 too strict, we can lower it. Analyzing every single confusion matrix obtained for different thresholds would be time-consuming, and it is certainly not effective to present 5 or more confusion matrices in a manuscript with your metabolomics or lipidomics data. The solution here is the receiver operating characteristic (ROC) curve: we change the threshold across a broad range of values, compute the sensitivity and specificity at each threshold, and plot the results against each other. In summary, an ROC curve is used to evaluate the ability of a metabolite (or a model) to distinguish between a condition and a control by plotting sensitivity against 1 - specificity at various threshold levels; the sketch below illustrates this mechanic.
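
A conceptual sketch of how the points of an ROC curve arise (the labels and probabilities below are purely hypothetical):

# Sensitivity and 1 - specificity computed at a few thresholds by hand:
labels <- factor(c("N", "N", "N", "T", "T", "T"), levels = c("N", "T"))
probs  <- c(0.10, 0.35, 0.60, 0.40, 0.75, 0.90)

for (th in c(0.25, 0.50, 0.75)) {
  pred <- probs >= th
  sens <- sum(pred & labels == "T") / sum(labels == "T")
  spec <- sum(!pred & labels == "N") / sum(labels == "N")
  cat(sprintf("cut-off %.2f: sensitivity %.2f, 1 - specificity %.2f\n",
              th, sens, 1 - spec))
}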

Watch the StatQuest video by Josh Starmer to understand how ROC curves (and the AUC concept) are created and interpreted:

ROC curves are broadly used in lipidomics and metabolomics to present either classifier performance across a broad range of thresholds, e.g., (O)PLS-DA models, decision trees or random forests, or logistic regression classifiers. Alternatively, a strongly differentiating metabolite (or biomarker) can often be represented using the ROC curve, illustrating its effectiveness in distinguishing patients from healthy controls based on its concentrations in collected biological samples. If a metabolite's AUC reaches 1, its levels in all patient-derived samples are consistently higher (or lower) than those in healthy controls, achieving perfect separation between the groups.

Refer to the following sources for practical applications of ROC curves in metabolomics and lipidomics studies:

  • Y. Zhao et al. NMR and MS reveal characteristic metabolome atlas and optimize esophageal squamous cell carcinoma early detection. DOI: https://doi.org/10.1038/s41467-024-46837-0 - Fig. 6 & 7 (multiple ROC curves presented in both figures; here, e.g., the authors compare ROC curves relying on metabolites (potential biomarkers) for esophageal squamous cell carcinoma detection).

  • S. Salihovic et al. Identification and validation of a blood-based diagnostic lipidomic signature of pediatric inflammatory bowel disease. DOI: https://doi.org/10.1038/s41467-024-48763-7 - Fig. 4B (diagnostic prediction of pediatric inflammatory bowel disease by lipidomic markers).

  • M. Amrutkar et al. Global metabolomic profiling of tumor tissue and paired serum samples to identify biomarkers for response to neoadjuvant FOLFIRINOX treatment of human pancreatic cancer. DOI: https://doi.org/10.1002/1878-0261.13759 - Fig. 6 (comparing multiple ROC curves based on metabolites; application: identification of potential markers of treatment response in PDAC).

  • D. Wolrab et al. Lipidomic profiling of human serum enables detection of pancreatic cancer. DOI: https://doi.org/10.1038/s41467-021-27765-9 - Fig. 3e, f; Fig. 4g, h, i, j; Fig. 7a, g (ROC curves presenting the classification performance of OPLS-DA models relying on serum lipid concentrations for the detection of PDAC).

  • D. Wolrab et al. Plasma lipidomic profiles of kidney, breast and prostate cancer patients differ from healthy controls. DOI: https://doi.org/10.1038/s41598-021-99586-1 - Fig. 3I, J, K, L; Fig. 5A-L (ROC curves illustrate the classification performance of OPLS-DA models trained on plasma lipidomics data to differentiate between patients with kidney, breast, and prostate cancer and healthy controls).

Before we create the ROC curve, it is essential to understand that:

1) The higher the area under the curve (AUC), the better our PLS-DA performs, e.g., a PLS-DA with an AUC of 0.7 is a worse classifier than a PLS-DA with an AUC of 0.95. Also, if in the future you test more than one type of model for the classification of your samples, the one with the highest AUC should be your choice.

2) The closer the AUC is to 0.5, the worse the performance of our PLS-DA. An AUC of 0.5 tells us that our PLS-DA basically does not work, and its predictions are as random as a coin flip.

3) Using the ROC curve, you can find the best threshold. The 'optimal point' is usually defined as the one with an x-coordinate as close to 0 as possible and a y-coordinate as close to 1 as possible. However, any threshold corresponding to a point close to the (0, 1) corner may be a good choice - pROC can report such thresholds directly, as shown further below.

Let's create the ROC curve for our PLS-DA model. Here, we will use a package called pROC.

# Creating the ROC curve in R with the pROC package.
# Installation of the pROC package:
install.packages("pROC")

# Calling the pROC library:
library(pROC)

# Now, we need our predictions with probability values:
test_prob <- predict(PLSDA$finalModel, x.test, type = 'prob')
test_prob <- as_tibble(test_prob)

# We create a tibble with probability values.
# And separate the vector with probabilities of interest:
vector.test.prob <- test_prob$`T.3 comps` # Here, low probability means N and high T.

# We need to prepare a plotting space for our ROC by running:
par(pty="s")

# ROC code:
roc(test.transformed$Label ~ vector.test.prob, # Actual biological groups vs. model predictions.
    plot = TRUE,                               # We need the ROC plot.
    print.auc = TRUE,                          # We want the AUC in the plot.
    legacy.axes = TRUE,                       # With legacy.axes = TRUE, the x-axis shows 1 - specificity (false positive rate), increasing from 0 to 1.
    levels = c("N", "T"))                    # Levels of responses - first controls, then cases.

After running the code, we obtain in the console:

> roc(test.transformed$Label ~ vector.test.prob, 
+     plot = TRUE, 
+     print.auc = TRUE, 
+     legacy.axes = TRUE, 
+     levels = c("N", "T"))
Setting direction: controls < cases

Call:
roc.formula(formula = test.transformed$Label ~ vector.test.prob,     plot = TRUE, print.auc = TRUE, legacy.axes = T, levels = c("N",         "T"))

Data: vector.test.prob in 30 controls (test.transformed$Label N) < 33 cases (test.transformed$Label T).
Area under the curve: 0.9313

And the ROC curve for our test set, obtained using the pROC package:

With an AUC of 0.931, we see that our PLS-DA performs well when classifying samples into healthy volunteers and cancer cases. We can also check whether, at the threshold of 0.5, we obtain the same sensitivity and specificity as computed for our confusion matrix:

# Checking the ROC at threshold 0.5.
# We store the ROC as 'test_roc':
test_roc <- 
roc(test.transformed$Label ~ vector.test.prob, 
    plot = TRUE, 
    print.auc = TRUE, 
    legacy.axes = TRUE, 
    levels = c("N", "T"))

# Check all thresholds:
test_roc$thresholds

# The threshold of 0.5 corresponds to entry no. 36
# Now, we can check the sensitivity and specificity for entry no. 36:
test_roc$sensitivities
test_roc$specificities

# In the output we find: sensitivity - 0.81818182, specificity - 0.96666667.
# These values are the same as in the confusion matrix.
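
Related to point 3) above, pROC can also report candidate 'best' thresholds directly. A short sketch, assuming the test_roc object from the block above:

# The 'best' threshold according to Youden's J statistic (sensitivity + specificity - 1):
coords(test_roc, "best", best.method = "youden")

# Or the threshold whose point lies closest to the top-left corner (0, 1) of the ROC plane:
coords(test_roc, "best", best.method = "closest.topleft")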

Sometimes, manuscript authors present the model's performance based on the classification of samples from the training set and the testing set together. We can prepare such ROC curves in R as well:

# ROC curves for PLS-DA performance on training and testing set:
## Both ROC curves plotted together in one x-y plane 
      # Prepare plotting space 
      par(pty="s")

      # Prepare probabilities for plotting
      # Test set
      test_prob <- predict(PLSDA$finalModel, x.test, type = 'prob')
      test_prob <- as_tibble(test_prob)
      vector.test.prob <- test_prob$`T.3 comps`
      test_roc <- roc(test.transformed$Label ~ vector.test.prob, plot = TRUE, print.auc = TRUE, legacy.axes=T, levels=c("N", "T"))

      # Train set
      train_prob <- predict(PLSDA$finalModel, x.train, type = "prob")
      train_prob <- as_tibble(train_prob)
      vector.train.prob <- train_prob$`T.3 comps`
      train_roc <- roc(train.transformed$Label ~ vector.train.prob, plot = TRUE, print.auc = TRUE, legacy.axes=T, levels=c("N", "T"))

      # Plot both ROC curves in one x-y plane.
      # We use add = TRUE, and specify the title (main) and AUC positions:
      plot(train_roc, col = 'darkblue', main = "ROC - Controls vs PDAC", print.auc=T, legacy.axes=T)
      plot(test_roc, col = 'royalblue', add = TRUE, print.auc=TRUE, print.auc.y=.4, legacy.axes=T)

The output - ROC curves presenting the PLS-DA performance for the classification of N and T on the train set (dark blue) and the test set (light blue), plotted in one x-y plane:

Or we can plot two ROC curves next to each other:

# ROC curves for PLS-DA performance on training and testing set.
 ## Two ROCs next to each other
      # Prepare plotting space
      par(mfrow = c(1, 2))
      
      # ROC curves
      plot(train_roc, col = 'darkblue', main = "ROC Training Set", print.auc=T, print.auc.x = 0.6, cex.main = 1)
      plot(test_roc, col = 'royalblue', main = "ROC Testing Set", print.auc=TRUE, print.auc.x = 0.6, cex.main = 1) 

The output - two ROC curves plotted next to each other: the left one presents the PLS-DA performance on the train set (dark blue), the right one on the test set (light blue):

You can also estimate the 95% confidence intervals (95% CI) for AUC values by running these lines of code:

# 95% confidence intervals (CI) for AUC
# Train set
set.seed(1234)
ci.auc(train_roc, method="bootstrap")
# Test set
set.seed(1234)
ci.auc(test_roc, method="bootstrap")

The output:

> # 95% confidence intervals (CI) for AUC
> # Train set
> set.seed(1234)
> ci.auc(train_roc, method="bootstrap")
95% CI: 0.9666-0.9973 (2000 stratified bootstrap replicates)
> # Test set
> set.seed(1234)
> ci.auc(test_roc, method="bootstrap")
95% CI: 0.8555-0.9848 (2000 stratified bootstrap replicates)
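
By default, ci.auc() uses DeLong's method rather than bootstrapping, so you can also obtain the confidence intervals without setting a seed:

# DeLong-based 95% CI for the AUC (the pROC default, non-random):
ci.auc(train_roc)
ci.auc(test_roc)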

Training PLS-DA model using tidymodels library

An alternative to the caret package is to fit the PLS-DA model to our lipidomics data using the tidymodels collection. The purpose of tidymodels is to unify the whole workflow of model preparation. Depending on the package you select for preparing your model, this process can look entirely different, even though the outcome is the same model - tidymodels aims to change this situation. Here, we will train the PLS-DA from mixOmics through a unified tidymodels pipeline.

First, we need to install the missing libraries. Besides tidymodels, we will need plsmod, and we additionally install the mixOmics package from Bioconductor:

# Installation of the plsmod library
install.packages("plsmod")

# The mixOmics package is a Bioconductor package and cannot be installed through CRAN.
# Highlight the entire code and run it:
  
  if (!require("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  
  remotes::install_bioc("mixOmics")

# Calling libraries:
library(tidymodels)
library(plsmod)

We begin with tidymodels packages rsample and recipes to prepare our data set:

# Data preparation for PLS-DA with tidymodels.
# Step 1 - we read our data in R:
data <- readxl::read_xlsx(file.choose())

# Step 2 - we adjust variable types:
data$`Sample Name` <- as.character(data$`Sample Name`)
data$Label <- as.factor(data$Label)

# Step 3 - we drop the unnecessary `Sample Name` column:
data.no.ID <- 
  data %>% 
  select(-`Sample Name`)
  
# Step 4 - we remove all patients with pancreatitis (PAN group):
data.no.ID.N.T <- 
  data.no.ID %>% 
  filter(Label != "PAN") %>% 
  droplevels()
  
# Step 5 - we split our data into train and test data sets (rsample):
set.seed(1111)
data_split <- initial_split(data.no.ID.N.T, prop = .7, strata = "Label")
train <- training(data_split)
test  <- testing(data_split)

# Recheck data post-split
str(train)
summary(train)
str(test)
summary(test)

# Step 6 - the log-transformation of our data set and normalization (recipes):
PDAC_rec <- 
  recipe(Label ~ ., data = train) %>%
  step_log(all_numeric()) %>%
  step_normalize(all_numeric())
  
# Within the same step, we prepare train.transformed & test.transformed:
PDAC_rec_prep <- prep(PDAC_rec, training = train)
train.transformed <- bake(PDAC_rec_prep, new_data = train)
test.transformed <- bake(PDAC_rec_prep, new_data = test)

# We will need these data sets after tuning our PLS-DA model.
  
# Step 7 - we split our data for k-fold cross-validation (rsample):
set.seed(1111)
PDAC_folds <-  vfold_cv(train, v = 7)

Now that we have all the necessary data sets, we can tune our model:

# Tuning PLS-DA with parsnip and tune packages:
PLSDA.tidymodels <- 
  pls(num_comp = tune()) %>%          # Here, we define the model - PLS, and we set num_comp to tune() - we will tune this parameter.
  set_mode("classification") %>%     # We select the classification mode (we want PLS-DA), but if the outcome variable is continuous go for "regression".
  set_engine("mixOmics")           # The engine is naturally mixOmics package.
  
# We create a tibble with a range of num_comp (PLS components) for tuning:
grid <- tibble(num_comp = seq(from = 1, to = 10, by = 1))

# Tuning of PLS-DA model:
Tuning <- 
  workflow() %>% 
  add_recipe(PDAC_rec) %>%
  add_model(PLSDA.tidymodels) %>%
  tune_grid(resamples = PDAC_folds, grid = grid)
  
# To see the summary of the number of components tuning, we run:
Tuning.summary <- collect_metrics(Tuning)

The output - a summary of the PLS-DA parameter tuning (here, the number of components):

We conclude that the model built based on three components has the highest AUC value. If you prefer to visualize this table, take a look at the code below and the output:

# Visualizing tuning results:
ggplot(Tuning.summary, 
       aes(x = num_comp, 
           y = mean, 
           color = .metric)) +
  geom_point(aes(shape = .metric), 
             size = 3, 
             show.legend = F) +
  geom_line(aes(group = .metric)) +
  scale_x_continuous(breaks = seq(from = 1, to = 10, by = 1)) + 
  ggsci::scale_color_startrek() +
  scale_shape_manual(values = c(18,19,20)) +
  theme_bw() +
  labs(title = "PLS-DA tuning via tidymodels",
       subtitle = "Tune of number of components",
       x = "Number of components",
       y = "Accuracy or AUC value",
       color = "Metric")

The plot (results of the PLS-DA model tuning via tidymodels, presented with ggplot2):

We again see that the highest area under the ROC curve was obtained for three components. Hence, in the final fit, we will set num_comp to three.
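
Instead of reading the optimal value off the table or the plot, you can also let the tune package pick it. A short sketch, assuming the Tuning object created above:

# Selecting the number of components with the highest AUC programmatically:
best <- select_best(Tuning, metric = "roc_auc")
best   # for this data set: num_comp = 3

We set the final model and fit it: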

# Setting final PLS-DA model:
PLSDA.final <- 
  parsnip::pls(num_comp = 3) %>%
  set_engine("mixOmics") %>%
  set_mode("classification")
  
# Model fit:
PLSDA.fit <- 
  PLSDA.final %>%  
  fit(Label~., data = train.transformed)

Finally, we can use our PLS-DA model for predictions:

# Predicting training set:
predict(PLSDA.fit, new_data = train.transformed, type = "prob")
# Predicting testing set:
predict(PLSDA.fit, new_data = test.transformed, type = "prob")

Let's create ROC curves for the mixOmics model trained through tidymodels:

# Calling pROC library:
library(pROC)

# Prepare probabilities for plotting
# Test set
test_prob <- predict(PLSDA.fit, test.transformed, type = 'prob')
test_prob <- as_tibble(test_prob)
vector.test.prob <- test_prob$.pred_T    
test_roc <- roc(test.transformed$Label ~ vector.test.prob, plot = F, print.auc = TRUE, legacy.axes=T, levels=c("N", "T"))

# Train set
train_prob <- predict(PLSDA.fit, train.transformed, type = "prob")
train_prob <- as_tibble(train_prob)
vector.train.prob <- train_prob$.pred_T
train_roc <- roc(train.transformed$Label ~ vector.train.prob, plot = F, print.auc = TRUE, legacy.axes=T, levels=c("N", "T"))

## Two ROCs next to each other
# Prepare plotting space
par(mfrow = c(1, 2))

# ROC curves
plot(train_roc, col = 'darkblue', main = "ROC Training Set", print.auc = T, print.auc.x = 0.6, cex.main = 1)
plot(test_roc, col = 'royalblue', main = "ROC Testing Set", print.auc = T, print.auc.x = 0.6, cex.main = 1) 

We obtain the familiar ROC curves - this time for the mixOmics PLS-DA model trained through the tidymodels pipeline:

In summary, we obtained the same model as via caret library, but this time through the unified tidymodels pipeline.

What about the scores or loadings plots if we use tidymodels for preparing our PLS-DA? If you would like to recover the scores for a scores plot, you can run the code presented below. After the scores are extracted, you proceed the same way as for the caret package.

# Recovering scores for scores plot:
scores <- as_tibble(PLSDA.fit$fit$variates$X)

# Adding `Label` column to color points in the scores plot:
scores$Label <- train.transformed$Label

# Checking types of columns
str(scores)

# Relevel levels of factors in scores if necessary - T set as target factor:
scores <- mutate(scores, Label = forcats::fct_relevel(Label, "T"))

# Scores plot
scores.plot <- ggplot(scores, aes(x = comp1, y= comp2, color = Label)) + 
  geom_point(size = 3)+
  scale_color_manual(values = c('red2', 'royalblue'))+
  theme_bw()+
  theme(axis.text.x = element_text(size = 12, 
                                   color = 'black'),
        axis.text.y = element_text(size = 12, 
                                   color = 'black'),
        legend.text = element_text(size = 12,
                                   color = 'black'),
        legend.title = element_text(size = 12, 
                                    color = 'black'))+
  labs(x = "Component 1 (var explained = 24.3%)",
       y = "Component 2 (var explained = 19.9%)"
  )+
  ggtitle('PLS-DA scores plot')
scores.plot

Essentially, we obtain the same scores plot as for the caret package, but component no. 2 has the opposite sign of its scores:

NOTE: Remember that you can always change the sign of scores by multiplying the entire column with a component (all entries) by -1. The same applies to PCA components!

# Changing the sign of component no 2:
ggplot(scores, aes(x = comp1, y= -1*(comp2), color = Label)) + 
  geom_point(size = 3)+
  scale_color_manual(values = c('red2', 'royalblue'))+
  theme_bw()+
  theme(axis.text.x = element_text(size = 12, 
                                   color = 'black'),
        axis.text.y = element_text(size = 12, 
                                   color = 'black'),
        legend.text = element_text(size = 12,
                                   color = 'black'),
        legend.title = element_text(size = 12, 
                                    color = 'black'))+
  labs(x = "Component 1 (var explained = 24.3%)",
       y = "Component 2 (var explained = 19.9%)"
  )+
  ggtitle('PLS-DA scores plot')

The updated PLS-DA scores plot (after multiplying the column with component 2 by -1):

Loadings can be extracted through:

# Extracting loadings from the tidymodels PLS-DA model:
PLSDA.fit$fit$loadings$X

# Or if you want to store loadings for plotting:
loadings <- as_tibble(PLSDA.fit$fit$loadings$X)

From this step on, the procedure is the same as shown above for the caret package.

A simple variable importance plot can be obtained using package vip:

# Install package vip (note the quotation marks):
install.packages("vip")

# Call vip library:
library(vip)

# Create the VIP plot:
vip(PLSDA.fit$fit, num_features = 35)
