The Orthogonal Partial Least Squares (OPLS) uses orthogonal signal correction (OSC) to maximize the explained covariance between predictors (X) and responses (Y) on the first latent variable (LV). The orthogonal components cover all variance that is unrelated to Y.
Practical applications of OPLS-DA(examples)
The OPLS is a handy tool for lipidomics and metabolomics scientists. OPLS, similarly to PLS, produces a low-dimensional representation of large data sets for a simple investigation of differences/similarities between biological groups. Usually, a clear separation of biological groups is observed along the x-axis if the differences in lipidome or metabolome are significant enough. OPLS-DA is a good feature selection tool. The OPLS-DA for discriminant analysis (classification of samples based on metabolome/lipidome) is frequently used in the field. It can also be applied to predict clinical parameters based on lipid or metabolite concentrations.
Check out the following applications of OPLS-DA:
D. Wolrab et al. Lipidomic profiling of human serum enables detection of pancreatic cancer. DOI: (the authors of the study published in Nature Communications utilize OPLS-DA to differentiate between patients with pancreatic cancer and healthy controls; additionally, OPLS-DA is used for sample classification and then further employed to identify the most distinguishing lipids between these groups).
D. Wolrab et al. Plasma lipidomic profiles of kidney, breast and prostate cancer patients differ from healthy controls. DOI: (classification of patients with kidney, breast, and prostate cancer based on plasma lipidomes; selection of distinguishing features).
A. Kvasnička et al. Alterations in lipidome profiles distinguish early-onset hyperuricemia, gout, and the effect of urate-lowering treatment. DOI: (analysis of differences between different groups of patients and healthy controls based on plasma lipidome, sample classification, analysis of distinguishing features).
D. Olešová et al. Changes in lipid metabolism track with the progression of neurofibrillary pathology in tauopathies. DOI: (the authors use OPLS-DA to analyze differences (separation) of experimental groups, e.g., based on cerebrospinal fluid lipidome - exploring global effects of brain aberrant metabolism on the composition of CSF).
J. Idkowiak et al. Robust and high-throughput lipidomic quantitation of human blood samples using flow injection analysis with tandem mass spectrometry for clinical use. DOI: (OPLS-DA is used for the analysis of differences between patients with pancreatic cancer and healthy controls, sample classification, selection of features, and comparison of outcomes across different analytical approaches to lipidome analysis).
Training OPLS in R
The R implementation of OPLS is the ropls package. We will briefly introduce it here and prepare the OPLS-DA model to classify samples from our PDAC data set. Let's begin with the installation of ropls from the Bioconductor.
# Installation of ropls package from Bioconductor.
# Highlight and run this code:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ropls")
# Calling the library:
library(ropls)
# We will also need function from the tidymodels library:
library(tidymodels)
The opls() function fully automates the preparation of the OPLS-DA() in R. Take a look at the code below:
# First, browse the vignette of ropls package:
browseVignettes("ropls")
# Read data in R:
data <- readxl::read_xlsx(file.choose())
# Before we start with the model - we need to prepare our data.
# Adjust variable types
data$`Sample Name` <- as.character(data$`Sample Name`)
data$Label <- as.factor(data$Label)
# Remove the patients with pancreatitis (PAN):
data.no.PAN <-
data %>%
filter(Label != "PAN") %>%
droplevels()
# The opls() will expect one complete data frame.
# In opls() function, we subset a part of it for training and keep the rest for testing of OPLS-DA.
# We need to shuffle our data first:
set.seed(1111)
shuffled_data <- data.no.PAN[sample(1:nrow(data.no.PAN)),]
# However, there still may be a strong disproportion of N to T if we just cut a number of rows.
# Tidymodels solution from rsample can handle it
# We split the data into training and testing and combine them back into one tibble.
# This way - we keep good proportions of N to T for training our OPLS-DA:
set.seed(1111)
data_split <- initial_split(shuffled_data, prop = .7, strata = "Label")
train <- training(data_split)
test <- testing(data_split)
data.OPLS <- bind_rows(train,test)
# Now, we are ready to create our OPLS-DA.
# Read the documentation of the opls() function:
?opls()
# Prepare data for OPLS-DA:
X <-
data.OPLS %>%
select(-Label)
# We will name rows of X using `Sample Name` column.
# The row names are used for annotating samples in the OPLS scores plot.
X <- column_to_rownames(X, "Sample Name")
# Matrix of responses:
Y <- as.matrix(data.OPLS$Label)
# Finally, IN ONE LINE OF CODE, we build and tune OPLS-DA model:
OPLSDA <- opls(X, # Our predictors
Y, # Our responses
orthoI = NA, # By setting orthoI to NA, we force algorithm to optimize the number of orthogonal components.
subset = 1:143, # These rows are used as the OPLS-DA training set.
algoC = "nipals", # We want the NIPALS algorithm for computing our OPLS-DA
crossvalI = 7, # We use 7-fold cross-validation for testing the performance while tuning.
log10L = TRUE, # We want to log10-transform our data.
scaleC = 'standard') # We also want the auto-scaling.
After we run the last line, we obtain the following outputs from tuning:
> OPLSDA <- opls(X,
+ Y,
+ orthoI = NA,
+ subset = 1:143,
+ algoC = "nipals",
+ crossvalI = 7,
+ log10L = TRUE,
+ scaleC = 'standard')
OPLS-DA
143 samples x 127 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE RMSEP pre ort
Total 0.535 0.627 0.544 0.309 0.333 1 2
Warning messages:
1: 'permI' set to 0 because train/test partition is selected
2: OPLS: number of predictive components ('predI' argument) set to 1
And the visual output presents scores plot, loadings plot, observation diagnostics plot, and inertia barplot:
1) Scores plot (lower left corner) presents the training data set after the dimensionality reduction (we clearly see the separation of healthy controls from PDAC patients).
2) Loadings plot (lower right corner) presents the most important lipids for OPLS components; for example, the three most characteristic lipids for N are SM 39:1;O2, PC 37:2, and LPC 18:2, while for T -
Cer 34:1;O2, Cer 36:1;O2, TG 56:5.
3) Inertia barplots (upper left corner) show that one OPLS component with two orthogonal components may be enough to capture most of the inertia (variance);
4) Observation diagnostics - presents samples that can potentially bias the OPLS-DA computation.
We can now perform predictions using our OPLS-DA. The ropls function opls() creates a specific type of R object called S4 R object. To access elements of this object, we use the @ symbol. To perform predictions, we need to run the following line of code:
# Predictions.
# We can obtain the probabilities for the test set by running:
test_pred.vn <- as.numeric(OPLSDA@suppLs[["yTesMN"]])
# We will create a tibble with these results:
test_pred.vn <- as_tibble(test_pred.vn)
# And now, we add the predicted label using ifelse() function:
test_pred.vn$predicted <- ifelse(test_pred.vn$value > 0.5, "T", "N")
# Now, we need the actual labels to build the ROC curve, for example:
test_true.vi <- as_tibble(data.OPLS[144:206, "Label"]) # We extract labels by specifying row numbers in column "Label".
test_true.vi <- ifelse(test_true.vi$Label == 'N', '0', '1') # We change N to 0 and T to 1.
# Let's prepare the ROC curve for this model.
# Call library pROC:
library(pROC)
# We need to prepare a plotting space for our ROC by running:
par(pty = "s")
# Create ROC:
test.roc <- roc(test_true.vi~test_pred.vn$value)
# Plot ROC:
plot(test.roc, print.auc = TRUE)
The ROC curve for our OPLS-DA:
As you see, the performance of the OPLS-DA model prepared using ropls library is even better than the PLS-DA model we built in the previous subchapters using caret and tidymodels with mixOmics. Remember that in OPLS-DA, we focus on presenting the crucial variance in the first component while separating all the noise in the orthogonal component(s). Using caret function confusionMatrix(), we can also compute the confusion matrix:
# Confusion matrix for OPLS-DA through caret.
# Call library:
library(caret)
# Merge tibbles with predicted labels and reference labels:
test_pred.vn <-
test_pred.vn %>%
mutate(data.OPLS[144:206, "Label"])
# Obtain the confusion matrix:
confusionMatrix(as.factor(test_pred.vn$predicted), as.factor(test_pred.vn$Label))
The output:
We see that our OPLS-DA deals quite well with the classification of patient samples at a threshold of 0.5 (sensitivity of ~97%). On the other hand, the specificity is around ~85%, meaning that our OPLS-DA is a bit worse in classifying healthy control samples. The overall accuracy is 90.5% with relatively narrow 95% confidence intervals, and it is significantly above the no information rate. The model correctly distinguishes healthy individuals (controls) from patients with PDAC based on the concentrations of lipids in serum.
The low-dimensional representation of OPLS-DA was presented above - in the summary visualization with OPLS-DA fit results. In the scores plot, we saw the separation of patients with PDAC from healthy volunteers, suggesting differences between N and T groups in serum lipid profiles. Below, we will also create the loadings and VIP plots to investigate the most important lipid species contributing to the separation of controls from cancer patients - according to the OPLS-DA model. We will also show you how you can prepare S-plot.
Now, let's extract scores and loadings from the 'OPLSDA' object and prepare elegant publication-ready plots using ggplot2 function:
# Extracting scores and loadings from the OPLSDA model:
# THE OPLSDA is R - S4 object - to extract data from the object use '@', e.g.:
# Scores plot via ggplot2:
scores.tp1 <- as_tibble(OPLSDA@scoreMN) # Extract scores of component 1,
scores.to1 <- as_tibble(OPLSDA@orthoScoreMN) # Extract scores of orthogonal component,
scores.tp1$id <- row.names(scores.tp1) # Add 'id' column to scores of component 1 for merging,
scores.to1$id <- row.names(scores.to1) # Add 'id' column to scores of orthogonal component for merging,
scores <- full_join(scores.tp1, scores.to1, by = 'id') # Combine both scores into one tibble,
Label <- data.OPLS[1:143, "Label"] # Select labels from column 'Label' from rows 1 to 143.
scores$Label <- Label$Label # Add 'Label' column to scores.
scores <- mutate(scores, Label = forcats::fct_relevel(Label, "T")) # Relevel so T is target group.
# Similarly for loadings:
loadings <- as_tibble(OPLSDA@loadingMN, rownames = NA) # Keep the row names in the new tibble.
Lipids <- rownames(loadings) # Extract row names
loadings$Lipids <- Lipids # Add the column Lipids to loadings
# And orthogonal loadings:
loadings.ortho <- as_tibble(OPLSDA@orthoLoadingMN, rownames = NA)
Lipids <- rownames(loadings.ortho)
loadings.ortho$Lipids <- Lipids
# Merging loadings into one data frame:
loadings.ready <- full_join(loadings.ortho,loadings, by = "Lipids")
loadings.ready
# Variance explained by every component
OPLSDA@modelDF # The parameters of interest are stored in the first column of this table
# Create scores plot for OPLS-DA (similarly to PLS-DA) - PLOT A:
scores.plot <-
ggplot(scores, aes(x = p1, y= o1)) +
geom_point(aes(color = Label), size = 4)+
scale_color_manual(values = c('red2', 'royalblue'))+
theme_bw()+
theme(axis.text.x = element_text(size = 14,
color = 'black'),
axis.text.y = element_text(size = 14,
color = 'black'),
axis.title.x = element_text(size = 14,
color = 'black'),
axis.title.y = element_text(size = 14,
color = 'black'),
legend.text = element_text(size = 12,
color = 'black'),
legend.title = element_text(size = 12,
color = 'black'),
title = element_text(size = 14,
color = 'black'))+
labs(x = expression('T' [p1] ~ '(R2X p1 = 12.7%' * ')'),
y = expression('T' [o1] ~ '(R2X o1 = 24.1%' * ')'))+
ggtitle('OPLS-DA scores plot')
# Scores plot is stored in:
scores.plot
# Create loadings plot for OPLS-DA (similarly to PLS-DA) - PLOT B:
loadings.plot <-
ggplot(loadings.ready, aes(x = p1, y= o1)) +
geom_point(size = 3)+
ggrepel::geom_text_repel(aes(label = Lipids),
size = 4,
min.segment.length = unit(0.3, "lines"),
max.overlaps = getOption("ggrepel.max.overlaps",
default = 8)) + # You may need to adjust default.
theme_bw()+
theme(axis.text.x = element_text(size = 14,
color = 'black'),
axis.text.y = element_text(size = 14,
color = 'black'),
axis.title.x = element_text(size = 14,
color = 'black'),
axis.title.y = element_text(size = 14,
color = 'black'),
legend.text = element_text(size = 12,
color = 'black'),
legend.title = element_text(size = 12,
color = 'black'),
title = element_text(size = 14,
color = 'black'))+
labs(x = "Loadings for LV1",
y = "Loadings for oLV"
)+
ggtitle('OPLS-DA loadings')
# Loadings plot is stored in:
loadings.plot
# Merge both plots together:
ggpubr::ggarrange(scores.plot, loadings.plot)
Both plots together:
VIP plot can be prepared after extracting variable importance scores through getVipVn():
# VIP plot for OPLS-DA.
# Extracting VIP scores and storing as 'variable.importance' tibble:
variable.importance <- as_tibble(getVipVn(OPLSDA), rownames = NA)
# Adding lipid names to the 'variable.importance' tibble:
Lipids <- rownames(variable.importance)
variable.importance$Lipids <- Lipids
# Slicing head of the variable.importance tibble before plotting (35 most important lipids):
variable.importance <-
variable.importance %>%
arrange(desc(value)) %>%
slice_head(n = 35)
# Creating lollipop-VIP-chart through ggpubr library:
library(ggpubr)
ggdotchart(variable.importance, # Tibble with importance of lipids,
x = "Lipids", # Lipid names,
y = "value", # Importance scores,
sorting = "descending", # Descending sorting of importance,
color = "value", # Continuous color scaling using importance,
add = "segments", # Add grey lines to dots (lollipop chart),
rotate = TRUE, # Rotate plot (switch x and y),
dot.size = 4, # Size of dots,
ggtheme = theme_pubr(), # Plot theme,
title = 'VIP plot for OPLS-DA Controls vs PDAC \n (via ggpubr)', # Plot title. Using \n you can start a new line.
ylab = "Importance", # y-axis title,
xlab = "Lipid species", # x-axis title,
legend = "right", # legend on the right side,
legend.title = "Importance") + # Title of the legend,
scale_color_gradient(low = "blue", high = "red2")
The plot:
Finally, next to the VIP plot, the S-plot is a frequently used chart to present the most important features according to the OPLS model. S-plot presents Pearson correlations between OPLS scores and normalized lipid concentrations in relation to covariance between OPLS scores and normalized lipid concentrations.If data are Pareto-scaled, then usually a characteristic "S" appears in the middle of the chart, while autoscaling of data results in a straight line. Take a look at the code block below and the obtained S-plot:
# Preparing S-plot in R.
# Step 1 - Extract scores from the model and store them as tibble 'scores':
scores <- as_tibble(OPLSDA@scoreMN)
# Step 2 - Finish the data transformation/scaling through recipes:
library(tidymodels)
# Create a recipe
rec <- recipe(Label ~ ., data = train)
rec
# Select pre-processing methods:
recipe <-
rec %>%
step_log(all_numeric(), base = 10) %>%
step_normalize(all_numeric())
# Compute all parameters for transformation/normalization:
recipe.prepared <- prep(recipe, training = train)
recipe.prepared
# Bake the recipe - training
train.transformed <- bake(recipe.prepared, new_data = train)
train.transformed
# Relevel labels training
train.transformed <- mutate(train.transformed, Label = forcats::fct_relevel(Label, "T"))
levels(train.transformed$Label)
# Bake the recipe - testing
test.transformed <- bake(recipe.prepared, new_data = test)
test.transformed
# Relevel labels testing
test.transformed <- mutate(test.transformed, Label = forcats::fct_relevel(Label, "T"))
levels(test.transformed$Label)
test.transformed$`Sample Name` <- test$`Sample Name`
# Bind train and test-transformed data set into one matrix:
data.transformed <- bind_rows(train.transformed, test.transformed)
# Step 3 - Compute covariance between scores and normalized concentrations:
pcov <- c()
for (i in 2:128) {
cov <- cov(scores$p1, data.transformed[1:143,i])
pcov <- matrix(c(pcov, cov), ncol = 1)
}
# Change the pcov object into a tibble and add a column with lipid species:
pcov <- as_tibble(pcov)
pcov$Lipids <- colnames(data.transformed)[-c(1,129)]
pcov
# Step 4 - Compute Pearson correlations between scores and normalized concentrations:
pcorr <- c()
for (i in 2:128) {
cor <- cor(scores$p1, data.transformed[1:143,i])
pcorr <- matrix(c(pcorr, cor), ncol = 1)
}
# Change the pcorr object into a tibble and add a column with lipid species:
pcorr <- as_tibble(pcorr)
pcorr$Lipids <- colnames(data.transformed)[-c(1,129)]
pcorr
# Step 5 - Create a tibble for plotting by merging pcov and pcorr using `Lipids` column:
splot <- full_join(pcov, pcorr, by = "Lipids")
# Step 6 - Create your S-plot. We will use the interactive chart from plotly:
library(plotly)
plot_ly(x = splot$V1.x,
y = splot$V1.y,
text = splot$Lipids,
mode = "markers",
type = "scatter") %>%
layout(title = "S-plot OPLS-DA: Healthy controls vs Patients with PDAC")
The plot:
Below, you will find one more S-plot - for Pareto-scaled data and the updated S-plot code:
# Use Pareto-scaling before creating this plot!
# S-plot code (with colors according to score in the S-plot):
# Create sub-groups for colors:
splot["group"] <- "xyz"
splot[which(splot['V1.x'] < 0 & splot['V1.x'] > -0.5),"group"] <- "Insignificant"
splot[which(splot['V1.x'] > 0 & splot['V1.x'] < 0.5 ),"group"] <- "Insignificant"
splot[which(splot['V1.x'] > 0.5),"group"] <- "Upregulated in T"
splot[which(splot['V1.x'] < -0.5),"group"] <- "Downregulated in T"
# Select colors for groups:
colors <- c('grey', 'red2','royalblue')
colors <- setNames(colors, c("Insignificant", "Upregulated in T", "Downregulated in T"))
# The updated S-plot:
plot_ly(x = splot$V1.x,
y = splot$V1.y,
text = splot$Lipids,
mode = "markers",
type = "scatter",
color = splot$group,
colors = colors) %>%
layout(title = "S-plot OPLS-DA: Healthy controls vs Patients with PDAC")
The S-plot if the data set is Pareto-scaled (forming the S-like shape):
More about the possibilities offered by the ropls() package you can find in the detailed vignette prepared by the authors: