Partial Least Squares (PLS)
Metabolites and lipids multivariate statistical analysis in R
Last updated
Partial least squares regression is among the most frequently used statistical methods in metabolomics and lipidomics. In partial least squares, dimensionality reduction is performed first, and then, based on the set of newly obtained predictors, called PLS components, a linear regression equation is built to perform a defined task - classification or regression. The PLS components obtained as a result of the dimensionality reduction are linear combinations of our original variables, namely lipid or metabolite concentrations.
"Predictor" is a very useful word when discussing PLS models. It refers to the variables used to predict an outcome. For instance, if we use lipid concentrations to predict whether someone has a condition, the lipid concentrations are our predictors. We can also use PLS components for such predictions - then we will call them predictors, too. The word "predictors" is used here quite frequently - hence the reminder.
Most likely, you have already come (or you will soon come) across the abbreviation PLS-DA, which stands for partial least squares - discriminant analysis. It is one of the variants of partial least squares regression used for the classification of samples. PLS-DA can be trained to divide samples into biological groups based on the concentrations of metabolites or lipids measured in biological materials. Naturally, partial least squares can also be used for regression tasks, i.e., predicting a value, e.g., BMI, based on a set of measured lipid concentrations. The difference between PLS used for regression and classification is that the first relates predictors to a continuous output (e.g., predicting BMI values), and the second to a discrete output (e.g., predicting biological group labels).
As mentioned above, a dimensionality reduction is performed in the first step of preparing the PLS-DA model. This allows for representing complex data sets as two-dimensional, easily interpretable score plots. PLS also enables the selection of the most interesting variables.
We will use two (three) excellent libraries here to work with PLS - caret and tidymodels, together with mixOmics. We will mostly focus on the PLS-DA variant, which is frequently applied in lipidomics and metabolomics. Let's begin.
Before we begin preparing the PLS-DA model in R, take a look at the following real-life examples that utilize PLS in lipidomics and metabolomics:
Y. Huang et al. Lipid profiling identifies modifiable signatures of cardiometabolic risk in children and adolescents with obesity. DOI: - Fig. 3a & b (the authors of the study published in Nature Medicine use PLS-DA to analyze differences among three age groups within normal weight (a) and overweight/obesity categories (b); the ropls library was applied).
W.C. Wang et al. Metabolomics facilitates differential diagnosis in common inherited retinal degenerations by exploring their profiles of serum metabolites. DOI: (the authors of the study published in Nature Communications apply PLS-DA to explore the difference in metabolomic profile between IRDs and the control group).
J. von Gerichten et al. Single-Cell Untargeted Lipidomics Using Liquid Chromatography and Data-Dependent Acquisition after Live Cell Selection. DOI: (PLS-DA applied to investigate differences in lipidome coverage in cell-based studies (sample prep influence), including single-cell analysis).
Y. Wang & K.A. Lê Cao. PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data. DOI: (the authors of this study present a unique batch effect removal approach relying on PLS-DA, i.e., PLSDA-batch).
We will begin with the application of the caret library for training our PLS-DA. The caret package can be easily combined with other solutions, e.g., from tidymodels, making model building simple and understandable even for beginners. First, we must read our PDAC data set into R and adjust column types. It is very important that the response variables (output variables) are factors, not characters. Below, we will show you which steps, and in what order, should be performed to prepare your data for building the PLS-DA model. These involve:
1) Splitting the whole data set into training and testing sets (sometimes called validation sets).
A training set is used to estimate model parameters, i.e., to build an equation for classifying samples into biological groups based on lipid or metabolite concentrations.
A testing set is used to check how the model is performing, i.e., imagine a testing set as a delivery of new data to the model to check if it does a good job classifying samples into biological groups.
2) Transforming, centering, and scaling the data set.
These operations aim to decrease the influence of outliers on model performance and prevent highly abundant lipids or metabolites from dominating the model.
We will log-transform, center, and autoscale the data. The final data for creating models will have a mean value of 0 and a standard deviation of 1.
Here, the function prep() in the four-step recipe procedure for data transformation/scaling is crucial. As you can see, we compute all the parameters we need for the transformations using the training set only: in the prep() function, the training argument is given the 'train' data set. This is because we want to avoid using the testing data set for any procedure within model training. Parameters computed for data transformation/scaling with the testing set included may influence the final model performance and distort the model validation - we could get a wrong idea about the performance of our model. This is why, in machine learning, data transformation/scaling is performed after the data is split and relies only on statistical parameters computed from the 'train' data set.
Take a look at this code block:
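A minimal sketch of this split-then-preprocess workflow, assuming the PDAC data sit in a data frame named `pdac` with a factor column `group` (the object names and the 70/30 split proportion are assumptions for illustration):

```r
library(rsample)   # train/test splitting
library(recipes)   # preprocessing recipes

# Assumed: 'pdac' is a data frame with a factor column 'group' (N vs T)
# and numeric lipid concentration columns.
set.seed(123)
data_split <- initial_split(pdac, prop = 0.7, strata = group)
train <- training(data_split)
test  <- testing(data_split)

# Log-transform, center, and scale; prep() estimates every parameter
# (means, standard deviations) from the training set only.
pdac_recipe <- recipe(group ~ ., data = train) %>%
  step_log(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  prep(training = train)

# bake() then applies those training-set parameters to both sets:
train_prep <- bake(pdac_recipe, new_data = train)
test_prep  <- bake(pdac_recipe, new_data = test)
```

Note that `bake()` applied to the testing set reuses the means and standard deviations estimated from the training set, which is exactly the behavior discussed above.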
We can move to model training. Here, we will need two functions from the caret package:
trainControl() - will be used to create performance metrics across resamples,
train() - will be used to find the best fit PLS-DA model to the lipidomics data.
In trainControl(), we can specify the method for testing the performance of models ('repeatedcv') and the settings of this method (arguments: number, repeats, etc.).
One of the most frequently used approaches for estimating model performance is k-fold cross-validation. In this approach, the training data set is split into k blocks; k-1 of them are used for model training, while one is held out for testing the performance. The process is repeated until every held-out block has been used for testing, and the final performance is averaged across the k outcomes. In the extreme case, when a single observation is held out to test the performance, we talk about Leave-One-Out Cross-Validation (LOOCV). Another variant of this method is repeated k-fold cross-validation. Here, the data set is split into k blocks in n different ways, e.g., 5-fold cross-validation with five repetitions results in five different ways of splitting the data into five blocks. We will use repeated k-fold cross-validation to analyze the performance of our PLS-DA. Take a look at the StatQuest about k-fold cross-validation to understand this approach better:
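To build intuition for what repeated k-fold cross-validation does with the sample indices, here is a base-R sketch (caret's `trainControl(method = "repeatedcv")` handles this internally; the function below is purely illustrative):

```r
# Sketch of repeated k-fold fold assignment: each repeat reshuffles the
# sample indices and cuts them into k roughly equal folds.
make_repeated_folds <- function(n, k = 5, repeats = 5) {
  lapply(seq_len(repeats), function(r) {
    idx <- sample(n)                      # new shuffle per repeat
    split(idx, rep_len(seq_len(k), n))    # assign indices to k folds
  })
}

set.seed(42)
folds <- make_repeated_folds(n = 100, k = 5, repeats = 5)
length(folds)          # 5 repeats
lengths(folds[[1]])    # five folds of 20 samples each
```

Each fold in turn plays the role of the held-out block, and the performance is averaged over all k folds and all repeats.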
As we face a two-class problem (healthy individuals vs patients with PDAC), we will use the area under the receiver operating characteristic (ROC) curve as the metric of PLS-DA performance. ROC curves present the model's performance across the entire range of thresholds. The higher the area under the ROC curve (the closer to 1), the better the performance of our PLS-DA; conversely, the closer it is to 0.5, the worse the performance. More about ROC curves will be presented below, when we analyze the performance of PLS-DA on the lipidomics testing set.
Let's change all the text from above into code and find the best fit PLS-DA model for our lipidomics data set:
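A hedged sketch of how this could look with caret (the object names `train_prep` and `plsda_fit`, as well as the specific numbers of folds, repeats, and tuning candidates, are assumptions):

```r
library(caret)

# Repeated 5-fold cross-validation scored by ROC AUC; twoClassSummary()
# needs class probabilities, hence classProbs = TRUE.
ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Assumed: 'train_prep' is the preprocessed training set with the factor
# outcome 'group'.
set.seed(123)
plsda_fit <- train(group ~ .,
                   data = train_prep,
                   method = "pls",
                   metric = "ROC",     # select the model with the highest AUC
                   tuneLength = 10,    # test 1-10 PLS components
                   trControl = ctrl)
```

With `metric = "ROC"`, caret retains the number of components that maximizes the cross-validated AUC.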
Now, your best-fit model should be available. Let's check the final settings of our PLS equation. Run this simple line of code:
The plot:
The output:
If you want to check the data frame with performance metrics for each resample, run:
Let's now access the final model and extract interesting pieces of information, such as scores. Using scores, we can create a low-dimensional representation of our training data set. We expect a separation of biological groups in this low-dimensional representation, suggesting differences in lipid levels between our two biological groups. If the differences are pronounced enough, we would expect our PLS-DA to correctly classify samples from the testing set into healthy individuals and patients with PDAC.
Such analysis provides initial evidence that the set of predictors responsible for separating controls from cancer patients could be further tested regarding cancer biomarkers. However, finding predictors that differ between two conditions does not make them automatically biomarkers. Alterations in plasma/serum lipid levels are currently thoroughly tested for their potential application in supporting cancer diagnostics. It is a long process, which only begins with a discovery made using methods like PLS-DA.
Below, we will generate the low-dimensional representation of our train data set, a loadings plot with the weights contributing to the first two PLS components (as we hope to see a separation of N from T in the first two components), and plots representing the most "valuable" lipid species from the PLS-DA point of view.
Below, you will find the code for creating the scores plot:
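A hedged sketch of extracting the scores (`plsda_fit` is the assumed name of the `train()` result; caret's PLS classifier is built on the pls package, whose fitted object stores per-sample scores in its `scores` element):

```r
# Extract the final model and its per-sample scores:
final_model <- plsda_fit$finalModel
scores <- as.data.frame(unclass(final_model$scores))
colnames(scores) <- paste0("comp", seq_len(ncol(scores)))

# Add class labels (assumed training set name) for coloring the plot:
scores$group <- train_prep$group
head(scores)
```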
After running this line, we obtain:
Now, we need the total variance explained:
And we obtain:
Now, let's compute the percentage of variance explained by PLS components 1 and 2, so we can add it to the scores plot:
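One hedged way to get these percentages is `explvar()` from the pls package, which returns the percentage of X-variance explained by each component (`plsda_fit` is the assumed model name; the label strings are just a convenience for the axis titles):

```r
library(pls)

# Percentage of X-variance explained per PLS component:
explained <- explvar(plsda_fit$finalModel)

# Axis labels for the scores plot, e.g., "Component 1 (24.3%)":
pc1_lab <- paste0("Component 1 (", round(explained[1], 1), "%)")
pc2_lab <- paste0("Component 2 (", round(explained[2], 1), "%)")
```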
The scores plot code using ggplot2:
The plot:
We can similarly extract loadings and plot them with ggplot2:
Our scores plot and loadings plot together:
We can check lipid species that PLS-DA relies on when distinguishing between the serum of healthy volunteers and patients with pancreatic cancer. Usually, we observe a difference in levels of these lipids between healthy individuals and patients. Hence, these lipids may have interesting biological roles in pancreatic cancer. Such lipid species can be presented using variable importance plots (VIP). We will show you three ways to obtain such plots:
The simple VIP plots:
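A minimal sketch using caret's built-in importance ranking (`plsda_fit` is the assumed model name):

```r
# varImp() ranks predictors by their contribution to the fitted model;
# plot() displays the top-ranked lipid species.
importance <- varImp(plsda_fit)
plot(importance, top = 15)
```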
We can also prepare an elegant ggpubr lollipop chart:
The final plot:
Finally, we will use our model for predictions, i.e., we use PLS-DA to classify samples into biological groups based on lipid concentrations measured in serum samples. In effect, we hope to see the correct classification, meaning that the model will recognize all healthy as healthy and all patients with PDAC - as patients with PDAC. This would suggest that the condition (pancreatic cancer) causes specific alterations in serum lipid profile, which are significant enough to differentiate patients with PDAC from healthy volunteers.
However, there are two important questions. What's the factor based on which we decide about the classification and its cut-off value? How do we assess the correctness of classification?
Regarding the first question, we will ask the predict() function to create an output containing "probability-like" values ranging from 0 to 1. The default cut-off is 0.5: the class with the larger probability-like value is the class predicted by the model.
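A hedged sketch of this prediction step (`plsda_fit` and `test_prep` are assumed object names):

```r
# type = "prob" returns one probability-like column per class:
probs <- predict(plsda_fit, newdata = test_prep, type = "prob")
head(probs)

# Default 0.5 cut-off for the two-class case; with two classes this is
# equivalent to picking the class with the larger probability.
pred_class <- factor(ifelse(probs[["T"]] > 0.5, "T", "N"),
                     levels = c("N", "T"))
```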
To help us assess the correctness of classification, we need to introduce additional terms: accuracy, sensitivity, and specificity.
Accuracy informs you how often the model is correct in its predictions. To compute accuracy, we can use the following formula: Accuracy = (TP + TN) / (TP + TN + FP + FN), where:
TP - true positives, e.g., all correctly predicted patients with PDAC (T),
TN - true negatives, e.g., all correctly predicted healthy individuals (N),
FP - false positives, e.g., all incorrectly predicted patients with PDAC - in reality, they are healthy individuals (N),
FN - false negatives, e.g., all incorrectly predicted healthy individuals - in reality, they are patients with PDAC (T).
Sensitivity tells us how well our PLS-DA detects patients with PDAC (true positives = TP). It is calculated as: Sensitivity = TP / (TP + FN).
Specificity, in this case, will inform us about the correct detection of healthy individuals (true negatives = TN): Specificity = TN / (TN + FP).
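These three formulas can be verified with the confusion-matrix counts reported further below (TP = 27, FN = 6, TN = 29, FP = 1):

```r
# Counts from the test-set confusion matrix discussed below:
tp <- 27; fn <- 6; tn <- 29; fp <- 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # 56/63, ~0.889
sensitivity <- tp / (tp + fn)                    # 27/33, ~0.818
specificity <- tn / (tn + fp)                    # 29/30, ~0.967
```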
Usually, after the model performs the prediction of class labels, the results are presented in the so-called 'confusion matrix'. The figure below presents how the confusion matrix should be interpreted:
We again encourage you to watch the StatQuest by Josh Starmer about the interpretation of confusion matrices:
Let's use our PLS-DA for predictions and create a confusion matrix using the caret function confusionMatrix().
We obtain the following output:
Based on this result - we see that from all 33 PDAC patients in the testing set - 27 were correctly assigned by the model as PDAC patients (TP), and 6 were incorrectly classified as healthy individuals (FN). Hence, we detected ~82% of PDAC cases from our test set correctly using our PLS-DA model. From all 30 healthy individuals included, 29 were correctly classified as N (TN), and 1 incorrectly as T (FP), resulting in a specificity of ~97% (we detected correctly ~97% of controls from our test set). Overall, our PLS-DA model correctly classified ~89% of all samples from the test set (56 samples), and ~11% incorrectly (7 samples).
The remaining statistics are:
Pos Pred Value is the ratio of all correctly assigned T to all assigned T, namely = 27/28.
Neg Pred Value is the ratio of all correctly assigned N to all assigned N, namely = 29/35.
Prevalence indicates how many percent of all individuals in the test set are patients with PDAC, namely = 33/63.
Detection rate is the ratio of all TP detected by PLS-DA (all correctly assigned PDAC cases by PLS-DA) to the number of all individuals in the test set, namely = 27/63.
Detection prevalence is the ratio of all detected positive cases by PLS-DA (TP +FP) to the number of all individuals in the test set, namely = 28/63.
Balanced accuracy is simply the arithmetic mean of the sensitivity and specificity values, i.e., (0.8182 + 0.9667)/2 ≈ 0.8924.
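The same four counts reproduce all of the remaining confusionMatrix() statistics:

```r
# Counts from the test-set confusion matrix (TP = 27, FN = 6, TN = 29, FP = 1):
tp <- 27; fn <- 6; tn <- 29; fp <- 1
n_total <- tp + fn + tn + fp                    # 63 test samples

ppv <- tp / (tp + fp)                           # Pos Pred Value = 27/28
npv <- tn / (tn + fn)                           # Neg Pred Value = 29/35
prevalence           <- (tp + fn) / n_total     # 33/63
detection_rate       <- tp / n_total            # 27/63
detection_prevalence <- (tp + fp) / n_total     # 28/63
balanced_accuracy <- (tp / (tp + fn) + tn / (tn + fp)) / 2
round(balanced_accuracy, 4)                     # 0.8924
```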
Our confusion matrix also presents 95% confidence intervals (CI) for accuracy ranging between 0.7844 and 0.9541 (a reasonable range considering 63 samples in the test set). We see that the no information rate is at 0.5238, and according to the statistical test, our accuracy is much higher than the no information rate.
If you would like to avoid presenting confusion matrices in your manuscript, the accuracy, sensitivity, and specificity can be turned into bar charts, e.g., using ggpubr or ggplot2 libraries.
Remember that the confusion matrix shows results at a specific threshold (cut-off) of 0.5 probability. However, in some cases, it may be beneficial to select a different cut-off, e.g., if we want to perform a stricter classification, we will increase the threshold, while if we consider 0.5 too strict, we can lower it. Analyzing every single confusion matrix obtained for different thresholds would be time-consuming. Furthermore, it is definitely not effective to present 5 or more confusion matrices in a manuscript with your metabolomics or lipidomics data. The solution here is the receiver operating characteristic (ROC) curve: the sensitivity and specificity obtained across a broad range of thresholds are plotted against each other. In summary, a ROC curve is used to evaluate the ability of a metabolite (or a model) to distinguish between a condition and a control by plotting sensitivity against 1-specificity at various threshold levels.
Look at StatQuest to understand how ROC curves are created and interpreted:
ROC curves are broadly used in lipidomics and metabolomics to present either classifier performance across a broad range of thresholds, e.g., (O)PLS-DA models, decision trees or random forests, or logistic regression classifiers. Alternatively, a strongly differentiating metabolite (or biomarker) can often be represented using the ROC curve, illustrating its effectiveness in distinguishing patients from healthy controls based on its concentrations in collected biological samples. If a metabolite's AUC reaches 1, its levels in all patient-derived samples are consistently higher (or lower) than those in healthy controls, achieving perfect separation between the groups.
Refer to the following sources for practical applications of ROC curves in metabolomics and lipidomics studies:
Before we create the ROC curve, it is essential to understand that:
1) The higher the area under the curve (AUC), the better our PLS-DA performs, e.g., a PLS-DA with an AUC of 0.7 is a worse classifier than a PLS-DA with an AUC of 0.95. Also, if in the future you test more than one type of model for the classification of your samples, the one with the highest AUC should be your choice.
2) The closer the AUC is to 0.5, the worse the performance of our PLS-DA in general. An AUC of 0.5 tells us that our PLS-DA basically does not work - its predictions are as random as a coin flip.
3) Using ROC, you can find the best threshold. The 'optimal point' is usually defined as the one with an x-coordinate (1-specificity) as close to 0 as possible and a y-coordinate (sensitivity) as close to 1 as possible. However, any threshold corresponding to a point close to the (0, 1) corner may be a good choice.
Let's create the ROC curve for our PLS-DA model. Here, we will use a package called pROC.
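A hedged sketch of this step (`probs` holding the test-set class probabilities and `test_prep$group` holding the true labels are assumed object names):

```r
library(pROC)

# Build the ROC object; 'levels' names the control class first and the
# case class second.
roc_obj <- roc(response = test_prep$group,
               predictor = probs[["T"]],
               levels = c("N", "T"))

auc(roc_obj)      # area under the curve
ci.auc(roc_obj)   # 95% confidence interval for the AUC

# legacy.axes = TRUE plots 1-specificity on the x-axis:
plot(roc_obj, print.auc = TRUE, legacy.axes = TRUE)
```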
After running the code, we obtain in the console:
And the ROC curve for our test set:
With an AUC of 0.931, we see that our PLS-DA performs well when classifying samples into healthy volunteers and cancer cases. We can also check if, for the threshold of 0.5, we obtain the same sensitivity and specificity as computed for our confusion matrix:
Sometimes, manuscript authors present the model's performance based on the classification of samples from the training and testing sets together. We can also prepare such a ROC curve in R:
The output:
Or we can plot two ROC curves next to each other:
The output:
You can also estimate the 95% confidence intervals (95% CI) for AUC values by running these lines of code:
The output:
An alternative to the caret package is to fit the PLS-DA model to our lipidomics data using the tidymodels collection. The purpose of tidymodels is to unify the whole workflow of model preparation. Depending on the package you select for preparing your model, this process can look entirely different, even though the outcome is the same model. tidymodels aims to change this situation. Here, we will train the PLS-DA from mixOmics through a unified tidymodels pipeline.
First, we need to install the missing libraries. Besides tidymodels, we will need plsmod, and we will additionally install the mixOmics package from Bioconductor:
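The installation could look like this (mixOmics is distributed via Bioconductor, hence the BiocManager step):

```r
# CRAN packages:
install.packages(c("tidymodels", "plsmod"))

# mixOmics lives on Bioconductor, so it is installed via BiocManager:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("mixOmics")
```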
We begin with tidymodels packages rsample and recipes to prepare our data set:
Now that we have all the necessary data sets, we can tune our model:
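A hedged sketch of the tuning step (the names `pdac_recipe`, `train`, and the 1-10 component grid are assumptions; the recipe added to the workflow should be an untrained one, i.e., defined without calling `prep()`):

```r
library(tidymodels)
library(plsmod)   # parsnip bindings for mixOmics PLS models

# PLS-DA specification with the number of components left to be tuned:
pls_spec <- pls(num_comp = tune()) %>%
  set_engine("mixOmics") %>%
  set_mode("classification")

# Bundle the (untrained) preprocessing recipe and the model:
pls_wf <- workflow() %>%
  add_recipe(pdac_recipe) %>%
  add_model(pls_spec)

# 5-fold cross-validation on the training set, scored by ROC AUC:
set.seed(123)
folds <- vfold_cv(train, v = 5, strata = group)

tune_res <- tune_grid(pls_wf,
                      resamples = folds,
                      grid = tibble(num_comp = 1:10),
                      metrics = metric_set(roc_auc))
collect_metrics(tune_res)
```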
The output:
We conclude that the model built based on three components has the highest AUC value. If you prefer to visualize this table, take a look at the code below and the output:
The plot:
We again see that the highest area under the ROC curve was obtained for three components. Hence, in the final fit, we will set num_comp to three. We define the final model and fit it:
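A minimal sketch of finalizing and fitting (`pls_wf` and `train` are assumed names from the tuning step):

```r
# Fix num_comp at three and refit on the full training set:
final_wf  <- finalize_workflow(pls_wf, tibble(num_comp = 3))
final_fit <- fit(final_wf, data = train)
```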
Finally, we can use our PLS-DA model for predictions:
Let's create ROC curves for the mixOmics model trained through tidymodels:
We obtain the familiar ROC curves:
In summary, we obtained the same model as via caret library, but this time through the unified tidymodels pipeline.
What about scores or loadings plots if we use tidymodels for preparing our PLS-DA? If you would like to recover the scores for a scores plot, you can run the code presented below. After the scores are extracted, you proceed the same way as for the caret package.
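A hedged sketch (`final_fit` is the assumed fitted workflow; a fitted mixOmics PLS-DA object stores sample scores in its `variates$X` element):

```r
# Pull the underlying mixOmics model out of the tidymodels workflow:
mix_fit <- extract_fit_engine(final_fit)

# Per-sample scores, one column per PLS component:
scores <- as.data.frame(mix_fit$variates$X)
head(scores)
```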
Essentially, we obtain the same scores plot as for the caret package, but component 2 has the opposite sign of scores:
NOTE: Remember that you can always flip the sign of scores by multiplying the entire column of a component (all entries) by -1. The same applies to PCA components!
The updated PLS-DA scores:
Loadings can be extracted through:
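Analogously to the scores, the mixOmics model keeps the variable loadings in its `loadings$X` element (`final_fit` is the assumed fitted workflow):

```r
# One row per lipid species, one column per PLS component:
loadings <- as.data.frame(extract_fit_engine(final_fit)$loadings$X)
head(loadings)
```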
From this step on, the procedure is the same as shown above for the caret package.
A simple variable importance plot can be obtained using package vip:
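A hedged sketch (`final_fit` is the assumed fitted workflow; `vip()` is applied here to the parsnip model object extracted from it):

```r
library(vip)

# Plot the top-ranked predictors of the fitted PLS-DA model:
vip(extract_fit_parsnip(final_fit), num_features = 15)
```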
Y. Zhao et al. NMR and MS reveal characteristic metabolome atlas and optimize esophageal squamous cell carcinoma early detection. DOI: - Fig. 6 & 7 (multiple ROC curves presented in both figures; here, e.g., the authors are comparing ROC curves relying on metabolites (potential biomarkers) for esophageal squamous cell carcinoma detection).
S. Salihovic et al. Identification and validation of a blood- based diagnostic lipidomic signature of pediatric inflammatory bowel disease. DOI: - Fig. 4B (diagnostic prediction of pediatric inflammatory bowel disease by lipidomic markers).
M. Amrutkar et al. Global metabolomic profiling of tumor tissue and paired serum samples to identify biomarkers for response to neoadjuvant FOLFIRINOX treatment of human pancreatic cancer. DOI: - Fig. 6 (comparing multiple ROC curves based on metabolites; application: identification of potential markers of treatment response in PDAC).
D. Wolrab et al. Lipidomic profiling of human serum enables detection of pancreatic cancer. DOI: - Fig. 3 e, f; Fig. 4g, h, i, j; Fig. 7a, g; (ROC curves presenting classification performance of OPLS-DA models relying on serum lipid concentrations for the detection of PDAC).
D. Wolrab et al. Plasma lipidomic profiles of kidney, breast and prostate cancer patients differ from healthy controls. DOI: Fig. 3I, J, K, L; Fig. 5 A-L; (ROC curves illustrate the classification performance of OPLS-DA models trained on plasma lipidomics data to differentiate between patients with kidney, breast, and prostate cancer and healthy controls).