
PLS Discriminant Analysis


Required packages

The required packages for this section are pandas, seaborn, and scikit-learn. They can be installed with the following command in the command prompt (Windows) or terminal (macOS):

pip install pandas seaborn scikit-learn

Loading the data

We will again use the demo lipidomics dataset (Lipidomics_dataset.xlsx). Load it into a Pandas DataFrame named df, as described in the basic plotting section:

import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)
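As a quick, optional sanity check (not part of the original workflow), we can confirm that the dataset loaded correctly by inspecting its dimensions and the first few rows:

# Optional check: number of samples/columns and a preview of the data
print(df.shape)
print(df.head())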

PLS-DA

Data normalisation

The first step is to normalise the data so that the features (lipids) have zero mean and unit variance. This can easily be done with the StandardScaler from sklearn. By indexing the dataframe with df.iloc[:,1:], we select all the data in the dataframe except for the first column, which contains the labels:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
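If you want to verify that the scaling worked, the column means of df_normalised should be approximately zero and the standard deviations approximately one. This is an optional check, not required for the rest of the workflow:

import numpy as np

# Optional check: per-feature mean ~ 0 and standard deviation ~ 1 after scaling
print(np.round(df_normalised.mean(axis=0), 3))
print(np.round(df_normalised.std(axis=0), 3))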

Label encoding

The PLS algorithm was developed for continuous regression problems; in PLS-DA, we use it with a response vector y that is an integer representation of the group labels. To encode the labels as integers, we can use the LabelEncoder from sklearn:

from sklearn.preprocessing import LabelEncoder  

le = LabelEncoder()
encoded_labels = le.fit_transform(df.Label)
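To see which integer was assigned to each group (an optional check), the fitted encoder stores the group names in le.classes_; the position of each name corresponds to its integer code:

# Optional check: the label at position i in le.classes_ is encoded as integer i
print(le.classes_)
print(encoded_labels[:5])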

PLS-DA

Next, we'll use the PLSRegression algorithm from sklearn, and we select 2 components:

from sklearn.cross_decomposition import PLSRegression

plsda = PLSRegression(n_components=2)
plsda_features = plsda.fit_transform(df_normalised, encoded_labels)[0]

We put the results in a DataFrame, together with the group labels:

plsda_features = pd.DataFrame(data=plsda_features, 
                       columns=["Latent variable 1","Latent variable 2"],
                       index=df.index)
plsda_features["Label"] = df.Label

Now we can plot the results:

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='Latent variable 1',
       y='Latent variable 2',
       data=plsda_features,
       hue="Label");
       
plt.show()

Loadings

The loadings of PLS-DA can be calculated as the Pearson correlation between the original centred and scaled data and the components T (x-scores):

import numpy as np

x_loadings = [np.corrcoef(df_normalised[:,y], plsda.x_scores_[:,0])[0][1] for y in range(df_normalised.shape[1])]
y_loadings = [np.corrcoef(df_normalised[:,y], plsda.x_scores_[:,1])[0][1] for y in range(df_normalised.shape[1])]

We'll put the calculated loadings in a DataFrame:

plsda_loadings = pd.DataFrame(
    data={"Latent variable 1": x_loadings, "Latent variable 2": y_loadings},
    index=df.columns[1:]
)
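Before plotting, it can be helpful to list the lipids that correlate most strongly with each latent variable. This optional step simply sorts the loadings by absolute value:

# Optional: top 10 lipids by absolute loading on each latent variable
print(plsda_loadings["Latent variable 1"].abs().sort_values(ascending=False).head(10))
print(plsda_loadings["Latent variable 2"].abs().sort_values(ascending=False).head(10))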

And to visualise the loadings:

fig, ax = plt.subplots(figsize=(15,15))

p1 = sns.scatterplot(x='Latent variable 1',
       y='Latent variable 2',
       data=plsda_loadings,
       size = 15,
       ax=ax,
       legend=False)  

for line in range(plsda_loadings.shape[0]):
    p1.text(plsda_loadings["Latent variable 1"].iloc[line] + 0.005,
            plsda_loadings["Latent variable 2"].iloc[line],
            plsda_loadings.index[line],
            horizontalalignment='left',
            size='small', color='black')
plt.show();

Making predictions with PLS-DA

Once a PLS model is constructed, it can be used to predict the labels of new samples (the labels are N, PAN or T in our demo dataset, but for this example we'll restrict the model to N and T samples). Since we don't have an independent dataset, we'll set aside some randomly chosen samples before training the model: the training set will be used to construct the PLS-DA model, and the test set will be used to evaluate its performance. Since we are now interested in getting a model that makes accurate predictions, rather than simply visualising the data as in the previous section, we don't have to limit ourselves to just 2 components. Instead, we'll perform a cross-validation experiment to find the number of components that gives the best predictive performance. These concepts are explained in more detail in the R section on PLS-DA.

First, we'll limit our dataset to N and T samples, and we'll store the labels of those samples in a separate variable:

df_normalised_T_N = df_normalised[(df.Label == "T") | (df.Label == "N")]
labels = df[(df.Label == "T") | (df.Label == "N")]["Label"]

For the PLS to work, we'll have to convert the categorical outcome variable (the labels T and N) to zeros and ones:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded_labels = le.fit_transform(labels)

Next, we'll split the dataset and labels into a training set (70% of the samples) and test set (30% of the samples). We use the stratify option to ensure that the training and test sets have the same proportion of T and N samples:

from sklearn.model_selection import train_test_split

# Split Data into Train (70%) and Test (30%)
DataTrain, DataTest, YTrain, YTest = train_test_split(
                                df_normalised_T_N, 
                                encoded_labels, 
                                test_size=3/10, 
                                stratify=encoded_labels)
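As an optional sanity check (not part of the original workflow), we can confirm that stratification preserved the class balance by counting the samples of each class in the training and test labels:

import numpy as np

# Optional check: counts of the two encoded classes in the training and test sets
print(np.bincount(YTrain))
print(np.bincount(YTest))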

Next, we'll perform a 5-fold cross-validation (cv=5) to find the optimal number of components (we'll test a range of 1 to 6):

from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV

# Parameter Dictionary
param_dict = {'n_components': [1, 2, 3, 4, 5, 6]}

# initialize the hyperparameter search
plsda = PLSRegression()
grid_search = GridSearchCV(estimator=plsda,
                           param_grid=param_dict,
                           cv=5)
# Fit the model
grid_search.fit(DataTrain, YTrain)

# Print the best hyperparameters and the corresponding mean cross-validated score
print("Scores: ",grid_search.cv_results_["mean_test_score"])
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

We observe the following output (the values may vary slightly due to the random nature of splitting the samples into training and test sets):

Scores:  [0.35284577 0.45803027 0.47096677 0.39744749 0.44634731 0.41809231]
Best Hyperparameters:  {'n_components': 3}
Best Score:  0.4709667670077217

We observe that the score is highest for 3 components and does not improve with more components, making 3 components the best choice for this model. We now train a PLS model with 3 components on the training dataset, and we'll use this model to get predictions on the test dataset.

plsda = PLSRegression(n_components=3)
plsda_model = plsda.fit(DataTrain, YTrain)

y_pred = plsda_model.predict(DataTest)

Remember that PLS-DA is really just PLS in which we converted the categorical labels to zeros and ones, and PLS is a regression model that predicts a continuous output. We'll therefore map all predicted values greater than 0.5 to one and the rest to zero:

y_pred = (y_pred > 0.5).astype('uint8')

Finally, by comparing the predicted labels with the actual labels, we can calculate the accuracy of the model and construct a confusion matrix:

from sklearn.metrics import accuracy_score, confusion_matrix
print(f"Accuracy: {accuracy_score(YTest, y_pred):.2f}")
tn, fp, fn, tp = confusion_matrix(YTest, y_pred).ravel()
print(f"True negative rate: {tn/(tn+fp):.2f}")
print(f"False negative rate: {fn/(fn+tp):.2f}")
print(f"True positive rate: {tp/(tp+fn):.2f}")
print(f"False positive rate: {fp/(fp+tn):.2f}")

We observe the following output (again, the exact values may differ slightly):

Accuracy: 0.85
True negative rate: 0.79
False negative rate: 0.09
True positive rate: 0.91
False positive rate: 0.21
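The confusion matrix can also be shown graphically. The snippet below is an optional visualisation (not part of the original workflow) that uses the seaborn heatmap function, with the decoded class names from le.classes_ on the axes:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Optional: visualise the confusion matrix as an annotated heatmap
cm = confusion_matrix(YTest, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()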