OMICs machine learning – Examples
Elements of machine learning using OMICs data
We encourage you to watch the StatQuest tutorial by Josh Starmer to understand the general concept behind the Support Vector Machines:
SVM can be applied in lipidomics and metabolomics for sample classification, feature selection, and regression.
Check out the selected manuscripts utilizing SVM in lipidomics/metabolomics:
This R example is designed to perform a classification task using Support Vector Machine (SVM) modeling on Data set no. 1 (Example data sets - Introduction).
Here, we will use the e1071 package developed at TU Wien. Take a look at the CRAN information:
The initial steps involve preprocessing the data: the 'Sample Name' column is removed (it is a sample identifier, not a predictor), and problematic characters are stripped from the column names. The predictor variables are columns 2 through 128 (lipid concentrations), and the response variable ('Label': PDAC patients (T), pancreatitis patients (PAN), or healthy controls (N)) is converted to a factor for categorical representation.
Subsequently, the dataset is split into training and testing sets, with 80% allocated for training and 20% for testing, ensuring a randomized selection of samples. Because the random split can produce a different outcome on every run, we set a seed for reproducibility. The data set is then log-transformed and autoscaled.
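The R code itself is not reproduced on this page. As a rough sketch of the same preprocessing steps (seeded 80/20 split, log transformation, autoscaling), here is an equivalent in Python/scikit-learn; the column names and sample sizes are illustrative assumptions, and a small synthetic table stands in for Data set no. 1:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Data set no. 1: 60 samples x 5 lipid concentrations
# plus a 'Label' column with the three sample groups (T / PAN / N).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.lognormal(size=(60, 5)),
                  columns=[f"Lipid_{i}" for i in range(5)])
df["Label"] = rng.choice(["T", "PAN", "N"], size=60)

X = df.drop(columns=["Label"])
y = df["Label"]

# Fixed seed for reproducibility; 80 % training, 20 % testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Log-transform, then autoscale; the scaler is fitted on the training set
# only, so no information leaks from the test set.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(np.log(X_train))
X_test_s = scaler.transform(np.log(X_test))

print(X_train_s.shape, X_test_s.shape)
```

Fitting the scaler on the training set and only applying it to the test set mirrors how the model will see genuinely new samples.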
One of the crucial steps in machine learning is the tuning of model hyperparameters. Depending on the type of model, different hyperparameters must be tuned. Our classifier is trained on the training data using a linear kernel, which is most effective if straight lines, planes, or hyperplanes can separate the observations. For the linear kernel, we should tune the C (cost) parameter, which we can do with the tune.svm() function from the e1071 package, scanning C over a wide range of values. The smaller the C, the less likely the SVM model is to overfit the training data set.
From the tune run, we conclude that the C parameter should be set to 0.1 in our final model:
The core of the script involves the creation of an SVM classifier using the svm() function. In the final SVM model, we set the C parameter from the best tuning (here, it is 0.1). Predictions are then made on the test set, and the script proceeds to evaluate the SVM model's accuracy. This evaluation is performed by comparing the predicted labels to the actual labels in the test data and computing the accuracy as the proportion of correctly classified instances. Additionally, a classification report is generated using the confusion matrix, and the results are stored for further analysis or reporting.
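Since the R code is not shown here, the following is a hedged Python/scikit-learn sketch of the same train-predict-evaluate sequence, run on synthetic three-class data, with the cost parameter set to the value found by tuning (C = 0.1 in this example):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class data standing in for the preprocessed lipidomics matrix.
X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Linear-kernel SVM with the cost parameter taken from the tuning step.
clf = SVC(kernel="linear", C=0.1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy = proportion of correctly classified test samples.
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```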
And we obtain:
Our SVM performance can be further evaluated through a more advanced confusion matrix (e.g., from the caret package) or based on the ROC curve (pROC package).
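In Python, an analogous ROC-based check can be sketched with scikit-learn (the data here are synthetic; for a multi-class problem the AUC is averaged one-vs-rest):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y)

# probability=True enables predict_proba (via internal cross-validated
# Platt scaling), which the ROC-based evaluation needs.
clf = SVC(kernel="linear", C=0.1, probability=True, random_state=2)
clf.fit(X_train, y_train)

# One-vs-rest AUC averaged over the three classes.
auc = roc_auc_score(y_test, clf.predict_proba(X_test), multi_class="ovr")
print("Macro OVR AUC:", auc)
```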
The provided code showcases the implementation of a Support Vector Machine (SVM) classifier for a classification task on Data set no. 1 (Example data sets - Introduction). The code begins by importing the necessary libraries, including Pandas, scikit-learn's SVM module (SVC), and related modules for data preprocessing and evaluation.
In this example, after loading the data, the independent variables ('X') and the dependent variable ('y') are defined. Feature standardization is performed using StandardScaler, which is crucial for SVM algorithms. The data is then split into training and testing sets in a stratified way, so that the different sample groups are equally represented in both sets. For a linear SVM, the C parameter, which controls the regularization of the model, should be tuned (a higher C carries a higher risk of producing an overfitted model). We use a grid search over a range of different C values:
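A sketch of such a grid search with scikit-learn's GridSearchCV follows; the data are synthetic and the grid of candidate C values is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
# Stratified split so every class is represented in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)

# Candidate C values spanning several orders of magnitude.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(kernel="linear"), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train_s, y_train)

print("Best C:", grid.best_params_["C"])
print("Best CV accuracy:", grid.best_score_)
```

GridSearchCV scores every candidate by cross-validation on the training data only, keeping the test set untouched for the final evaluation.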
We get the following output and observe that the model reaches a maximum accuracy on the training dataset for C=0.01:
We now create an SVM classifier with the regularization parameter C set to 0.01. The classifier is trained on the standardized training data, and predictions are made on the test set. Model performance is evaluated using accuracy and a classification report, and the results are printed at the end of the script:
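A minimal sketch of this final model on synthetic data; a Pipeline is used here so that standardization and classification stay bundled together:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The pipeline standardizes the test set with parameters learned from the
# training data only, then applies the tuned linear SVM (C = 0.01).
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=0.01))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```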
We observe the following output:
We encourage you to watch the StatQuest tutorials by Josh Starmer to understand the general concept behind the Decision and Classification Trees and Random Forest:
Random forest is applied in lipidomics and metabolomics, e.g., to impute missing values (as previously mentioned), correct batch-related effects, select the most important lipids or metabolites (feature selection for classification), and to classify samples or perform regression.
Take a look at the selected applications below as real-life examples:
This R script demonstrates the implementation of a Random Forest classifier for a classification task using the same PDAC dataset (Data set no. 1 from the examples - Introduction to the GitBook). Here, we will rely on the randomForest R package. You can learn more about it here:
If you are particularly interested in Random Forest models, read the article published by the authors of the package, too:
We begin by installing and loading the required libraries, namely 'randomForest' for the Random Forest algorithm and 'readxl' for reading Excel files.
The dataset is then loaded from an Excel file, and the 'Sample Name' column is dropped, as it is a sample identifier rather than a predictor. To address potential issues with column names, problematic characters are removed, and the response variable ('Label') is converted to a factor.
The data is split into training and testing sets using a seed for reproducibility:
Using the caret package, we will tune two hyperparameters of the RF model: the number of trees (ntree) and the number of features (lipids) considered at each split (mtry). Checking the model information via the modelLookup() function from caret tells us which parameters can be tuned through the tuneGrid argument of train() (also from caret). This argument allows for simple, automated tuning.
We obtain:
We can tune the mtry parameter through the tuneGrid argument. As it requires a data frame as input, we create one with expand.grid(). The second hyperparameter, ntree, is then tuned using a for loop. From the output produced by train(), we extract the accuracy, which will be used to select the best mtry and ntree. The results from the loop are stored in a data frame named 'tuning_result'. Here is the code:
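The R loop is not reproduced here; an equivalent sketch in Python/scikit-learn is shown below, where n_estimators plays the role of ntree and max_features the role of mtry (the grids and synthetic data are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)

# Loop over the two-parameter grid; store the cross-validated accuracy
# of every (ntree, mtry) combination, mirroring the R for loop.
rows = []
for ntree in [100, 200, 300]:
    for mtry in [2, 5, 10]:
        rf = RandomForestClassifier(n_estimators=ntree, max_features=mtry,
                                    random_state=0)
        acc = cross_val_score(rf, X, y, cv=3, scoring="accuracy").mean()
        rows.append({"ntree": ntree, "mtry": mtry, "accuracy": acc})

tuning_result = pd.DataFrame(rows)
# Pick the row with the maximum accuracy, as done with 'tuning_result' in R.
best = tuning_result.loc[tuning_result["accuracy"].idxmax()]
print(tuning_result)
print("Best:", best.to_dict())
```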
NOTE: Be patient! These computations can take some time.
Now, we are looking for maximum accuracy in the 'tuning_result' data frame. We can use:
The maximum accuracy is 0.7865898, which corresponds to mtry of 10 and ntree of 200, according to the 'tuning_result' table. Next, we build a Random Forest classifier with mtry of 10 and ntree of 200, and the model is trained on the training data. We also extract the proximity matrix (proximity = TRUE). The proximity, which is considered a measure of (dis)similarity, will be used to create a multi-dimensional scaling plot (MDS plot).
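scikit-learn's random forest has no proximity option, but the same quantity can be sketched from shared terminal leaves: the proximity of two samples is the fraction of trees in which they land in the same leaf. A hedged Python sketch on synthetic data, ending in MDS coordinates ready for plotting:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

X, y = make_classification(n_samples=100, n_features=20, n_informative=8,
                           n_classes=3, random_state=1)

rf = RandomForestClassifier(n_estimators=200, max_features=10, random_state=1)
rf.fit(X, y)

# rf.apply() returns, for each sample, the leaf index it reaches in every
# tree; proximity = fraction of trees where two samples share a leaf.
leaves = rf.apply(X)                       # shape: (n_samples, n_trees)
n = len(X)
prox = np.array([[np.mean(leaves[i] == leaves[j]) for j in range(n)]
                 for i in range(n)])

# MDS on 1 - proximity (a dissimilarity) yields 2-D coordinates,
# the Python counterpart of the R MDS plot.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
coords = mds.fit_transform(1 - prox)
print(coords.shape)
```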
If we call the rf_model and MDS plot, we obtain two outputs:
Predictions are then made on the test set, and the example evaluates the model's accuracy by comparing predictions to the actual labels. A classification report is also generated from the confusion matrix. Finally, the example prints the Random Forest accuracy and the classification report for further analysis.
We obtain:
The example utilizes various libraries to perform a classification task using the Random Forest algorithm on Data set no. 1 (Example data sets - Introduction).
The code begins by importing necessary Python libraries, such as NumPy and Pandas, along with specific modules from the scikit-learn library:
Next, we load the data into a DataFrame named df using Pandas. The independent variables (features) are assigned to 'X', and the dependent variable (target) to 'y'. The data is then split into training and testing sets using the train_test_split function: 80% of the data is used for training (X_train and y_train), while the remaining 20% is used for testing (X_test and y_test).
In contrast to an SVM with a linear kernel (which required us to tune only one model parameter, C), a Random Forest classifier has many hyperparameters that require tuning (e.g., the depth of the trees, the number of trees, the minimum number of samples required to split a node, etc.). In this example, we will tune six different hyperparameters, each over a range of values, yielding a combinatorially large number of possible settings. Since it is no longer computationally feasible to test all these combinations, we use a random search algorithm (RandomizedSearchCV from sklearn), which tests a random subset of the given hyperparameter combinations. While there is no guarantee of finding the absolute best settings this way, in practice it works well enough.
And we observe the following output (which will differ to some extent on each run, given the random nature of the search):
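A sketch of such a random search on synthetic data; the six hyperparameter ranges below are illustrative assumptions, not the values from the original script:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)

# Six hyperparameters, each over a range of values; the random search
# samples a fixed number of combinations instead of testing the full grid.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": [None, 5, 10, 20, 50],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=10, cv=3,
                            scoring="accuracy", random_state=0, n_jobs=-1)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```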
Next, a Random Forest classifier is created with the optimized settings of the parameters:
The classifier is trained on the training data using the fit method, and predictions are made on the test set using the trained classifier:
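A minimal sketch of this final step on synthetic data; the "best" settings below stand in for whatever the random search returned on your run:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Hypothetical "best" settings standing in for the RandomizedSearchCV output.
rf = RandomForestClassifier(n_estimators=300, max_depth=10,
                            min_samples_split=2, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print("Random Forest accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```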
We obtain the following output:
J. Zhou et al. Metabolic detection of malignant brain gliomas through plasma lipidomic analysis and support vector machine-based machine learning. DOI: (the authors perform metabolic detection of malignant brain gliomas through plasma lipidomic analysis and support vector machine-based machine learning).
N. Perakakis et al. Non-invasive diagnosis of non-alcoholic steatohepatitis and fibrosis with the use of omics and supervised learning: A proof of concept study. DOI: (non-invasive diagnosis of non-alcoholic steatohepatitis and fibrosis with the use of lipidomics & glycomics and supervised machine learning).
V. Scala et al. Mass Spectrometry-Based Targeted Lipidomics and Supervised Machine Learning Algorithms in Detecting Disease, Cultivar, and Treatment Biomarkers in Xylella fastidiosa subsp. pauca-Infected Olive Trees. DOI: (lipidomics and machine learning in plant-related research).
X. Huang et al. Integrating machine learning and nontargeted plasma lipidomics to explore lipid characteristics of premetabolic syndrome and metabolic syndrome. DOI: (the authors integrate machine learning and nontargeted plasma lipidomics to explore lipid characteristics of premetabolic syndrome and metabolic syndrome).
Y. Chen et al. Metabolomic machine learning predictor for diagnosis and prognosis of gastric cancer. DOI: (gastric cancer diagnostic model built using a random forest algorithm with LASSO feature selection).
P.F. Garrido et al. Lipidomics signature in post-COVID patient sera and its influence on the prolonged inflammatory response. DOI: (random forest model successfully differentiated between symptomatic and asymptomatic post-COVID conditions groups using lipidomic profiles).
S. G. Snowden et al. Combining lipidomics and machine learning to measure clinical lipids in dried blood spots. DOI: (the authors of the manuscript published in Metabolomics (Springer-Nature) combine lipidomics and machine learning (random forest) to measure 'clinical lipids', i.e., HDL, LDL, TG, in dried blood spots).
T. Deng et al. Lipidomics random forest algorithm of seminal plasma is a promising method for enhancing the diagnosis of necrozoospermia. DOI: (the authors evaluate the efficacy of a lipidomics-based random forest algorithm model in identifying necrozoospermia).
X. Huang et al. Integrating machine learning and nontargeted plasma lipidomics to explore lipid characteristics of premetabolic syndrome and metabolic syndrome. DOI: (the authors integrate machine learning and nontargeted plasma lipidomics to explore lipid characteristics of premetabolic syndrome and metabolic syndrome).
S. Fan et al. Systematic Error Removal Using Random Forest for Normalizing Large-Scale Untargeted Lipidomics Data. DOI: (the authors use random forest for batch effect removal).