OMICs machine learning – Examples
Elements of machine learning using OMICs data
We encourage you to watch the StatQuest tutorial by Josh Starmer to understand the general concept behind the Support Vector Machines:
SVM can be applied in lipidomics and metabolomics for sample classification, feature selection, and regression.
Check out the selected manuscripts utilizing SVM in lipidomics/metabolomics:
This R example is designed to perform a classification task using Support Vector Machine (SVM) modeling on Data set no. 1 (Example data sets - Introduction).
Here, we will use the e1071 package developed at TU Wien. Take a look at the CRAN information:
The initial steps involve preprocessing the data: the 'Sample Name' column is removed (it is a sample identifier, not a predictor), and problematic characters are stripped from the column names. The predictor variables are columns 2 through 128 (lipid concentrations), and the response variable ('Label': PDAC patients (T), pancreatitis patients (PAN), or healthy controls (N)) is converted to a factor for categorical representation.
Subsequently, the dataset is split into training and testing sets, with 80% allocated for training and 20% for testing, ensuring a randomized selection of samples. Because the random split can produce a different outcome on every run, we set a seed for reproducibility. The data set is then log-transformed and autoscaled.
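The R code itself is not reproduced on this page. As a rough sketch of the same preprocessing steps (seeded 80/20 split, log transformation, autoscaling), here is an equivalent in Python/scikit-learn; the column names and sample sizes are illustrative assumptions, and a small synthetic table stands in for Data set no. 1:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Data set no. 1: 60 samples x 5 lipid concentrations
# plus a 'Label' column with the three sample groups (T / PAN / N).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.lognormal(size=(60, 5)),
                  columns=[f"Lipid_{i}" for i in range(5)])
df["Label"] = rng.choice(["T", "PAN", "N"], size=60)

X = df.drop(columns=["Label"])
y = df["Label"]

# Fixed seed for reproducibility; 80 % training, 20 % testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Log-transform, then autoscale; the scaler is fitted on the training set
# only, so no information leaks from the test set.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(np.log(X_train))
X_test_s = scaler.transform(np.log(X_test))

print(X_train_s.shape, X_test_s.shape)
```

Fitting the scaler on the training set and only applying it to the test set mirrors how the model will see genuinely new samples.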
One of the crucial steps in machine learning is the tuning of model hyperparameters. Depending on the type of model, different hyperparameters must be tuned. Our classifier is trained on the training data using a linear kernel, which is most effective if straight lines, planes, or hyperplanes can separate the observations. For the linear kernel, we should tune the C (cost) parameter, which we can do with the tune.svm() function from the e1071 package, scanning C over a wide range of values. The smaller the C, the less likely the SVM model is to overfit the training data set.
From the tune run, we conclude that the C parameter should be set to 0.1 in our final model:
The core of the script involves the creation of an SVM classifier using the svm() function. In the final SVM model, we set the C parameter from the best tuning (here, it is 0.1). Predictions are then made on the test set, and the script proceeds to evaluate the SVM model's accuracy. This evaluation is performed by comparing the predicted labels to the actual labels in the test data and computing the accuracy as the proportion of correctly classified instances. Additionally, a classification report is generated using the confusion matrix, and the results are stored for further analysis or reporting.
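Since the R code is not shown here, the following is a hedged Python/scikit-learn sketch of the same train-predict-evaluate sequence, run on synthetic three-class data, with the cost parameter set to the value found by tuning (C = 0.1 in this example):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class data standing in for the preprocessed lipidomics matrix.
X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Linear-kernel SVM with the cost parameter taken from the tuning step.
clf = SVC(kernel="linear", C=0.1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy = proportion of correctly classified test samples.
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```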
And we obtain:
Our SVM performance can be further evaluated through a more advanced confusion matrix (e.g., from the caret package) or based on the ROC curve (pROC package).
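In Python, an analogous ROC-based check can be sketched with scikit-learn (the data here are synthetic; for a multi-class problem the AUC is averaged one-vs-rest):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y)

# probability=True enables predict_proba (via internal cross-validated
# Platt scaling), which the ROC-based evaluation needs.
clf = SVC(kernel="linear", C=0.1, probability=True, random_state=2)
clf.fit(X_train, y_train)

# One-vs-rest AUC averaged over the three classes.
auc = roc_auc_score(y_test, clf.predict_proba(X_test), multi_class="ovr")
print("Macro OVR AUC:", auc)
```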
The provided code showcases the implementation of a Support Vector Machine (SVM) classifier for a classification task on Data set no. 1 (Example data sets - Introduction). The code begins by importing the necessary libraries, including Pandas, scikit-learn's SVM module (SVC), and related modules for data preprocessing and evaluation.
In this example, after loading the data, the independent variables ('X') and the dependent variable ('y') are defined. Feature standardization is performed using StandardScaler, which is crucial for SVM algorithms. The data is then split into training and testing sets in a stratified way, so that the different sample groups are equally represented in both sets. For a linear SVM, the C parameter, which controls the regularization of the model, should be tuned (a higher C carries a higher risk of producing an overfitted model). We use a grid search over a range of different C values:
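A sketch of such a grid search with scikit-learn's GridSearchCV follows; the data are synthetic and the grid of candidate C values is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
# Stratified split so every class is represented in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)

# Candidate C values spanning several orders of magnitude.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(kernel="linear"), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train_s, y_train)

print("Best C:", grid.best_params_["C"])
print("Best CV accuracy:", grid.best_score_)
```

GridSearchCV scores every candidate by cross-validation on the training data only, keeping the test set untouched for the final evaluation.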
We get the following output and observe that the model reaches a maximum accuracy on the training dataset for C=0.01:
We now create an SVM classifier with the regularization parameter C set to 0.01. The classifier is trained on the standardized training data, and predictions are made on the test set. Model performance is evaluated using accuracy and a classification report, and the results are printed at the end of the script:
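A minimal sketch of this final model on synthetic data; a Pipeline is used here so that standardization and classification stay bundled together:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The pipeline standardizes the test set with parameters learned from the
# training data only, then applies the tuned linear SVM (C = 0.01).
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=0.01))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```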
We observe the following output:
We encourage you to watch the StatQuest tutorials by Josh Starmer to understand the general concept behind the Decision and Classification Trees and Random Forest:
Random forest is applied in lipidomics and metabolomics, e.g., to impute missing values (as previously mentioned), correct batch-related effects, select the most important lipids or metabolites (feature selection for classification), and to classify samples or perform regression.
Take a look at the selected applications below as real-life examples:
This R script demonstrates the implementation of a Random Forest classifier for a classification task using the same PDAC dataset (Data set no. 1 from the examples - Introduction to the GitBook). Here, we will rely on the randomForest R package. You can learn more about it here:
If you are particularly interested in Random Forest models, read the article published by the authors of the package, too:
We begin by installing and loading the required libraries, namely 'randomForest' for the Random Forest algorithm and 'readxl' for reading Excel files.
The dataset is then loaded from an Excel file, and the 'Sample Name' column is dropped, as it is a sample identifier rather than a predictor. To address potential issues with column names, problematic characters are removed, and the response variable ('Label') is converted to a factor.
The data is split into training and testing sets using a seed for reproducibility:
Using the caret package, we will tune two hyperparameters of the RF model: the number of trees (ntree) and the number of features (lipids) considered at each split (mtry). Checking the model information via the modelLookup() function from caret tells us which parameters can be tuned through the tuneGrid argument of train() (also from caret). This argument allows for simple, automated tuning.
We obtain:
We can tune the mtry parameter through the tuneGrid argument. As it requires a data frame as input, we create one with expand.grid(). The second hyperparameter, ntree, is then tuned using a for loop. From the output produced by train(), we extract the accuracy, which will be used to select the best mtry and ntree. The results from the loop are stored in a data frame named 'tuning_result'. Here is the code:
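The R loop is not reproduced here; an equivalent sketch in Python/scikit-learn is shown below, where n_estimators plays the role of ntree and max_features the role of mtry (the grids and synthetic data are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)

# Loop over the two-parameter grid; store the cross-validated accuracy
# of every (ntree, mtry) combination, mirroring the R for loop.
rows = []
for ntree in [100, 200, 300]:
    for mtry in [2, 5, 10]:
        rf = RandomForestClassifier(n_estimators=ntree, max_features=mtry,
                                    random_state=0)
        acc = cross_val_score(rf, X, y, cv=3, scoring="accuracy").mean()
        rows.append({"ntree": ntree, "mtry": mtry, "accuracy": acc})

tuning_result = pd.DataFrame(rows)
# Pick the row with the maximum accuracy, as done with 'tuning_result' in R.
best = tuning_result.loc[tuning_result["accuracy"].idxmax()]
print(tuning_result)
print("Best:", best.to_dict())
```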
NOTE: Be patient! These computations can take some time.
Now, we are looking for maximum accuracy in the 'tuning_result' data frame. We can use:
The maximum accuracy is 0.7865898, which corresponds to mtry of 10 and ntree of 200, according to the 'tuning_result' table. Next, we build a Random Forest classifier with mtry of 10 and ntree of 200, and the model is trained on the training data. We also extract the proximity matrix (proximity = TRUE). The proximity, which is considered a measure of (dis)similarity, will be used to create a multi-dimensional scaling plot (MDS plot).
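scikit-learn's random forest has no proximity option, but the same quantity can be sketched from shared terminal leaves: the proximity of two samples is the fraction of trees in which they land in the same leaf. A hedged Python sketch on synthetic data, ending in MDS coordinates ready for plotting:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

X, y = make_classification(n_samples=100, n_features=20, n_informative=8,
                           n_classes=3, random_state=1)

rf = RandomForestClassifier(n_estimators=200, max_features=10, random_state=1)
rf.fit(X, y)

# rf.apply() returns, for each sample, the leaf index it reaches in every
# tree; proximity = fraction of trees where two samples share a leaf.
leaves = rf.apply(X)                       # shape: (n_samples, n_trees)
n = len(X)
prox = np.array([[np.mean(leaves[i] == leaves[j]) for j in range(n)]
                 for i in range(n)])

# MDS on 1 - proximity (a dissimilarity) yields 2-D coordinates,
# the Python counterpart of the R MDS plot.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
coords = mds.fit_transform(1 - prox)
print(coords.shape)
```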
If we call the rf_model and MDS plot, we obtain two outputs:
Predictions are then made on the test set, and the example evaluates the model's accuracy by comparing predictions to the actual labels. A classification report is also generated from the confusion matrix. Finally, the example prints the Random Forest accuracy and the classification report for further analysis.
We obtain:
The example utilizes various libraries to perform a classification task using the Random Forest algorithm on Data set no. 1 (Example data sets - Introduction).
The code begins by importing necessary Python libraries, such as NumPy and Pandas, along with specific modules from the scikit-learn library:
Next, we load the data into a DataFrame named df using Pandas. The independent variables (features) are assigned to 'X', and the dependent variable (target) to 'y'. The data is then split into training and testing sets using the train_test_split function: 80% of the data is used for training (X_train and y_train), while the remaining 20% is used for testing (X_test and y_test).
In contrast to an SVM with a linear kernel (which required us to tune only one model parameter, C), a Random Forest classifier has many hyperparameters that require tuning (e.g., the depth of the trees, the number of trees, the minimum number of samples required to split a node, etc.). In this example, we will tune six different hyperparameters, each over a range of values, yielding a combinatorially large number of possible settings. Since it is no longer computationally feasible to test all these combinations, we use a random search algorithm (RandomizedSearchCV from sklearn), which tests a random subset of the given hyperparameter combinations. While there is no guarantee of finding the absolute best settings this way, in practice it works well enough.
And we observe the following output (which will differ to some extent on each run, given the random nature of the search):
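A sketch of such a random search on synthetic data; the six hyperparameter ranges below are illustrative assumptions, not the values from the original script:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)

# Six hyperparameters, each over a range of values; the random search
# samples a fixed number of combinations instead of testing the full grid.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": [None, 5, 10, 20, 50],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=10, cv=3,
                            scoring="accuracy", random_state=0, n_jobs=-1)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```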
Next, a Random Forest classifier is created with the optimized settings of the parameters:
The classifier is trained on the training data using the fit method, and predictions are made on the test set using the trained classifier:
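A minimal sketch of this final step on synthetic data; the "best" settings below stand in for whatever the random search returned on your run:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Hypothetical "best" settings standing in for the RandomizedSearchCV output.
rf = RandomForestClassifier(n_estimators=300, max_depth=10,
                            min_samples_split=2, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print("Random Forest accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```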
We obtain the following output:
J. Zhou et al. Metabolic detection of malignant brain gliomas through plasma lipidomic analysis and support vector machine-based machine learning. DOI: (the authors perform metabolic detection of malignant brain gliomas through plasma lipidomic analysis and support vector machine-based machine learning).
N. Perakakis et al. Non-invasive diagnosis of non-alcoholic steatohepatitis and fibrosis with the use of omics and supervised learning: A proof of concept study. DOI: (non-invasive diagnosis of non-alcoholic steatohepatitis and fibrosis with the use of lipidomics & glycomics and supervised machine learning).
V. Scala et al. Mass Spectrometry-Based Targeted Lipidomics and Supervised Machine Learning Algorithms in Detecting Disease, Cultivar, and Treatment Biomarkers in Xylella fastidiosa subsp. pauca-Infected Olive Trees. DOI: (lipidomics and machine learning in plant-related research).
X. Huang et al. Integrating machine learning and nontargeted plasma lipidomics to explore lipid characteristics of premetabolic syndrome and metabolic syndrome. DOI: (the authors integrate machine learning and nontargeted plasma lipidomics to explore lipid characteristics of premetabolic syndrome and metabolic syndrome).
Y. Chen et al. Metabolomic machine learning predictor for diagnosis and prognosis of gastric cancer. DOI: (gastric cancer diagnostic model built using a random forest algorithm with LASSO feature selection).
P.F. Garrido et al. Lipidomics signature in post-COVID patient sera and its influence on the prolonged inflammatory response. DOI: (random forest model successfully differentiated between symptomatic and asymptomatic post-COVID conditions groups using lipidomic profiles).
S. G. Snowden et al. Combining lipidomics and machine learning to measure clinical lipids in dried blood spots. DOI: (the authors of the manuscript published in Metabolomics (Springer-Nature) combine lipidomics and machine learning (random forest) to measure 'clinical lipids', i.e., HDL, LDL, TG, in dried blood spots).
T. Deng et al. Lipidomics random forest algorithm of seminal plasma is a promising method for enhancing the diagnosis of necrozoospermia. DOI: (the authors evaluate the efficacy of a lipidomics-based random forest algorithm model in identifying necrozoospermia).
X. Huang et al. Integrating machine learning and nontargeted plasma lipidomics to explore lipid characteristics of premetabolic syndrome and metabolic syndrome. DOI: (the authors integrate machine learning and nontargeted plasma lipidomics to explore lipid characteristics of premetabolic syndrome and metabolic syndrome).
S. Fan et al. Systematic Error Removal Using Random Forest for Normalizing Large-Scale Untargeted Lipidomics Data. DOI: (the authors use random forest for batch effect removal).