Machine learning in OMICs – Introduction
Machine learning methods, including deep learning and ensemble techniques, are crucial in extracting valuable insights from OMICs data. These algorithms demonstrate exceptional proficiency in identifying complex patterns, categorizing molecular profiles, and forecasting associations within the expansive biological data domain. The incorporation of machine learning not only enhances the efficiency and precision of data analysis but also facilitates the identification of previously unnoticed biological relationships, contributing to novel discoveries in the field.
Machine learning can be applied across the different OMICs layers, for example:
Metabolomics prediction: operates at the level of metabolites, the downstream products of various cellular processes.
Lipidomics prediction: specifically addresses lipids.
Genomic prediction: operates at the level of the genetic code, focusing on the instructions encoded in DNA and their roles in cellular structure and function.
In general, machine learning methods can be divided into supervised, unsupervised, and semi-supervised approaches.
General Steps for Machine Learning in OMICs
Preprocessing:
Preprocessing OMICs data involves cleaning and transforming the raw data to prepare it for analysis. This step includes handling missing values, normalizing data across different samples, and addressing challenges unique to OMICs datasets, such as batch effects or platform-specific variations.
More information is in the first part of this GitBook – Pre-processing steps.
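For illustration, below is a minimal preprocessing sketch in R covering missing-value filtering and a simple imputation baseline; the matrix omics_mat and its dimensions are hypothetical, simulated for this example only.

```r
# Simulate an OMICs-like matrix: samples in rows, molecular features in
# columns (hypothetical data, for illustration only)
set.seed(42)
omics_mat <- matrix(rlnorm(20 * 50), nrow = 20,
                    dimnames = list(paste0("S", 1:20), paste0("F", 1:50)))
omics_mat[sample(length(omics_mat), 30)] <- NA   # inject missing values

# Drop features with more than 20% missing values (an arbitrary cut-off)
omics_mat <- omics_mat[, colMeans(is.na(omics_mat)) <= 0.2]

# Impute remaining gaps with the per-feature median (a simple baseline;
# k-NN imputation from the 'impute' package is a common alternative)
for (j in seq_len(ncol(omics_mat))) {
  miss <- is.na(omics_mat[, j])
  omics_mat[miss, j] <- median(omics_mat[, j], na.rm = TRUE)
}
```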
Normalization:
Normalizing OMICs data is crucial to ensure that different molecular profiles (metabolomics, lipidomics, etc.) are on a comparable scale. This step is particularly important when integrating multiple OMICs datasets or when using machine learning algorithms that are sensitive to differences in data magnitude.
More information is in this GitBook – Normalization steps.
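A minimal sketch of two widespread normalization choices in R, continuing with the simulated omics_mat matrix from the preprocessing sketch above:

```r
# Log-transform to tame skewed intensity distributions
log_mat <- log2(omics_mat + 1)

# Autoscale each feature (zero mean, unit variance) so that features with
# large magnitudes do not dominate scale-sensitive learners
scaled_mat <- scale(log_mat, center = TRUE, scale = TRUE)

# Sanity check: feature means should be approximately 0 after scaling
summary(colMeans(scaled_mat))
```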
Division into Training and Testing:
The OMICs dataset is partitioned into a training set and a testing set. The training set is employed to train the machine learning model to recognize patterns within the specific OMICs data, while the testing set is reserved to assess the model's predictive performance on unseen samples.
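A stratified split can be sketched with the caret package (listed below); the binary outcome group is invented here for illustration and assumed to be aligned with the rows of scaled_mat:

```r
library(caret)

# Hypothetical binary outcome for the 20 simulated samples
group <- factor(rep(c("case", "control"), each = 10))

set.seed(7)
train_idx <- createDataPartition(group, p = 0.8, list = FALSE)  # stratified 80/20
x_train <- scaled_mat[train_idx, ];  y_train <- group[train_idx]
x_test  <- scaled_mat[-train_idx, ]; y_test  <- group[-train_idx]
```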
Training:
During the training phase, the machine learning model learns from the patterns present in the training set of OMICs data. This can involve identifying molecular signatures associated with specific outcomes, such as disease states or treatment responses.
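Continuing the sketch, a random forest classifier (via caret and the randomForest package listed below) can be trained with 5-fold cross-validation:

```r
# 5-fold cross-validation during training
ctrl <- trainControl(method = "cv", number = 5)

set.seed(7)
rf_fit <- train(x = x_train, y = y_train,
                method = "rf", trControl = ctrl)
print(rf_fit)  # cross-validated accuracy across candidate settings
```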
Testing:
The trained model is then tested on a separate set of OMICs samples not used during the training phase. This evaluation assesses the model's ability to generalize its predictions to new, unseen OMICs data, providing insights into its real-world predictive capabilities.
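Evaluating the fitted model on the held-out samples, continuing the same hypothetical example:

```r
# Predict classes for the unseen test samples and summarize performance
pred <- predict(rf_fit, newdata = x_test)
confusionMatrix(pred, y_test)  # accuracy, sensitivity, specificity, etc.
```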
Validation:
In the context of predicting OMICs data, a validation set might be employed to fine-tune model parameters, especially when dealing with hyperparameters specific to the chosen machine learning algorithm. This step ensures that the model's performance is optimized for the nuances of the OMICs dataset.
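A hyperparameter-tuning sketch for the same example: an explicit grid over the random forest's mtry parameter, with the resampled performance deciding the best value (the grid values are arbitrary):

```r
# Explicit tuning grid for 'mtry' (number of features tried at each split)
grid <- expand.grid(mtry = c(2, 5, 10))

set.seed(7)
rf_tuned <- train(x = x_train, y = y_train, method = "rf",
                  trControl = ctrl, tuneGrid = grid)
rf_tuned$bestTune  # hyperparameter value selected by cross-validation
```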
These steps collectively form the predictive modeling workflow tailored to OMICs data, emphasizing the need for meticulous preprocessing and normalization to extract meaningful insights from complex molecular datasets. The ultimate goal is to deploy robust models capable of making accurate predictions in OMICs research.
A review of useful software packages for implementing machine learning in the predominant programming languages of computational biology (namely, R and Python) was provided in the study by Sidak et al.:
Here, we have extracted a subset of the tools implemented in R and Python and provide a more extensive overview of them.
OMICs machine learning in R
In the dynamic landscape of computational biology, integrating machine learning methodologies with OMICs data analysis in the R programming language has emerged as a powerful paradigm.
The table below presents R packages for OMICs machine learning based on Sidak, D. et al. (DOI: 10.3389/fmolb.2022.926623):
| Package | Description |
| --- | --- |
| caret | Training for classification and regression; functions for training and visualizing classification and regression models, including assessment of feature importance. |
| cluster | Cluster analysis techniques building on the work of Peter Rousseeuw, Anja Struyf, and Mia Hubert, rooted in Kaufman and Rousseeuw (1990), "Finding Groups in Data". |
| clusterGeneration | Random cluster generation with a defined degree of separation. |
| deepnet | A comprehensive deep learning toolkit implementing various deep learning architectures and neural network algorithms. Example application (conference abstract): Liang, C. A., et al. "Proteomics analysis of FLT3-ITD mutation in acute myeloid leukemia using deep learning neural network." Annals of Clinical & Laboratory Science, 2019, 49(1): 119-126. |
| e1071 | Miscellaneous functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien, including support vector machines. |
| GenABEL | A toolkit for genome-wide association analysis of relationships between quantitative or binary traits and single-nucleotide polymorphisms (SNPs). |
| glmnet | Highly efficient procedures for fitting the complete lasso or elastic-net regularization path for linear, logistic, multinomial, Poisson, and Cox regression, as well as multiple-response Gaussian and grouped multinomial models. |
| h2o | An open-source machine learning platform providing parallelized implementations of numerous supervised and unsupervised algorithms. |
| impute | Conducts imputation on a dataset and, alongside the imputed data, returns an object of learned coefficients and diagnostic information that can be reused, together with a new dataset, for re-imputation. |
| limma | Linear models and identification of differentially expressed genes in microarray data. |
| LMGene | Software for transforming data and identifying differentially expressed genes in gene expression arrays. |
| mclust | Gaussian finite mixture models fitted via the Expectation-Maximization (EM) algorithm for model-based clustering, classification, and density estimation, with Bayesian regularization, dimension reduction for visualization, and resampling-based inference. |
| mice | Multivariate Imputation by Chained Equations (MICE), with built-in imputation models for continuous, binary, unordered categorical, and ordered categorical data. |
| mxnet | A simple yet robust interface for deep learning within the R programming language. |
| neuralnet | Training of neural networks by backpropagation, with flexible choice of error and activation functions and computation of generalized weights. |
| pls | Partial least squares (PLS) and principal component regression (PCR) functions. |
| randomForest | Implementation of the random forest algorithm for classification and regression. |
| survival | Core routines for survival analysis, including multi-state curves, Cox models, and parametric accelerated failure time models. |
| mixOmics | Multivariate methods for data exploration, integration of different biological datasets, and feature selection: (s)PLS, (s)PCA, and others. |
| ropls | Pipelines implementing (O)PLS(-DA) and PCA based on the classic NIPALS algorithm; a Bioconductor package. |
| tidymodels | A collection of packages for machine learning, modeling, and data science, supporting the steps before and after model building. |
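As a brief illustration of one listed package, the sketch below fits a lasso-penalized logistic regression with glmnet, reusing the hypothetical training matrices from the workflow sketches above; the non-zero coefficients flag candidate molecular features.

```r
library(glmnet)

# Cross-validated lasso (alpha = 1) logistic regression on the training set
cv_fit <- cv.glmnet(x = x_train, y = y_train, family = "binomial", alpha = 1)

# Features retained at the lambda minimizing cross-validated deviance
coefs <- coef(cv_fit, s = "lambda.min")
rownames(coefs)[as.vector(coefs) != 0]
```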
OMICs machine learning in Python
In the evolving realm of computational biology, the fusion of machine learning methodologies with OMICs data analysis in the Python programming language has become a powerful paradigm, in some domains even more widely adopted than R.
The table below presents Python packages for OMICs machine learning based on Sidak, D. et al. (DOI: 10.3389/fmolb.2022.926623):
| Package | Description |
| --- | --- |
| Captum | A scalable model-interpretability library built on the PyTorch framework. |
| EasyNN | A package offering a user-friendly neural network designed to work with various datasets while providing customization options. Example application (conference abstract): Liang, C. A., et al. "Proteomics analysis of FLT3-ITD mutation in acute myeloid leukemia using deep learning neural network." Annals of Clinical & Laboratory Science, 2019, 49(1): 119-126. |
| GraKeL | An implementation of numerous widely recognized graph kernels. |
| Gurobi Optimizer | The Gurobi Python API enables mathematical optimization through code-based modeling. |
| Keras | A high-level deep learning API, with guides covering topics such as layer subclassing, fine-tuning, and model saving. |
| LIBSVM | Integrated software for support vector classification, regression, and distribution estimation. |
| PyTorch | An open-source machine learning framework that accelerates the path from research prototyping to production deployment; an optimized tensor library for deep learning on GPUs and CPUs. |
| PyMC3 | Probabilistic programming in Python; used, for example, in a machine learning automated recommendation tool for synthetic biology. |
| scikit-learn | Simple and efficient tools for predictive data analysis, built on NumPy, SciPy, and matplotlib. |
| TensorFlow | Makes it straightforward for both novices and experts to create machine learning models for desktop, mobile, web, and cloud environments. |
| Theano | A library for defining, optimizing, and evaluating mathematical expressions efficiently, particularly those involving multi-dimensional arrays. |
| TPOT | Tree-based Pipeline Optimization Tool; automates model selection and machine learning pipeline optimization. |
| WWL | The repository accompanying the NeurIPS 2019 paper on Wasserstein Weisfeiler-Lehman graph kernels. |