Application of selected models to OMICs data

Elements of machine learning using OMICs data

Machine learning in OMICs – Introduction

Machine learning methods, including deep learning and ensemble techniques, are crucial in extracting valuable insights from OMICs data. These algorithms are well suited to identifying complex patterns, classifying molecular profiles, and predicting associations in large biological datasets. Machine learning can make data analysis more efficient and precise, and it can reveal biological relationships that would otherwise go unnoticed, contributing to novel discoveries in the field.

Metabolomics prediction: operates at the level of metabolites, the downstream products of various cellular processes.

Lipidomics prediction: specifically addresses lipids.

Genomic prediction: operates at the level of the genetic code, focusing on the instructions encoded in DNA and their roles in cellular structure and function.

In general, machine learning can be divided into

  • supervised,

  • unsupervised,

  • and semi-supervised.
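The three categories can be contrasted in a minimal sketch. It assumes scikit-learn and NumPy are available and uses a synthetic matrix in place of real OMICs measurements; the chosen models (logistic regression, k-means, label spreading) are only illustrative picks, not methods prescribed by this text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Synthetic stand-in for an OMICs matrix: 60 samples x 5 features,
# two groups whose feature means differ by 3 units (invented data).
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)), rng.normal(3.0, 1.0, (30, 5))])
y = np.array([0] * 30 + [1] * 30)

# Supervised: every sample has a label that the model learns to predict.
supervised = LogisticRegression().fit(X, y)

# Unsupervised: no labels are used; the algorithm searches for groups itself.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Semi-supervised: only a few samples are labelled; -1 marks unlabelled ones.
labelled = [0, 1, 2, 30, 31, 32]
y_partial = np.full_like(y, -1)
y_partial[labelled] = y[labelled]
semi = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
inferred = semi.transduction_  # labels inferred for all 60 samples
```

In practice, the choice between the three depends mainly on how many samples carry a trustworthy label, such as a confirmed diagnosis.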

General Individual Steps for Machine Learning in OMICs

  1. Preprocessing:

Preprocessing OMICs data involves cleaning and transforming the raw data to prepare it for analysis. This step includes handling missing values, normalizing data across different samples, and addressing challenges unique to OMICs datasets, such as batch effects or platform-specific variations.

More information is in the first part of this GitBook – Pre-processing steps.
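As a minimal sketch of the cleaning and transformation described above, the following NumPy example filters features with many missing values, imputes the remaining gaps, and applies a log transform. The intensity values, the 40 % missingness threshold, and the choice of median imputation are assumptions made for illustration; batch-effect correction is not shown.

```python
import numpy as np

# Toy intensity matrix: rows are samples, columns are features
# (e.g. metabolites). np.nan marks a missing measurement.
X = np.array([
    [120.0, 15.0, np.nan],
    [98.0, np.nan, 310.0],
    [143.0, 22.0, 295.0],
    [110.0, 18.0, np.nan],
])

# 1) Drop features with too many missing values (threshold is an assumption).
max_missing_fraction = 0.4
missing_fraction = np.isnan(X).mean(axis=0)
keep = missing_fraction <= max_missing_fraction
X_kept = X[:, keep]

# 2) Impute the remaining gaps with the per-feature median.
medians = np.nanmedian(X_kept, axis=0)
X_imputed = np.where(np.isnan(X_kept), medians, X_kept)

# 3) Log-transform to reduce the right skew typical of intensity data.
X_log = np.log2(X_imputed + 1.0)
```

Here the third feature is missing in half of the samples and is removed, while the single gap in the second feature is filled with that feature's median.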

  2. Normalization:

Normalizing OMICs data is crucial to ensure that different molecular profiles (metabolomics, lipidomics, etc.) are on a comparable scale. This step is particularly important when integrating multiple OMICs datasets or when using machine learning algorithms that are sensitive to differences in data magnitude.

More information is in this GitBook – Normalization steps.
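Two common ways of bringing data onto a comparable scale can be sketched with NumPy: total-sum normalization per sample, followed by autoscaling (z-scores) per feature. The matrix is invented, and these two methods are only examples; the appropriate normalization depends on the platform and the data.

```python
import numpy as np

# Toy matrix: 3 samples x 3 features. The second sample is exactly twice
# the first, mimicking a difference in overall signal intensity.
X = np.array([
    [10.0, 200.0, 3.0],
    [20.0, 400.0, 6.0],
    [15.0, 250.0, 5.0],
])

# Sample-wise: divide each row by its total signal so that samples with
# different overall intensity become comparable.
X_tsn = X / X.sum(axis=1, keepdims=True)

# Feature-wise: autoscaling (z-score), so that each feature has mean 0
# and unit standard deviation across samples.
mean = X_tsn.mean(axis=0)
std = X_tsn.std(axis=0, ddof=1)
X_scaled = (X_tsn - mean) / std
```

After total-sum normalization the first two samples have identical profiles, which shows that the overall intensity difference has been removed. When a model is trained afterwards, the feature-wise statistics should be computed on the training samples only.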

  3. Division into Training and Testing:

The OMICs dataset is partitioned into a training set and a testing set. The training set is employed to train the machine learning model to recognize patterns within the specific OMICs data, while the testing set is reserved to assess the model's predictive performance on unseen samples.
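A minimal sketch of such a partition, assuming scikit-learn is available and using synthetic data. The 75/25 ratio and the stratification by class label are illustrative choices; stratification is often helpful when one group (for example, cases) is much smaller than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 6))        # 40 samples x 6 features (synthetic)
y = np.array([0] * 28 + [1] * 12)   # imbalanced labels, e.g. control vs. case

# stratify=y keeps the class proportions similar in both subsets;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
```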

  4. Training:

During the training phase, the machine learning model learns from the patterns present in the training set of OMICs data. This can involve identifying molecular signatures associated with specific outcomes, such as disease states or treatment responses.
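The idea of learning a molecular signature can be illustrated with a random forest and its feature importances. This is a sketch on synthetic data, assuming scikit-learn; the "ground truth" in which only features 0 and 3 influence the outcome is a hypothetical construction, and the random forest is one possible model among many.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_samples, n_features = 200, 10
X_train = rng.normal(size=(n_samples, n_features))

# Hypothetical ground truth: only features 0 and 3 drive the outcome.
y_train = (X_train[:, 0] + X_train[:, 3] > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Rank features by impurity-based importance, highest first. The top
# entries are candidate signature members and still need confirmation
# on independent data.
ranking = np.argsort(model.feature_importances_)[::-1]
```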

  5. Testing:

The trained model is then tested on a separate set of OMICs samples not used during the training phase. This evaluation assesses the model's ability to generalize its predictions to new, unseen OMICs data, providing insights into its real-world predictive capabilities.
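A minimal sketch of this evaluation, assuming scikit-learn and synthetic data. Logistic regression, accuracy, and ROC AUC are illustrative choices; other models and metrics may suit a given OMICs study better.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 8))
# Synthetic outcome that depends on two features plus noise.
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The scaler is fitted inside the pipeline on the training data only, so
# no information from the test samples leaks into the model.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluate on samples the model has never seen.
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)
```

Reporting the training-set score next to the test-set score is a simple way to spot overfitting: a large gap suggests that the model has memorized the training samples.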

  6. Validation:

In the context of predicting OMICs data, a validation set, held out from the training data and kept separate from the testing set, can be used to tune the hyperparameters of the chosen machine learning algorithm. Although listed last here, this tuning takes place before the final evaluation on the testing set, so that the test samples remain unseen until the end. This step adapts the model's configuration to the nuances of the OMICs dataset.
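One common way to implement validation is cross-validation within the training set, sketched here with scikit-learn's grid search. The data are synthetic, and the model, the grid of `C` values, and the five folds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 20))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=150) > 0).astype(int)

# The test set is put aside first and is not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values for the regularization strength (illustrative grid).
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Each fold of the training set serves once as the validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

best_C = search.best_params_["clf__C"]
# Final, one-time evaluation of the refitted best model on the test set.
test_auc = search.score(X_test, y_test)
```

Because the folds are carved out of the training set, the test samples play no role in choosing the hyperparameters.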

These steps collectively form the predictive modeling workflow tailored to OMICs data, emphasizing the need for meticulous preprocessing and normalization to extract meaningful insights from complex molecular datasets. The ultimate goal is to deploy robust models capable of making accurate predictions in OMICs research.

A review of useful software packages for implementing machine learning in the two predominant programming languages of computational biology, R and Python, is given in the study by Sidak et al.:

Sidak, D., Schwarzerová, J., Weckwerth, W. and Waldherr, S., 2022. Interpretable machine learning methods for predictions in systems biology from omics data. Frontiers in Molecular Biosciences, 9, p.926623.

Here, we have extracted a subset of the packages listed there for R and Python and provide a more extensive overview of them.

OMICs machine learning in R

In computational biology, R is widely used to combine machine learning methods with OMICs data analysis, and a broad range of packages is available for this purpose.

The table below presents R packages for OMICs machine learning based on Sidak, D. et al. (DOI: 10.3389/fmolb.2022.926623):

| Package / library | Description | Use for OMICs data |
|---|---|---|
| caret | Classification and regression training: functions for training and visualizing classification and regression models, including the assessment of feature importance. | |
| cluster | Methods for cluster analysis, extending the original work of Peter Rousseeuw, Anja Struyf, and Mia Hubert and based on Kaufman and Rousseeuw (1990), "Finding Groups in Data." | |
| clusterGeneration | Random cluster generation with a defined degree of separation. | |
| deepnet | A deep learning toolkit with implementations of various deep learning architectures and neural network algorithms. | Conference abstract: Liang, Christine A., et al. Proteomics analysis of FLT3-ITD mutation in acute myeloid leukemia using deep learning neural network. Annals of Clinical & Laboratory Science, 2019, 49.1: 119-126. |
| e1071 | Miscellaneous functions of the Department of Statistics, Probability Theory Group (formerly E1071), TU Wien, including support vector machines and naive Bayes classifiers. | |
| GenABEL | A toolkit for genome-wide association analysis between quantitative or binary traits and single-nucleotide polymorphisms (SNPs). | |
| glmnet | Highly efficient methods for fitting the complete lasso or elastic-net regularization paths for linear regression, logistic and multinomial regression models, Poisson regression, Cox models, multiple-response Gaussian models, and grouped multinomial regression. | |
| h2o | An open-source machine learning platform providing parallelized implementations of numerous supervised and unsupervised machine learning algorithms. | |
| impute | A function that imputes a dataset and, in addition to the imputed dataset, returns an object containing the learned coefficients and related information. This object can then be used with a new dataset for re-imputation. | |
| limma | Linear models and identification of differentially expressed genes in microarray data. | |
| LMGene | Software for transforming data and identifying differentially expressed genes in gene expression arrays. | |
| mclust | Gaussian finite mixture models fitted with the Expectation-Maximization (EM) algorithm for model-based clustering, classification, and density estimation, including Bayesian regularization, dimension reduction for visualization, and resampling-based inference. | |
| mice | Multivariate Imputation by Chained Equations (MICE), with built-in imputation models for continuous, binary, unordered categorical, and ordered categorical data. | |
| MXNet | A simple yet robust interface for deep learning within the R programming language. | |
| neuralnet | Training of neural networks using backpropagation, with flexible configuration through a custom choice of error and activation functions. It also computes generalized weights. | |
| pls | Partial Least Squares and Principal Component Regression functions. | |
| randomForest | Implements the random forest algorithm for both classification and regression tasks. | |
| survival | The fundamental routines for survival analysis, such as multi-state curves, Cox models, and parametric accelerated failure time models. | |
| mixOmics | Multivariate methods for data exploration, integration of different biological data sets, and feature selection: (s)PLS, (s)PCA, and others. | |
| tidymodels | A collection of packages for machine learning, modeling, and data science that supports all steps before and after building a model. | |

OMICs machine learning in Python

Python is likewise widely used to combine machine learning methods with OMICs data analysis, in some areas more often than R.

The table below presents Python packages for OMICs machine learning based on Sidak, D. et al. (DOI: 10.3389/fmolb.2022.926623):

| Package / library | Description | Use for OMICs data |
|---|---|---|
| Captum | A scalable library for model interpretability built on the PyTorch framework. | |
| EasyNN | A package offering a user-friendly neural network that is designed to work with various datasets while providing customization options for users. | Conference abstract: Liang, Christine A., et al. Proteomics analysis of FLT3-ITD mutation in acute myeloid leukemia using deep learning neural network. Annals of Clinical & Laboratory Science, 2019, 49.1: 119-126. |
| GraKeL | An implementation of numerous widely recognized graph kernels. | |
| Gurobi Optimizer | The Gurobi Python API supports building and solving mathematical optimization models in code. | |
| LIBSVM | Integrated software for support vector classification, regression, and distribution estimation. | |
| PyTorch | An open-source machine learning framework that speeds up the transition from research prototyping to production deployment. PyTorch is an optimized tensor library for deep learning on both GPUs and CPUs. | |
| pymc3 | Probabilistic programming in Python. | Used for a machine learning Automated Recommendation Tool for synthetic biology. |
| scikit-learn | Straightforward and effective tools for predictive data analysis, built on NumPy, SciPy, and matplotlib. | |
| TensorFlow | Simplifies building machine learning models for desktop, mobile, web, and cloud applications, for both novices and experts. | |
| Theano | A library for efficiently defining, optimizing, and evaluating mathematical expressions, particularly those involving multi-dimensional arrays. | |
| TPOT | The Tree-based Pipeline Optimization Tool, which automates the process of model selection. | |
| WWL | The code accompanying the NeurIPS 2019 paper on Wasserstein Weisfeiler-Lehman Graph Kernels. | |
