Omics data visualization in R and Python

Application of selected models to OMICs data

Elements of machine learning using OMICs data

Machine learning in OMICs – Introduction

Machine learning methods, including deep learning and ensemble techniques, are widely used to extract insights from OMICs data. These algorithms can identify complex patterns, classify molecular profiles, and predict associations in large biological datasets. Machine learning can make data analysis more efficient and precise, and it can reveal biological relationships that were previously unnoticed.

Metabolomics prediction: operates at the level of metabolites, the downstream products of various cellular processes.

Lipidomics prediction: specifically addresses lipids.

Genomic prediction: operates at the level of the genetic code, focusing on the instructions encoded in DNA and their roles in cellular structure and function.

In general, machine learning approaches can be divided into three types of learning:

  • supervised,

  • unsupervised,

  • and semi-supervised.
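The difference between the three types lies mainly in how much label information the model receives. The following is a minimal sketch, assuming scikit-learn and NumPy are installed and using a synthetic matrix as a stand-in for real OMICs data. The data, the choice of models, and all parameter values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in for an OMICs matrix: 60 samples x 5 features,
# two groups whose feature means differ (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 5)),
               rng.normal(3.0, 1.0, size=(30, 5))])
y = np.array([0] * 30 + [1] * 30)

# Supervised: every sample has a label, and the model learns X -> y.
supervised = LogisticRegression().fit(X, y)

# Unsupervised: no labels are used; the model only groups similar samples.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Semi-supervised: only a few samples are labelled; -1 marks "unlabelled".
labelled_idx = [0, 1, 2, 30, 31, 32]
y_partial = np.full_like(y, -1)
y_partial[labelled_idx] = y[labelled_idx]
semi = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)

print("supervised training accuracy:", supervised.score(X, y))
print("cluster sizes:", np.bincount(unsupervised.labels_))
print("labels inferred for all samples:", semi.transduction_)
```

Note that the cluster numbers returned by the unsupervised model are arbitrary and need not match the class labels used in the supervised case.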

General Individual Steps for Machine Learning in OMICs

  1. Preprocessing:

Preprocessing OMICs data involves cleaning and transforming the raw data to prepare it for analysis. This step includes handling missing values, normalizing data across different samples, and addressing challenges unique to OMICs datasets, such as batch effects or platform-specific variations.

More information is in the first part of this GitBook – Pre-processing steps.

  2. Normalization:

Normalizing OMICs data is crucial to ensure that different molecular profiles (metabolomics, lipidomics, etc.) are on a comparable scale. This step is particularly important when integrating multiple OMICs datasets or when using machine learning algorithms that are sensitive to differences in data magnitude.

More information is in this GitBook – Normalization steps.

  3. Division into Training and Testing:

The OMICs dataset is partitioned into a training set and a testing set. The training set is used to train the machine learning model to recognize patterns within the specific OMICs data, while the testing set is reserved to assess the model's predictive performance on unseen samples.

  4. Training:

During the training phase, the machine learning model learns from the patterns present in the training set of OMICs data. This can involve identifying molecular signatures associated with specific outcomes, such as disease states or treatment responses.

  5. Testing:

The trained model is then tested on a separate set of OMICs samples not used during the training phase. This evaluation assesses the model's ability to generalize its predictions to new, unseen OMICs data, providing insights into its real-world predictive capabilities.

  6. Validation:

In the context of predicting OMICs data, a validation set might be employed to fine-tune model parameters, especially hyperparameters specific to the chosen machine learning algorithm. This tuning is done before the final evaluation on the testing set, so that the testing samples remain unseen. This step ensures that the model's performance is optimized for the nuances of the OMICs dataset.

These steps collectively form the predictive modeling workflow tailored to OMICs data, emphasizing the need for meticulous preprocessing and normalization to extract meaningful insights from complex molecular datasets. The ultimate goal is to deploy robust models capable of making accurate predictions in OMICs research.
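The steps above can be sketched in plain Python so that each stage stays visible. This is a minimal illustration under stated assumptions, not a definitive implementation. The data are synthetic and assumed to be already cleaned, and the k-nearest-neighbour classifier, the 60/20/20 split, and the candidate values of k are hypothetical choices. The sketch estimates the scaling parameters after the split, on the training set only, and tunes on the validation set before the final test. This ordering keeps information from the held-out samples out of the model.

```python
import math
import random
import statistics

random.seed(42)  # reproducible synthetic data and shuffling

# Hypothetical data: 120 samples x 4 "metabolite" features, two classes
# whose feature means differ. Real data would be loaded from a file.
def make_sample(label):
    centre = 10.0 if label == 0 else 13.0
    return [random.gauss(centre, 1.0) for _ in range(4)], label

data = [make_sample(i % 2) for i in range(120)]
random.shuffle(data)

# Division: 60% training, 20% validation, 20% testing.
train, valid, test = data[:72], data[72:96], data[96:]

# Normalization: z-score parameters are estimated on the training set only,
# then reused for the validation and testing samples.
columns = list(zip(*(features for features, _ in train)))
means = [statistics.mean(col) for col in columns]
sds = [statistics.stdev(col) for col in columns]

def scale(features):
    return [(x - m) / s for x, m, s in zip(features, means, sds)]

def scale_set(samples):
    return [(scale(features), label) for features, label in samples]

train_s, valid_s, test_s = scale_set(train), scale_set(valid), scale_set(test)

# Training and prediction: a k-nearest-neighbour classifier "learns" by
# storing the training samples and voting among the k closest ones.
def predict(features, k):
    nearest = sorted(train_s, key=lambda item: math.dist(features, item[0]))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def accuracy(samples, k):
    hits = sum(predict(features, k) == label for features, label in samples)
    return hits / len(samples)

# Validation: choose the hyperparameter k on the validation set.
candidate_k = [1, 3, 5, 7]
best_k = max(candidate_k, key=lambda k: accuracy(valid_s, k))

# Testing: one final estimate on samples never used above.
test_accuracy = accuracy(test_s, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

In practice, libraries such as scikit-learn (Python) or caret and tidymodels (R), which are listed on this page, provide ready-made functions for splitting, scaling, tuning, and evaluation.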

A review of useful software packages for implementing machine learning in the predominant programming languages of computational biology (namely, R and Python) is given in the study by Sidak et al., 2022 (DOI: 10.3389/fmolb.2022.926623).

Here, we have extracted a subset of the packages implemented in R and Python and provide a more extensive overview of them.

OMICs machine learning in R

Integrating machine learning methods with OMICs data analysis in the R programming language has become a widely used approach in computational biology.

The table below presents R packages for OMICs machine learning based on Sidak, D. et al. (DOI: 10.3389/fmolb.2022.926623):

Package / library name, followed by its description and, where given, an example of its use for OMICs data:

caret

Training for Classification and Regression; functions designed for training and visualizing both classification and regression models. Assess feature importance as part of the evaluation process.

cluster

Cluster analysis techniques that build upon the foundation established by Peter Rousseeuw, Anja Struyf, and Mia Hubert. This extension is rooted in the work of Kaufman and Rousseeuw (1990) titled "Finding Groups in Data."

clusterGeneration

Random Cluster Generation with a Defined Degree of Separation.

deepnet

A comprehensive deep learning toolkit encompassing implementations of various deep learning architectures and neural network algorithms.

Conference abstract: LIANG, Christine A., et al. Proteomics analysis of FLT3ITD mutation in acute myeloid leukemia using deep learning neural network. Annals of Clinical & Laboratory Science, 2019, 49.1: 119-126

e1071

Miscellaneous functions of the Department of Statistics, Probability Theory Group (formerly E1071), TU Wien, including support vector machines, naive Bayes classifiers, and fuzzy clustering.

GenABEL

A toolkit for conducting genome-wide association analysis, exploring the relationships between quantitative or binary traits and single-nucleotide polymorphisms (SNPs).

glmnet

Highly efficient methods for fitting the complete lasso or elastic-net regularization paths in linear regression, logistic and multinomial regression models, Poisson regression, Cox models, multiple-response Gaussian models, and grouped multinomial regression.

h2o

An open-source machine learning platform providing parallelized implementations of numerous supervised and unsupervised machine learning algorithms.

impute

A function that conducts imputation on a dataset and, in addition to the imputed dataset, yields an object containing learned coefficients and informative data. This object can subsequently be utilized, along with a new dataset, for re-imputation.

limma

Linear models and identification of differentially expressed genes in microarray data.

LMGene

Software for transforming data and identifying differentially expressed genes in gene expression arrays.

mclust

Finite mixture models with Gaussian distributions fitted using the Expectation-Maximization (EM) algorithm. These models are employed for model-based clustering, classification, and density estimation, featuring Bayesian regularization, dimension reduction for visualization, and resampling-based inference.

mice

The MICE algorithm, or Multivariate Imputation by Chained Equations, includes built-in imputation models tailored for continuous data, binary data, unordered categorical data, and ordered categorical data.

MXNet

MXNet provides a simple yet robust interface for employing deep learning within the R programming language.

neuralnet

Training neural networks using backpropagation is facilitated by this package. It permits flexible configurations by enabling the custom choice of error and activation functions. Additionally, it incorporates the computation of generalized weights.

pls

Partial Least Squares and Principal Component Regression functions.

randomForest

Implements the random forest algorithm for both classification and regression tasks.

survival

Contains the fundamental routines for survival analysis, such as multi-state curves, Cox models, and parametric accelerated failure time models.

mixOmics

Multivariate methods for data exploration, integration of different biological data sets, and feature selection: (s)PLS, (s)PCA, and others.

ropls

Pipelines for implementing (O)PLS(-DA) and PCA based on the classic NIPALS algorithm. Bioconductor package.

tidymodels

Tidymodels is a collection of packages for machine learning, modeling, data science, and facilitating all processes before/after building a model.

OMICs machine learning in Python

Machine learning for OMICs data analysis is also widely carried out in the Python programming language, in some areas more often than in R.

The table below presents Python packages for OMICs machine learning based on Sidak, D. et al. (DOI: 10.3389/fmolb.2022.926623):

Package / library name, followed by its description and, where given, an example of its use for OMICs data:

Captum

A scalable library for model interpretability developed on the PyTorch framework.

EasyNN

A package created to offer a user-friendly Neural Network, designed to function seamlessly with various datasets while providing customization options for users.

Conference abstract: LIANG, Christine A., et al. Proteomics analysis of FLT3ITD mutation in acute myeloid leukemia using deep learning neural network. Annals of Clinical & Laboratory Science, 2019, 49.1: 119-126.

GraKeL

An implementation of numerous widely recognized graph kernels.

Gurobi Optimizer

The Gurobi Python API facilitates mathematical optimization through coded modeling.

Keras

A high-level deep learning API for building and training neural networks. Its developer guides cover specific topics such as layer subclassing, fine-tuning, and model saving.

LIBSVM program

Integrated software for support vector classification, regression, and distribution estimation.

PyTorch

An open-source machine learning framework expediting the transition from research prototyping to production deployment. PyTorch is an optimized tensor library for deep learning on both GPUs and CPUs.

pymc3

Probabilistic programming in Python. Used, for example, in a machine learning Automated Recommendation Tool for synthetic biology.

scikit-learn

Straightforward and effective tools for predictive data analysis, developed on top of NumPy, SciPy, and matplotlib.

TensorFlow

TensorFlow simplifies the process for both novices and experts to build machine learning models suitable for desktop, mobile, web, and cloud applications.

Theano

A library enabling you to define, optimize, and assess mathematical expressions efficiently, particularly those involving multi-dimensional arrays.

TPOT

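TPOT, which stands for Tree-based Pipeline Optimization Tool, is an automated machine learning tool that searches for well-performing pipelines (preprocessing, model selection, and hyperparameters) using genetic programming.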

WWL

This repository includes the corresponding code for the NeurIPS 2019 paper on Wasserstein Weisfeiler-Lehman Graph Kernels.


Last updated 7 months ago

https://doi.org/10.1021/acs.jproteome.7b00595
https://doi.org/10.1016/j.ymeth.2019.03.004
https://doi.org/10.3390/metabo7020030
https://doi.org/10.1021/acs.analchem.7b03795
https://doi.org/10.1016/j.aca.2018.02.045
https://doi.org/10.1007/s11306-017-1239-2
https://doi.org/10.1007/s11306-017-1239-2
https://www.mdpi.com/2072-6694/14/19/4622
https://doi.org/10.1021/acs.jproteome.7b00595
https://doi.org/10.1038/ncomms13090
https://doi.org/10.1016/j.ccell.2020.09.014
https://doi.org/10.1021/acs.jproteome.7b00595
https://doi.org/10.1038/ncomms13090
https://doi.org/10.1016/j.cell.2019.04.016
https://doi.org/10.1038/ncomms13090
https://doi.org/10.1007/s11306-017-1239-2
https://doi.org/10.1016/j.aca.2018.02.045
https://doi.org/10.3390/metabo7020030
https://doi.org/10.1021/acs.analchem.7b03795
https://doi.org/10.1021/acs.analchem.7b03795
https://doi.org/10.15252/msb.20188497
http://mixomics.org/case-studies/
https://bioconductor.org/packages/devel/bioc/vignettes/ropls/inst/doc/ropls-vignette.html
https://pubs.acs.org/doi/10.1021/acs.jproteome.5b00354
https://www.frontiersin.org/articles/10.3389/fimmu.2021.742736/full#B15
https://www.tidymodels.org/
https://www.mdpi.com/2072-6694/14/19/4622
https://www.nature.com/articles/s41398-021-01632-z
https://doi.org/10.1093/bioinformatics/btaa866
https://doi.org/10.1093/bioinformatics/btaa655
https://doi.org/10.1016/j.cell.2019.04.016
https://doi.org/10.1093/bioinformatics/btab285
https://doi.org/10.1038/s42256-020-00244-4
https://doi.org/10.1073/pnas.2002959117
https://doi.org/10.1016/j.ymeth.2019.03.004
https://doi.org/10.1007/s11306-019-1612-4
https://doi.org/10.1016/j.cels.2016.03.001
https://doi.org/10.1016/j.ccell.2020.09.014
https://doi.org/10.1093/bioinformatics/btaa866
https://doi.org/10.1109/BIBM.2018.8621345
https://doi.org/10.1038/s41467-020-18008-4
https://doi.org/10.1016/j.cell.2019.04.016
https://doi.org/10.1186/s12859-021-04209-1
https://doi.org/10.1038/s41467-020-18008-4
https://doi.org/10.15252/msb.20188497
https://doi.org/10.1038/nmeth.4627
https://doi.org/10.1038/s41540-018-0054-3
https://doi.org/10.1093/bioinformatics/btaa655
Sidak, D., Schwarzerová, J., Weckwerth, W. and Waldherr, S., 2022. Interpretable machine learning methods for predictions in systems biology from omics data. Frontiers in Molecular Biosciences, 9, p.926623.