Data imputation using recipes library (tidymodels)

A part of the missing values – data imputation section

The tidymodels collection provides the recipes package, which contains tools for data preprocessing, including data imputation. As part of a collection dedicated to machine learning, this package is usually applied while building models. However, it can also serve as an alternative to dplyr and tidyr for preprocessing data (see the subchapter Basic data imputation in R with dplyr and tidyr). Here, we will briefly introduce recipes to you.

The principle is relatively simple (a condensed toy example follows this list):

  1. A recipe is created.

  2. You define the pre-processing steps within the recipe you created.

  3. The recipe is prepared for use (all parameters required for pre-processing are computed).

  4. The recipe is 'baked', i.e. applied to the data using the parameters computed in step 3.
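
Here is that condensed, self-contained sketch of all four steps on a small, hypothetical toy tibble (the object names and values are ours, purely for illustration, not part of the book's data set):

# A toy tibble with missing entries (hypothetical values):
library(tidymodels)

toy <- tibble(
  Label  = c("A", "A", "B", "B"),
  lipid1 = c(1.2, NA, 3.4, 2.1),
  lipid2 = c(NA, 0.8, 0.9, NA)
)

# Steps 1 and 2: create the recipe and define the imputation step:
toy.recipe <- recipe(Label ~ ., data = toy) %>%
  step_impute_mean(all_numeric())

# Step 3: prepare the recipe (compute the column means):
toy.prepared <- prep(toy.recipe, training = toy)

# Step 4: bake, i.e. apply the recipe to the data:
toy.imputed <- bake(toy.prepared, new_data = toy)

In 'toy.imputed', the NA in lipid1 is replaced with the mean of the remaining lipid1 values (here approx. 2.23), and analogously for lipid2.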

Now, let's translate these steps into R code for our own data set. This time, we will need the tidymodels collection:

# Calling library
library(tidymodels)

To build a recipe, we need to define the response variable(s), the predictors, and the data set that is the source of this information (the recipe needs the names and types of the data). The response variable is the column in our tibble defining the biological group to which patients were assigned based on their diagnosis. The predictors (the variables used for predicting the response) are, in our case, all the lipid concentrations we measured. To produce a recipe, we write the following line of code:

# Creating a recipe and assigning it to 'recipe': 
recipe <- recipe(Label ~ ., data = data.missing)

NOTE: Here, you see another application of the '~' (tilde) symbol in R. It separates the left side (response variable(s)) from the right side, which usually defines the predictors (lipids or metabolites), while building statistical models or performing hypothesis testing. For selected predictors, we would list them explicitly, e.g. Label ~ `CE 16:1` + `CE 16:0` + `CE 18:2` + ... (note the backticks, which are required for column names containing spaces or colons). If we use a dot instead of the names of selected lipids or metabolites (Label ~ .), we indicate that all remaining variables are predictors.
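
As a quick illustration (the explicitly listed lipid names are just examples), both variants of the formula look like this:

# Explicitly listed predictors (backticks for non-syntactic names):
recipe(Label ~ `CE 16:1` + `CE 16:0` + `CE 18:2`, data = data.missing)

# The dot shorthand - all remaining columns become predictors:
recipe(Label ~ ., data = data.missing)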

If executing the line of code above creates a list named 'recipe' in your global environment, we can move to the next step. Now, we need to define what preprocessing should be performed using our recipe:

# Define operations in the recipe:
recipe <- 
  recipe %>% 
  step_impute_mean(all_numeric())

# Explanations:
# Take the 'recipe' object from the global environment,
# Push it through the pipe to step_impute_mean(),
# Apply mean-based imputation to all columns containing numeric entries,
# Store the output in 'recipe'.
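
Mean imputation is only one of several imputation steps available in recipes. If, for instance, you prefer median- or k-nearest-neighbor-based imputation, the step is swapped in exactly the same way (a sketch; in this chapter we stay with the mean, and neighbors = 5 is an arbitrary choice):

# Median-based imputation:
recipe %>% step_impute_median(all_numeric())

# k-nearest-neighbor imputation:
recipe %>% step_impute_knn(all_numeric(), neighbors = 5)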

Once all pre-processing steps are defined, we need to 'equip' our recipe with all the parameters it needs for our pre-processing. For example, for data imputation by the mean, our recipe will need the mean of every numeric column in the 'data.missing' tibble. This is exactly what happens at this step: using the data supplied through the 'training' argument, prep() computes all the necessary parameters so that they can be used in the next step. Here is the necessary line of code:

# Computing all parameters recipe will need for pre-processing:
recipe.prepared <- prep(recipe, training = data.missing)

# Explanations:
# Take 'recipe' from the global environment,
# Compute all parameters necessary to pre-process data as defined above,
# Store the ready-to-apply-to-tibble recipe as 'recipe.prepared'.
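
If you want to check what prep() actually computed, the tidy() method lets you inspect the stored parameters. For our mean-imputation step (step number 1), it should return the mean of every numeric column (a quick sketch, assuming the objects defined above):

# Inspecting the parameters computed for step 1 (mean imputation):
tidy(recipe.prepared, number = 1)

# Returns a tibble with one row per numeric column:
# the column name (terms) and its computed mean (value).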

As our recipe is ready, we can bake it, i.e. apply it to our data set:

# Executing recipe:
data.imputed <- bake(recipe.prepared, new_data = data.missing)

# Explanations:
# Take 'recipe.prepared' from the global environment,
# Perform data imputation on the data set delivered with new_data argument.
# Store the imputed data as 'data.imputed' in the global environment.

The bake() function takes a new_data argument. When a model is built, the data are usually split into training and test sets. Both sets can then be pre-processed consistently with the same recipe, since you can supply each of them separately to bake() through the new_data argument.
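
As a sketch of that scenario (using initial_split() from rsample, which is also loaded with tidymodels; the 0.8 split proportion is an arbitrary choice):

# Splitting the data into training and test sets:
data.split <- initial_split(data.missing, prop = 0.8)
data.train <- training(data.split)
data.test  <- testing(data.split)

# Preparing the recipe on the training set only:
recipe.prepared <- prep(recipe, training = data.train)

# Baking both sets with parameters computed from the training set:
train.imputed <- bake(recipe.prepared, new_data = data.train)
test.imputed  <- bake(recipe.prepared, new_data = data.test)

Computing the imputation means on the training set only, and reusing them for the test set, prevents information from the test set leaking into model training.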

The R console after executing the code above:

[Screenshot: R console output from creating the recipe for data imputation (recipes package, tidymodels collection).]

We obtain the same output as with mutate_if() and replace_na() (see the subchapter Basic data imputation in R with dplyr and tidyr):

[Screenshot: Replacing NA entries with column means via the recipes package (tidymodels).]
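
A quick way to confirm that the imputation worked is to count the remaining missing entries (a simple check, not part of the original code):

# Counting NA entries before and after imputation:
sum(is.na(data.missing))  # greater than zero before imputation
sum(is.na(data.imputed))  # should return 0 after imputation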

