
Data transformation and scaling using recipes R package

A part of data transformation & normalization – using available R packages


In the field of machine learning, the term "recipe" commonly denotes a series of procedures or transformations executed on a data set prior to model training. This critical step, not only in OMICs analysis, ensures that our data are well prepared, allowing the model to discern patterns and generate precise predictions. Analogous to the meticulous crafting of a culinary masterpiece, the 'recipes' package in R (https://recipes.tidymodels.org/) is a versatile tool for orchestrating these preparatory data transformations. Just as a chef carefully selects and combines ingredients to achieve a delectable dish, data scientists employ the 'recipes' package to methodically craft a tailored data preprocessing plan, enhancing the effectiveness of subsequent modeling endeavors.

Besides imputation, the recipes R package offers data transformation and scaling functions for metabolomics and lipidomics data preprocessing in machine learning. Here, we will show how to transform the entire data set and then scale it via recipes, starting with the log and square root transformations. As with data imputation, we repeat four steps to transform our metabolomics/lipidomics data:

  • generate a recipe,

  • define steps for data pre-processing,

  • prepare the recipe (calculate all values necessary for pre-processing),

  • and bake it (execute it for the data set).

We need to load the tidymodels library before starting:

# Calling tidymodels library
library(tidymodels)

Data transformation with recipes

# Data transformation via recipes
# 1. Log transformation
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- 
  recipe %>% 
  step_log(all_numeric()) # log() transformation, base = exp(1)

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Executing recipe:
data.log.transformed <- bake(recipe.prepared, new_data = data)

print(data.log.transformed)

The output:

The step_log() function has a base argument, so we can also perform a log10, log2, or log5 transformation:

# Using logarithm with a selected base for the transformation:
# Important: select only 1 option, don't run all three lines:
recipe <- 
  recipe %>% 
  step_log(all_numeric(), base = 10) # log10-transformation

recipe <- 
  recipe %>% 
  step_log(all_numeric(), base = 2) # log2-transformation

recipe <- 
  recipe %>% 
  step_log(all_numeric(), base = 5) # log5-transformation

# Then, continue with prep() and bake().

A simple change in this code allows for the square-root transformation:

# Data transformation via recipes
# 2. Square root transformation
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- 
  recipe %>% 
  step_sqrt(all_numeric()) 

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Executing recipe:
data.sqrt.transformed <- bake(recipe.prepared, new_data = data)

print(data.sqrt.transformed)

And the output:

Data centering with recipes

Based on the same principles as the imputation and transformation, the data can also be centered using recipes:

# Data centering via recipes:
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- 
  recipe %>% 
  step_center(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Executing recipe:
data.centered <- bake(recipe.prepared, new_data = data)

print(data.centered)

And the centered data:

Data scaling with recipes

Our data can be Auto-, Pareto-, Range-, Level-, and Vast-scaled via recipes. However, except for Autoscaling, the recipe-based approach to scaling is more complicated than using the mutate_if() function. Here, we will explain scaling via the recipes library, starting with Autoscaling, which is easily performed with the step_scale() function:

# Data scaling via recipes
# 1. Autoscaling
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Executing recipe:
data.transformed.centered.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.scaled)

The output:
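
Before showing the workaround for the remaining scaling methods, it helps to recall which factor each method divides the mean-centered values by. The helper function below is purely illustrative (it is not part of recipes); it only summarizes, in our own notation, the factors derived from the formulas of Robert A. van den Berg et al. that we will substitute into the prepared recipe:

# Illustrative helper (not part of recipes): the factor by which each scaling
# method divides the mean-centered values of a numeric column x
scaling_factor <- function(x, method = c("auto", "pareto", "range", "level", "vast")) {
  method <- match.arg(method)
  switch(method,
         auto   = sd(x),              # Autoscaling: standard deviation
         pareto = sqrt(sd(x)),        # Pareto scaling: square root of the standard deviation
         range  = max(x) - min(x),    # Range scaling: difference between max and min
         level  = mean(x),            # Level scaling: mean
         vast   = sd(x)^2 / mean(x))  # Vast scaling: sd squared divided by the mean
}

In the sections below, we compute exactly these factors with sapply() and substitute them for the standard deviations stored in the prepared recipe.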

Unfortunately, recipes does not allow for direct Pareto scaling of the numeric values, e.g., via a step_pareto() function or by modifying the arguments of the step_scale() function. The same applies to Range, Level, and Vast scaling. If you inspect the recipe.prepared object, you will see that it is a list of lists storing all the information required for data preprocessing. By accessing the elements of these lists, we can reach the vector of standard deviations applied to scale the data (for Autoscaling):

# Data scaling via recipes
# 2. Pareto scaling
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Accessing the lists of 'recipe.prepared':
recipe.prepared$steps[[3]]$sds

If we replace this vector with a vector containing the square roots of the standard deviations, we can perform Pareto scaling. The square root of the standard deviation of every column of our lipidomics tibble can again be computed using the sapply() function:

# Computing a vector with sqrt of columns' standard deviations:
sqrt.sd <- 
  data.log10.transformed %>%    # We use log10-transformed data here
  select(-`Sample Name`,       # We remove `Sample Name` and `Label` for these computations
         -`Label`) %>%
  sapply(function(x) sqrt(sd(x)))

# Replacing the sds vector stored in 'recipe.prepared' with the 'sqrt.sd' vector:
recipe.prepared$steps[[3]]$sds <- sqrt.sd

# Executing recipe:
data.transformed.centered.Pareto.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.Pareto.scaled)

The output:

If we compare the code above to scaling via mutate_if(), the second method is significantly simpler.
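
For comparison, a minimal sketch of Pareto scaling via mutate_if() is shown below (it assumes the log10-transformed data set from the earlier steps and may differ in detail from the code presented in the mutate() section):

# For comparison only: Pareto scaling with mutate_if() (a minimal sketch,
# assuming 'data.log10.transformed' holds the log10-transformed data)
data.Pareto.mutate <- 
  data.log10.transformed %>% 
  mutate_if(is.numeric, ~ (. - mean(.)) / sqrt(sd(.)))

print(data.Pareto.mutate)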

Range scaling in the sense used by recipes is performed differently from how we presented it earlier; therefore, we will only mention it briefly here. recipes provides the step_range() function, which has two arguments, min and max, defining the target range of the scaled values. By default, they are set to 0 and 1.

# Data scaling via recipes
# 3a. Range scaling (recipes version)
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_range(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Baking the recipe:
data.transformed.centered.Range.scaled <- bake(recipe.prepared, new_data = data)
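
To scale into a custom interval, the min and max arguments of step_range() can be set explicitly; the values below are only illustrative:

# Illustrative: mapping numeric columns onto the interval [-1, 1] instead of the default [0, 1]
recipe <- recipe(Label ~ ., data = data) %>% 
  step_range(all_numeric(), min = -1, max = 1)

# Then, continue with prep() and bake().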

The classic Range scaling, as presented by Robert A. van den Berg et al. in their manuscript, can again be performed by substituting the sds values in the 'recipe.prepared' list with the difference between the maximum and minimum values computed for each column:

# Data scaling via recipes
# 3b. Range scaling (according to the formula from the manuscript)
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Computing the difference between max and min column values for each column
diff.max.min <- 
  data.log10.transformed %>%    # We use log10-transformed data here
  select(-`Sample Name`,       # We remove `Sample Name` and `Label` for these computations
         -`Label`) %>%
  sapply(function(x) max(x)-min(x))
  
# Replacing the sds with the 'diff.max.min' vector:
recipe.prepared$steps[[3]]$sds <- diff.max.min

# Baking the recipe:
data.transformed.centered.Range.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.Range.scaled)

The final output:

We can perform Level scaling using the same 'trick':

# Data scaling via recipes
# 4. Level scaling (according to the formula from the manuscript)
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Computing the mean of each column
mean <- 
  data.log10.transformed %>%    # We use log10-transformed data here
  select(-`Sample Name`,       # We remove `Sample Name` and `Label` for these computations
         -`Label`) %>%
  sapply(function(x) mean(x))
  
# Replacing the sds with the 'mean' vector:
recipe.prepared$steps[[3]]$sds <- mean

# Baking the recipe:
data.transformed.centered.Level.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.Level.scaled)

The output:

And finally, the Vast scaling:

# Data scaling via recipes
# 5. Vast scaling (according to the formula from the manuscript)
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Computing sd^2/mean (the Vast scaling factor) for each column
vast <- 
  data.log10.transformed %>%    # We use log10-transformed data here
  select(-`Sample Name`,       # We remove `Sample Name` and `Label` for these computations
         -`Label`) %>%
  sapply(function(x) sd(x)^2/mean(x))

# Replacing the sds with the 'vast' vector:
recipe.prepared$steps[[3]]$sds <- vast

# Baking the recipe:
data.transformed.centered.Vast.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.Vast.scaled)

The output:
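
As a final, quick sanity check (a minimal sketch, not required for the workflow), you can inspect the column means and standard deviations of any baked data set. For the Autoscaled data, for example, the means should be approximately 0 and the standard deviations approximately 1:

# Quick sanity check (illustrative): for Autoscaled data, column means should be ~0
# and column standard deviations ~1
data.transformed.centered.scaled %>% 
  select(where(is.numeric)) %>% 
  summarise(across(everything(), list(mean = mean, sd = sd)))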
