
Data transformation and scaling using recipes R package

A part of data transformation & normalization – using available R packages


In the field of machine learning, the term "recipe" commonly denotes a series of procedures or transformations executed on a data set prior to model training. This critical step, not only in OMICs analysis, ensures that our data are well prepared, allowing the model to discern patterns and generate precise predictions. Analogous to the meticulous crafting of a culinary masterpiece, the 'recipes' package in R (https://recipes.tidymodels.org/) is a versatile tool for orchestrating these preparatory data transformations. Just as a chef carefully selects and combines ingredients to achieve a delectable dish, data scientists employ the 'recipes' package to methodically craft a tailored data preprocessing plan, enhancing the effectiveness of subsequent modeling endeavors.

Besides imputation, the recipes R package offers data transformation and scaling functions for metabolomics and lipidomics data preprocessing in machine learning. Here, we will show how to transform the entire data set and then scale it via recipes, starting with the log and square root transformations. As with data imputation, we repeat four steps to transform our metabolomics/lipidomics data:

  • generate a recipe,

  • define steps for data pre-processing,

  • prepare the recipe (calculate all values necessary for pre-processing),

  • and bake it (execute it for the data set).

We need to load the tidymodels library before starting:

# Calling tidymodels library
library(tidymodels)

Data transformation with recipes

# Data transformation via recipes
# 1. Log transformation
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- 
  recipe %>% 
  step_log(all_numeric()) # log() transformation, base = exp(1)

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Executing recipe:
data.log.transformed <- bake(recipe.prepared, new_data = data)

print(data.log.transformed)

The output:

The step_log() function has a base argument, so we can also perform a log10, log2, or log5 transformation:

# Using logarithm with a selected base for the transformation:
# Important: select only 1 option, don't run all three lines:
recipe <- 
  recipe %>% 
  step_log(all_numeric(), base = 10) # log10-transformation

recipe <- 
  recipe %>% 
  step_log(all_numeric(), base = 2) # log2-transformation

recipe <- 
  recipe %>% 
  step_log(all_numeric(), base = 5) # log5-transformation

# Then, continue with prep() and bake().

A simple change in this code allows for the square-root transformation:

# Data transformation via recipes
# 2. Square root transformation
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- 
  recipe %>% 
  step_sqrt(all_numeric()) 

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Executing recipe:
data.sqrt.transformed <- bake(recipe.prepared, new_data = data)

print(data.sqrt.transformed)

And the output:

Data centering with recipes

Based on the same principles as the imputation and transformation, the data can also be centered using recipes:

# Data centering via recipes:
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- 
  recipe %>% 
  step_center(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Executing recipe:
data.centered <- bake(recipe.prepared, new_data = data)

print(data.centered)

And the centered data:

Data scaling with recipes

Our data can be Auto-, Pareto-, Range-, Level-, and Vast-scaled via recipes. However, except for Autoscaling, the recipe-based approach to scaling is more complicated than using the mutate_if() function. Here, we will explain scaling via the recipes library, starting with Autoscaling, which is easily performed with the step_scale() function:

# Data scaling via recipes
# 1. Autoscaling
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Executing recipe:
data.transformed.centered.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.scaled)

The output:
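
Before showing the workaround for the remaining scaling methods, it helps to recall which factor each method divides the mean-centered values by. The helper function below is purely illustrative (it is not part of recipes); it only summarizes, in our own notation, the factors derived from the formulas of Robert A. van den Berg et al. that we will substitute into the prepared recipe:

# Illustrative helper (not part of recipes): the factor by which each scaling
# method divides the mean-centered values of a numeric column x
scaling_factor <- function(x, method = c("auto", "pareto", "range", "level", "vast")) {
  method <- match.arg(method)
  switch(method,
         auto   = sd(x),              # Autoscaling: standard deviation
         pareto = sqrt(sd(x)),        # Pareto scaling: square root of the standard deviation
         range  = max(x) - min(x),    # Range scaling: difference between max and min
         level  = mean(x),            # Level scaling: mean
         vast   = sd(x)^2 / mean(x))  # Vast scaling: sd squared divided by the mean
}

In the sections below, we compute exactly these factors with sapply() and substitute them for the standard deviations stored in the prepared recipe.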

Unfortunately, recipes does not allow for direct Pareto scaling of the numeric values, e.g., via a step_pareto() function or by modifying the arguments of the step_scale() function. The same applies to Range, Level, and Vast scaling. If you inspect the recipe.prepared object, you will see that it is a list of lists storing all the information required for data preprocessing. By accessing the elements of these lists, we can reach the vector of standard deviations applied to scale the data (for Autoscaling):

# Data scaling via recipes
# 2. Pareto scaling
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Accessing the lists of 'recipe.prepared':
recipe.prepared$steps[[3]]$sds

If we replace this vector with a vector containing the square roots of the standard deviations, we can perform Pareto scaling. The square root of the standard deviation of every column of our lipidomics tibble can again be computed using the sapply() function:

# Computing a vector with sqrt of columns' standard deviations:
sqrt.sd <- 
  data.log10.transformed %>%    # We use log10-transformed data here
  select(-`Sample Name`,       # We remove `Sample Name` and `Label` for these computations
         -`Label`) %>%
  sapply(function(x) sqrt(sd(x)))

# Replacing the sds vector stored in 'recipe.prepared' with the 'sqrt.sd' vector:
recipe.prepared$steps[[3]]$sds <- sqrt.sd

# Executing recipe:
data.transformed.centered.Pareto.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.Pareto.scaled)

The output:

If we compare the code above to scaling via mutate_if(), the second method is significantly simpler.
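
For comparison, a minimal sketch of Pareto scaling via mutate_if() is shown below (it assumes the log10-transformed data set from the earlier steps and may differ in detail from the code presented in the mutate() section):

# For comparison only: Pareto scaling with mutate_if() (a minimal sketch,
# assuming 'data.log10.transformed' holds the log10-transformed data)
data.Pareto.mutate <- 
  data.log10.transformed %>% 
  mutate_if(is.numeric, ~ (. - mean(.)) / sqrt(sd(.)))

print(data.Pareto.mutate)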

Range scaling in the sense used by recipes is performed differently from how we presented it earlier; therefore, we will only mention it briefly here. recipes provides the step_range() function, which has two arguments, min and max, defining the target range of the scaled values. By default, they are set to 0 and 1.

# Data scaling via recipes
# 3a. Range scaling (recipes version)
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_range(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Baking the recipe:
data.transformed.centered.Range.scaled <- bake(recipe.prepared, new_data = data)
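
To scale into a custom interval, the min and max arguments of step_range() can be set explicitly; the values below are only illustrative:

# Illustrative: mapping numeric columns onto the interval [-1, 1] instead of the default [0, 1]
recipe <- recipe(Label ~ ., data = data) %>% 
  step_range(all_numeric(), min = -1, max = 1)

# Then, continue with prep() and bake().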

The classic Range scaling, as presented by Robert A. van den Berg et al. in their manuscript, can again be performed by substituting the sds values in the 'recipe.prepared' list with the difference between the maximum and minimum values computed for each column:

# Data scaling via recipes
# 3b. Range scaling (according to the formula from the manuscript)
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Computing the difference between max and min column values for each column
diff.max.min <- 
  data.log10.transformed %>%    # We use log10-transformed data here
  select(-`Sample Name`,       # We remove `Sample Name` and `Label` for these computations
         -`Label`) %>%
  sapply(function(x) max(x)-min(x))
  
# Replacing the sds with the 'diff.max.min' vector:
recipe.prepared$steps[[3]]$sds <- diff.max.min

# Baking the recipe:
data.transformed.centered.Range.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.Range.scaled)

The final output:

We can perform Level scaling using the same 'trick':

# Data scaling via recipes
# 4. Level scaling (according to the formula from the manuscript)
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Computing the mean of each column
mean <- 
  data.log10.transformed %>%    # We use log10-transformed data here
  select(-`Sample Name`,       # We remove `Sample Name` and `Label` for these computations
         -`Label`) %>%
  sapply(function(x) mean(x))
  
# Replacing the sds with the 'mean' vector:
recipe.prepared$steps[[3]]$sds <- mean

# Baking the recipe:
data.transformed.centered.Level.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.Level.scaled)

The output:

And finally, the Vast scaling:

# Data scaling via recipes
# 5. Vast scaling (according to the formula from the manuscript)
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- recipe %>% 
  step_log(all_numeric(), base = 10) %>%
  step_center(all_numeric()) %>% # We need to center the data separately in recipes
  step_scale(all_numeric())

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Computing sd^2/mean (the Vast scaling factor) for each column
vast <- 
  data.log10.transformed %>%    # We use log10-transformed data here
  select(-`Sample Name`,       # We remove `Sample Name` and `Label` for these computations
         -`Label`) %>%
  sapply(function(x) sd(x)^2/mean(x))

# Replacing the sds with the 'vast' vector:
recipe.prepared$steps[[3]]$sds <- vast

# Baking the recipe:
data.transformed.centered.Vast.scaled <- bake(recipe.prepared, new_data = data)

print(data.transformed.centered.Vast.scaled)

The output:
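
As a final, quick sanity check (a minimal sketch, not required for the workflow), you can inspect the column means and standard deviations of any baked data set. For the Autoscaled data, for example, the means should be approximately 0 and the standard deviations approximately 1:

# Quick sanity check (illustrative): for Autoscaled data, column means should be ~0
# and column standard deviations ~1
data.transformed.centered.scaled %>% 
  select(where(is.numeric)) %>% 
  summarise(across(everything(), list(mean = mean, sd = sd)))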
