Data transformation and scaling using the recipes R package
A part of Data transformation & normalization using available R packages
In the field of machine learning, the term "recipe" commonly denotes a series of procedures or transformations executed on a data set prior to model training. This critical step, not only in OMICs analysis, ensures that our data are well-prepared, allowing the model to discern patterns and generate precise predictions. The 'recipes' package in R (https://recipes.tidymodels.org/) is a versatile tool for orchestrating these preparatory data transformations: just as a chef carefully selects and combines ingredients to achieve a delectable dish, data scientists employ the 'recipes' package to methodically craft a tailored data preprocessing plan, enhancing the effectiveness of subsequent modeling endeavors.
Among its preprocessing functions for metabolomics and lipidomics data in machine learning, the recipes R package also offers data transformation and scaling. Here, we will show how to transform the entire data set and then scale it via recipes, starting with the log and square-root transformations. As with data imputation, we need to repeat four steps to transform our metabolomics/lipidomics data:
generate a recipe,
define steps for data pre-processing,
prepare the recipe (calculate all values necessary for pre-processing),
and bake it (execute it for the data set).
We need to load the tidymodels library before starting:
# Calling tidymodels library
library(tidymodels)
Data transformation with recipes
# Data transformation via recipes
# 1. Log transformation
# Generating recipe
recipe <- recipe(Label ~ ., data = data)
# Defining pre-processing steps
recipe <-
recipe %>%
step_log(all_numeric()) # log() transformation, base = exp(1)
# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)
# Executing recipe:
data.log.transformed <- bake(recipe.prepared, new_data = data)
print(data.log.transformed)
The output:

The step_log() function has a base argument, so we can also perform log10, log2, or log5 transformations:
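For example, a log10 transformation can be sketched as follows; note that the small tibble 'data', with a Label column and two lipid-like numeric columns, is only a hypothetical placeholder for your own data set:

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# Log10 transformation: set the base argument of step_log()
recipe <- recipe(Label ~ ., data = data) %>%
  step_log(all_numeric(), base = 10)   # base = 2 or base = 5 work analogously

# Preparing and executing the recipe
recipe.prepared <- prep(recipe, training = data)
data.log10.transformed <- bake(recipe.prepared, new_data = data)
print(data.log10.transformed)
```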
A simple change in this code allows for the square-root transformation:
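A minimal sketch of the square-root transformation, again using a hypothetical placeholder tibble 'data', only swaps step_log() for step_sqrt():

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# Square-root transformation via step_sqrt()
recipe <- recipe(Label ~ ., data = data) %>%
  step_sqrt(all_numeric())

# Preparing and executing the recipe
recipe.prepared <- prep(recipe, training = data)
data.sqrt.transformed <- bake(recipe.prepared, new_data = data)
print(data.sqrt.transformed)
```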
And the output:

Data centering with recipes
Based on similar principles to the imputation and transformation, the data can also be centered using recipes:
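A minimal sketch of centering via step_center(), using a hypothetical placeholder tibble 'data', could look like this:

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# Mean-centering via step_center()
recipe <- recipe(Label ~ ., data = data) %>%
  step_center(all_numeric())

# Preparing and executing the recipe
recipe.prepared <- prep(recipe, training = data)
data.centered <- bake(recipe.prepared, new_data = data)
print(data.centered)
```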
And the centered data:

Data scaling with recipes
Our data can be Auto-, Pareto-, Range-, Level-, and Vast-scaled via recipes. However, the recipe-based method for data scaling, except for Autoscaling, is more complicated than using the mutate_if() function. Here, we will explain scaling via the recipes library. First, we will start with Autoscaling, which is easily performed via the step_scale() function:
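A sketch of Autoscaling (mean-centering followed by division by the standard deviation), again on a hypothetical placeholder tibble 'data':

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# Autoscaling: centering (step_center) plus unit-variance scaling (step_scale)
recipe <- recipe(Label ~ ., data = data) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

# Preparing and executing the recipe
recipe.prepared <- prep(recipe, training = data)
data.autoscaled <- bake(recipe.prepared, new_data = data)
print(data.autoscaled)
```

Note that recipes also offers step_normalize(), which performs the centering and scaling in a single step.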
The output:

Unfortunately, recipes do not allow for direct Pareto scaling of the numeric values, e.g., via a step_pareto() function or by modifying the arguments of the step_scale() function. The same applies to Range, Level, and Vast scaling. If you check the object recipe.prepared, it is a list that stores further lists containing all the information required for the data preprocessing. If we access the elements of these lists, we reach the vector of standard deviations applied to scale the data (for Autoscaling):
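Using a hypothetical placeholder tibble 'data', the vector of standard deviations can be reached inside the prepared recipe; with step_scale() as the second step, its trained values sit in recipe.prepared$steps[[2]] (an internal detail of the recipes package that may change between versions):

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# Autoscaling recipe: step_center() first, step_scale() second
recipe <- recipe(Label ~ ., data = data) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

recipe.prepared <- prep(recipe, training = data)

# The trained step_scale object (the second step here) stores the standard
# deviations used for scaling in its 'sds' element:
recipe.prepared$steps[[2]]$sds
```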

If we replace this vector with a vector containing the square roots of the standard deviations, we can perform Pareto scaling. The square root of the standard deviation for every column of our tibble with lipidomics data can again be computed using the sapply() function.
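A sketch of this Pareto-scaling 'trick' on a hypothetical placeholder tibble 'data', relying on the 'sds' element stored inside the trained step_scale() step (an internal detail that may change between recipes versions):

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# Autoscaling recipe: step_center() first, step_scale() second
recipe <- recipe(Label ~ ., data = data) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

recipe.prepared <- prep(recipe, training = data)

# Replacing the stored standard deviations with their square roots
recipe.prepared$steps[[2]]$sds <-
  sqrt(sapply(data %>% select(where(is.numeric)), sd))

# Executing the modified recipe performs Pareto scaling
data.pareto.scaled <- bake(recipe.prepared, new_data = data)
print(data.pareto.scaled)
```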
The output:

If we compare the code above to scaling via mutate_if(), the second method is significantly simpler.
Range scaling, as understood by recipes, is performed differently from how we presented it earlier. Therefore, we will only mention it briefly here. Recipes contain the step_range() function, which has two arguments, min and max, defining the bounds to which the data will be rescaled. By default, these values are set to 0 and 1.
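A brief sketch of step_range() on a hypothetical placeholder tibble 'data', rescaling every numeric column into the interval [0, 1]:

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# step_range() rescales each numeric column to lie between min and max
recipe <- recipe(Label ~ ., data = data) %>%
  step_range(all_numeric(), min = 0, max = 1)

# Preparing and executing the recipe
recipe.prepared <- prep(recipe, training = data)
data.range.scaled <- bake(recipe.prepared, new_data = data)
print(data.range.scaled)
```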
The classic Range scaling, as presented by Robert A. van den Berg et al. in their manuscript, could again be performed by substituting the sds values in the 'recipe.prepared' list with the difference between the maximum and minimum values, computed for each column:
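A sketch of the classic Range scaling on a hypothetical placeholder tibble 'data', again substituting the internally stored 'sds' values of the trained step_scale() step:

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# Autoscaling recipe: step_center() first, step_scale() second
recipe <- recipe(Label ~ ., data = data) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

recipe.prepared <- prep(recipe, training = data)

# Replacing the stored standard deviations with each column's range (max - min)
recipe.prepared$steps[[2]]$sds <-
  sapply(data %>% select(where(is.numeric)), function(x) max(x) - min(x))

# Executing the modified recipe performs the classic Range scaling
data.classic.range.scaled <- bake(recipe.prepared, new_data = data)
print(data.classic.range.scaled)
```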
The final output:

We can perform Level scaling using the same 'trick':
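A sketch of Level scaling on a hypothetical placeholder tibble 'data': the internally stored 'sds' values are replaced by the column means, so the centered values are divided by the mean:

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# Autoscaling recipe: step_center() first, step_scale() second
recipe <- recipe(Label ~ ., data = data) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

recipe.prepared <- prep(recipe, training = data)

# Replacing the stored standard deviations with the column means
recipe.prepared$steps[[2]]$sds <-
  sapply(data %>% select(where(is.numeric)), mean)

# Executing the modified recipe performs Level scaling
data.level.scaled <- bake(recipe.prepared, new_data = data)
print(data.level.scaled)
```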
The output:

And finally, the Vast scaling:
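A sketch of Vast scaling on a hypothetical placeholder tibble 'data': since Vast scaling divides the centered values by sd^2/mean (equivalent to Autoscaling multiplied by the coefficient mean/sd), the stored 'sds' values are replaced accordingly:

```r
# Calling tidymodels library
library(tidymodels)

# Hypothetical lipidomics-style data set (a placeholder for your own tibble)
data <- tibble(
  Label = c("A", "A", "B", "B"),
  LPC.16.0 = c(120.5, 130.2, 98.7, 110.1),
  TG.54.2 = c(560.3, 610.8, 480.2, 505.9)
)

# Autoscaling recipe: step_center() first, step_scale() second
recipe <- recipe(Label ~ ., data = data) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

recipe.prepared <- prep(recipe, training = data)

# Replacing the stored standard deviations with sd^2 / mean per column
num <- data %>% select(where(is.numeric))
recipe.prepared$steps[[2]]$sds <-
  sapply(num, sd)^2 / sapply(num, mean)

# Executing the modified recipe performs Vast scaling
data.vast.scaled <- bake(recipe.prepared, new_data = data)
print(data.vast.scaled)
```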
The output:
