Data transformation and scaling using the recipes R package

A part of data transformation & normalization – using available R packages

In the field of machine learning, the term "recipe" commonly denotes a series of procedures or transformations applied to a data set prior to model training. This critical step, not only in OMICs analysis, ensures that our data are well prepared, allowing the model to discern patterns and generate precise predictions. Analogous to the meticulous crafting of a culinary masterpiece, the 'recipes' package in R (https://recipes.tidymodels.org/) is a versatile tool for orchestrating these preparatory data transformations. Just as a chef carefully selects and combines ingredients to achieve a delectable dish, data scientists employ the 'recipes' package to methodically craft a tailored data preprocessing plan, enhancing the effectiveness of subsequent modeling endeavors.

Besides imputation, the recipes R package also offers data transformation and scaling functions for metabolomics and lipidomics data preprocessing in machine learning. Here, we will show how to transform the entire data set and then scale it via recipes, starting with the log and square-root transformations. As with data imputation, we need to repeat four steps to transform our metabolomics/lipidomics data:

  • generate a recipe,

  • define steps for data pre-processing,

  • prepare the recipe (calculate all values necessary for pre-processing),

  • and bake it (execute it for the data set).

We need to load the tidymodels library before starting:

# Calling tidymodels library
library(tidymodels)

Data transformation with recipes

# Data transformation via recipes
# 1. Log transformation
# Generating recipe
recipe <- recipe(Label ~ ., data = data)

# Defining pre-processing steps
recipe <- 
  recipe %>% 
  step_log(all_numeric()) # log() transformation, base = exp(1)

# Computing all necessary values for preprocessing:
recipe.prepared <- prep(recipe, training = data)

# Executing recipe:
data.log.transformed <- bake(recipe.prepared, new_data = data)

print(data.log.transformed)

The output:

Data transformation with the natural logarithm via recipes (tidymodels).

The step_log() function has a base argument, so we can also perform log10, log2, or log5 transformations.
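For example, a log10 transformation could be sketched as follows (assuming the same data set and Label outcome as above; the object names are illustrative):

# Log10 transformation via the base argument
recipe <- 
  recipe(Label ~ ., data = data) %>% 
  step_log(all_numeric(), base = 10) # use base = 2 or base = 5 for log2/log5

recipe.prepared <- prep(recipe, training = data)
data.log10.transformed <- bake(recipe.prepared, new_data = data)

print(data.log10.transformed)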

A simple change in this code allows for the square-root transformation.
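A possible sketch, replacing step_log() with step_sqrt() and keeping the same assumptions as above:

# 2. Square root transformation
recipe <- 
  recipe(Label ~ ., data = data) %>% 
  step_sqrt(all_numeric()) # sqrt() transformation

recipe.prepared <- prep(recipe, training = data)
data.sqrt.transformed <- bake(recipe.prepared, new_data = data)

print(data.sqrt.transformed)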

And the output:

Data transformation with the square root via recipes (tidymodels).

Data centering with recipes

Following the same principles as imputation and transformation, the data can also be centered using recipes.
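One possible sketch, assuming the same data set and using step_center() to subtract the column means:

# Data centering via recipes
recipe <- 
  recipe(Label ~ ., data = data) %>% 
  step_center(all_numeric()) # mean-centering of all numeric columns

recipe.prepared <- prep(recipe, training = data)
data.centered <- bake(recipe.prepared, new_data = data)

print(data.centered)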

And the centered data:

Data centered via recipes (tidymodels).

Data scaling with recipes

Our data can be Auto-, Pareto-, Range-, Level-, and Vast-scaled via recipes. However, except for Autoscaling, the recipe-based approach to data scaling is more complicated than using the mutate_if() function. Here, we will explain scaling via the recipes library, starting with Autoscaling, which is easily performed via the step_scale() function.
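A minimal sketch, combining step_log(), step_center(), and step_scale() in one recipe (same assumptions about the data set and Label outcome as above):

# Autoscaling: log transformation, centering, and scaling to unit variance
recipe <- 
  recipe(Label ~ ., data = data) %>% 
  step_log(all_numeric()) %>%    # log transformation
  step_center(all_numeric()) %>% # mean-centering
  step_scale(all_numeric())      # division by the standard deviation

recipe.prepared <- prep(recipe, training = data)
data.autoscaled <- bake(recipe.prepared, new_data = data)

print(data.autoscaled)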

The output:

Data log-transformed, centered, and scaled via recipes (tidymodels).

Unfortunately, recipes does not allow direct Pareto scaling of the numeric values, e.g., via a step_pareto() function or by modifying the arguments of step_scale(). The same applies to Range, Level, and Vast scaling. However, if you inspect the object recipe.prepared, you will see that it is a list of lists containing all the information required for the data preprocessing. By accessing elements of these lists, we can reach the vector of standard deviations used to scale the data (for Autoscaling).
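Assuming the prepared recipe from the Autoscaling example above (step_log(), step_center(), step_scale()), the standard deviations stored by step_scale() (the third step) can be reached, for example, like this:

# Accessing the standard deviations stored by step_scale()
# (the third step of the prepared recipe from above)
recipe.prepared$steps[[3]]$sds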

Analyzing the content of the 'recipe.prepared' list. Standard deviations that our recipe will use for data scaling.

If we replace this vector with a vector containing the square roots of the standard deviations, we can perform Pareto scaling. The square root of the standard deviation for every column of our tibble with lipidomics data can again be computed using the sapply() function.
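A possible sketch: we build a log10 / centering / scaling recipe, prepare it, and then overwrite the stored standard deviations with their square roots (equivalent to computing the square roots of the per-column standard deviations with sapply()); the step index (3) and object names are assumptions based on the recipe defined here:

# Pareto scaling via a modified recipe
# 1. Recipe: log10 transformation, centering, and scaling
recipe <- 
  recipe(Label ~ ., data = data) %>% 
  step_log(all_numeric(), base = 10) %>% 
  step_center(all_numeric()) %>% 
  step_scale(all_numeric())

recipe.prepared <- prep(recipe, training = data)

# 2. Substituting the stored standard deviations with their square roots
recipe.prepared$steps[[3]]$sds <- sqrt(recipe.prepared$steps[[3]]$sds)

# 3. Baking the modified recipe
data.pareto.scaled <- bake(recipe.prepared, new_data = data)

print(data.pareto.scaled)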

The output:

Data log10-transformed, centered, and Pareto-scaled via recipes (tidymodels).

If we compare the code above with scaling via mutate_if(), the mutate_if() approach is significantly simpler.

Range scaling, as understood by recipes, is performed differently from the approach we presented earlier; therefore, we will only mention it briefly here. The package provides the step_range() function, which has two arguments, min and max, defining the target range to which the data are rescaled. By default, these are set to 0 and 1.
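For illustration, a minimal step_range() sketch could look like this (object names are illustrative):

# Rescaling all numeric columns to the default [0, 1] interval
recipe.range <- 
  recipe(Label ~ ., data = data) %>% 
  step_range(all_numeric(), min = 0, max = 1)

data.range01 <- bake(prep(recipe.range, training = data), new_data = data)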

The classic Range scaling, as presented by Robert A. van den Berg et al. in their manuscript, can be performed by substituting the sds values in the 'recipe.prepared' list with the difference between the maximum and minimum values computed for each column.
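A sketch of this approach, re-using the log10 / centering / scaling recipe from the Pareto example (the ranges are computed with sapply(); since centering does not change max - min, the log10-transformed values are sufficient):

# Range scaling via a modified recipe
# Re-preparing the log10 / centering / scaling recipe from above
recipe.prepared <- prep(recipe, training = data)

# max - min of the log10-transformed values for every scaled column
num.cols <- names(recipe.prepared$steps[[3]]$sds)
ranges <- sapply(data[num.cols], function(x) max(log10(x)) - min(log10(x)))

# Substituting the stored standard deviations with the ranges
recipe.prepared$steps[[3]]$sds <- ranges

# Baking the modified recipe
data.range.scaled <- bake(recipe.prepared, new_data = data)

print(data.range.scaled)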

The final output:

Data log10-transformed, centered, and Range-scaled via recipes (tidymodels).

We can perform Level scaling using the same 'trick'.
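A sketch using the same substitution, assuming that step_center() (the second step of the recipe above) stores the column means of the log10-transformed data in its means element:

# Level scaling via a modified recipe
recipe.prepared <- prep(recipe, training = data)

# Substituting the stored standard deviations with the column means
# stored by step_center() (means of the log10-transformed data)
recipe.prepared$steps[[3]]$sds <- recipe.prepared$steps[[2]]$means

data.level.scaled <- bake(recipe.prepared, new_data = data)

print(data.level.scaled)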

The output:

Data log10-transformed, centered, and Level-scaled via recipes (tidymodels).

And finally, we can apply the same 'trick' to Vast scaling.
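In Vast scaling, the centered values are divided by sd * (sd / mean), i.e., by sd^2 / mean. A sketch under the same assumptions as above:

# Vast scaling via a modified recipe
recipe.prepared <- prep(recipe, training = data)

sds   <- recipe.prepared$steps[[3]]$sds   # standard deviations (step_scale)
means <- recipe.prepared$steps[[2]]$means # column means (step_center)

# Substituting the stored standard deviations with sd^2 / mean
recipe.prepared$steps[[3]]$sds <- sds^2 / means

data.vast.scaled <- bake(recipe.prepared, new_data = data)

print(data.vast.scaled)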

The output:

Data log10-transformed, centered, and Vast-scaled via recipes (tidymodels).
