Data imputation using the recipes package (tidymodels)
A part of missing value – data imputation section
The tidymodels collection provides the recipes package, which contains tools for data preprocessing, including data imputation. As part of a collection dedicated to machine learning, this package is usually applied while building models. However, it can also serve as an alternative to dplyr and tidyr for preprocessing data (see subchapter Basic data imputation in R with dplyr and tidyr). Here, we will briefly introduce recipes to you.
The principle is relatively simple:
1. A recipe is generated.
2. You define the pre-processing workflow within the recipe you created.
3. The recipe is prepared to be applied for pre-processing (computing all parameters).
4. The recipe is 'baked', i.e. used to perform pre-processing with all parameters computed in step 3.
Let's translate these steps into R code. This time we will need the tidymodels collection:
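The original code block is not reproduced here, so the following is only a minimal sketch of loading the collection. It also builds a small, hypothetical stand-in for the 'data.missing' tibble: the column names follow the text, but every value is invented for illustration.

```r
# Load the tidymodels collection (attaches recipes, dplyr, tibble, tidyr and others)
library(tidymodels)

# Hypothetical toy stand-in for 'data.missing'; all values are invented
data.missing <- tibble(
  Label     = factor(c("control", "control", "patient", "patient")),
  `CE 16:1` = c(1, NA, 3, 5),
  `CE 16:0` = c(2, 4, NA, 6)
)

# Count the missing values in every column
colSums(is.na(data.missing))
```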
To build a recipe, we need to define the response variable(s), the predictors, and the data set that is the source of this information (the recipe needs the names and types of the variables). The response variable is the column in our tibble that defines the biological group to which patients were assigned based on the diagnosis. Predictors (the variables used for predicting the response) are, in our case, all lipid concentrations that we measured. To produce a recipe, we write the following line of code:
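As a sketch of what such a line could look like, assuming 'Label' is the response column and the tibble is called 'data.missing' (the toy values below are hypothetical):

```r
library(tidymodels)

# Hypothetical toy stand-in for 'data.missing'; all values are invented
data.missing <- tibble(
  Label     = factor(c("control", "control", "patient", "patient")),
  `CE 16:1` = c(1, NA, 3, 5),
  `CE 16:0` = c(2, 4, NA, 6)
)

# 'Label' is the response; the dot declares all remaining columns as predictors
recipe <- recipe(Label ~ ., data = data.missing)

# Inspect the variable names, types and roles registered in the recipe
summary(recipe)
```

Storing the result under the name 'recipe' follows the text; R still finds the recipe() function afterwards, because function calls skip non-function objects of the same name.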
NOTE: Here, you see another application of the '~' (tilde) symbol in R. It separates the left side (response variable(s)) from the right side, which usually defines predictors (lipids or metabolites) while building statistical models or performing hypothesis testing. For selected predictors, we use (Label ~ CE 16:1 + CE 16:0 + CE 18:2 + ...). If we use the dot instead of the names of selected lipids or metabolites (Label ~ .), we indicate that predictors are all remaining variables.
If executing the line of code above creates a list named 'recipe' in your global environment, we can move to the next step. Now, we need to define what preprocessing should be performed using our recipe:
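A minimal sketch of this step, assuming mean imputation is the desired preprocessing. step_impute_mean() and all_numeric_predictors() are recipes functions; the toy data are hypothetical.

```r
library(tidymodels)

# Hypothetical toy stand-in for 'data.missing'; all values are invented
data.missing <- tibble(
  Label     = factor(c("control", "control", "patient", "patient")),
  `CE 16:1` = c(1, NA, 3, 5),
  `CE 16:0` = c(2, 4, NA, 6)
)

recipe <- recipe(Label ~ ., data = data.missing)

# Register mean imputation for all numeric predictors.
# Nothing is computed yet; the step is only added to the recipe.
recipe <- recipe %>%
  step_impute_mean(all_numeric_predictors())

recipe
```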
Once all pre-processing steps are defined, we need to 'equip' our recipe with all the parameters it needs for our pre-processing. For example, for data imputation by mean, our recipe will need the mean of every numeric column in the 'data.missing' tibble. This is exactly what happens at this step. Using the data supplied through the 'training' argument, the recipe computes all necessary parameters so that they can be used in the next step. Here is the necessary line of code:
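The preparation could be sketched as follows. The object name 'recipe.prepped' is hypothetical, as are the toy values.

```r
library(tidymodels)

# Hypothetical toy stand-in for 'data.missing'; all values are invented
data.missing <- tibble(
  Label     = factor(c("control", "control", "patient", "patient")),
  `CE 16:1` = c(1, NA, 3, 5),
  `CE 16:0` = c(2, 4, NA, 6)
)

recipe <- recipe(Label ~ ., data = data.missing) %>%
  step_impute_mean(all_numeric_predictors())

# Estimate the column means from the data passed as 'training'
recipe.prepped <- prep(recipe, training = data.missing)

# Show the parameters estimated for the first (and only) step
tidy(recipe.prepped, number = 1)
```

For this toy data, the stored means are 3 for CE 16:1 and 4 for CE 16:0, because missing values are ignored when the means are computed.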
Now that our recipe is prepared, we can bake it, i.e. apply it to our data set:
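A minimal sketch of baking, assuming the prepared recipe is stored in a hypothetical object 'recipe.prepped' and the result in a hypothetical object 'data.imputed' (toy values are invented):

```r
library(tidymodels)

# Hypothetical toy stand-in for 'data.missing'; all values are invented
data.missing <- tibble(
  Label     = factor(c("control", "control", "patient", "patient")),
  `CE 16:1` = c(1, NA, 3, 5),
  `CE 16:0` = c(2, 4, NA, 6)
)

recipe.prepped <- recipe(Label ~ ., data = data.missing) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  prep(training = data.missing)

# Apply the prepared recipe: every NA is replaced by the stored column mean
data.imputed <- bake(recipe.prepped, new_data = data.missing)

data.imputed
```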
The bake() function has a new_data argument. When a model is built, the data are usually split into train and test sets. These sets can be pre-processed independently using the same recipe, as you can supply them separately to the bake() function through the new_data argument.
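To illustrate, here is a sketch with two hypothetical tibbles, 'data.train' and 'data.test' (names and values are invented). The recipe is prepared on the train set only, so the test set is imputed with the means estimated from the train set.

```r
library(tidymodels)

groups <- c("control", "patient")

# Hypothetical train and test sets; all values are invented
data.train <- tibble(
  Label     = factor(c("control", "control", "patient", "patient"), levels = groups),
  `CE 16:1` = c(1, NA, 3, 5)
)
data.test <- tibble(
  Label     = factor(c("control", "patient"), levels = groups),
  `CE 16:1` = c(NA, 10)
)

# Prepare the recipe on the train set only
recipe.prepped <- recipe(Label ~ ., data = data.train) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  prep(training = data.train)

# Bake each set separately with the same prepared recipe
train.imputed <- bake(recipe.prepped, new_data = data.train)
test.imputed  <- bake(recipe.prepped, new_data = data.test)

# The NA in the test set receives the train mean (3), not the test mean (10)
test.imputed
```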
The R console after executing the code from above:
We obtain the same output as through mutate_if() and replace_na():
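The comparison can be checked with a short sketch on hypothetical toy data (values and object names are invented). The columns are compared one by one, as the column order of the two tibbles may differ.

```r
library(tidymodels)

# Hypothetical toy stand-in for 'data.missing'; all values are invented
data.missing <- tibble(
  Label     = factor(c("control", "control", "patient", "patient")),
  `CE 16:1` = c(1, NA, 3, 5),
  `CE 16:0` = c(2, 4, NA, 6)
)

# dplyr/tidyr route: replace each NA by the mean of its column
data.dplyr <- data.missing %>%
  mutate_if(is.numeric, ~ replace_na(.x, mean(.x, na.rm = TRUE)))

# recipes route: define, prepare and bake in one pipeline
data.recipes <- recipe(Label ~ ., data = data.missing) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  prep(training = data.missing) %>%
  bake(new_data = data.missing)

# Compare the imputed lipid columns of both results
same.161 <- all.equal(data.dplyr[["CE 16:1"]], data.recipes[["CE 16:1"]])
same.160 <- all.equal(data.dplyr[["CE 16:0"]], data.recipes[["CE 16:0"]])
c(same.161, same.160)
```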