Basic data imputation in R with dplyr and tidyr (tidyverse)
A part of missing value – data imputation section
Last updated
The chapter shows basic data imputation in R, including imputation by a selected constant in the entire data set, column-wise imputation by a value close to LOQ, and column-wise imputation by a minimum column value, mean, or median.
The imputation by the percentage of the lowest concentration in a column and by zero value can be used for MNAR values, while mean and median value imputations may be applied to the imputation of missing clinical data or MCAR (MAR).
Besides the mutate() function and its variant mutate_if(), we need two functions from the tidyr library: drop_na() and replace_na(). The tidyr package also offers the fill() function, but we will not use it in this case. Both libraries (dplyr, tidyr) belong to the tidyverse collection.
First, we will call the tidyverse collection. We assume that you have already completed the installation as explained in the previous chapters; otherwise, it is necessary to install the package first.
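A minimal sketch of this step follows. The tibble `data` is a hypothetical toy example (made-up sample names and concentrations) used only for illustration; it is not the data set of this chapter.

```r
# Attach the tidyverse collection (includes dplyr and tidyr)
library(tidyverse)

# Hypothetical toy data set: every row has at least one missing value
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# Count the missing entries in each column
na.per.column <- colSums(is.na(data))
na.per.column
```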
As you probably realize, the drop_na() function is used to drop rows containing missing entries. Although simple, it is not a good solution for most metabolomics and lipidomics data sets, as it can discard valuable measurements. Therefore, we do not advise this option. Its application in R is very simple:
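As a sketch, assuming a hypothetical toy tibble `data` in which every row has at least one NA, the call could look like this:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data set: every row has at least one missing value
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# Drop every row that contains at least one NA
data.without.NA <- data %>%
  drop_na()

# Number of rows that survive the filtering
nrow(data.without.NA)
```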
In the case of our example data set, drop_na() deletes all rows because every row contains missing values, so the final tibble data.without.NA contains no data. Therefore, we cannot use this method. We need a different approach, and in most cases, we will use the replace_na() function. Its construction is very simple and can be presented in the following way:
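The general form is replace_na(data, replace). The sketch below shows it on a vector and on a data frame; the object and column names are hypothetical.

```r
library(tidyr)

# For a vector, `replace` is a single value
x <- c(1.2, NA, 3.4)
x.filled <- replace_na(x, 0)

# For a data frame, `replace` is a named list with one value per column
df <- data.frame(lipid_A = c(1.2, NA), lipid_B = c(NA, 0.5))
df.filled <- replace_na(df, list(lipid_A = 0, lipid_B = 0))
```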
We begin with imputing missing values by a constant as the simplest example. First, we show how to substitute missing entries in one column, and next, across all columns. In this case, we rely on mutate() and replace_na(). See the example below:
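A minimal sketch for a single column, assuming a hypothetical toy tibble `data` with a column named lipid_A:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data set
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# Replace NAs with 0 in one column only; lipid_B is left untouched
data.imputed <- data %>%
  mutate(lipid_A = replace_na(lipid_A, 0))
```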
We combine mutate() with replace_na() from the tidyr library to replace empty entries. The output of this function is presented below:
Remember: The code block above is nothing more than an application of the regular mutate() function. This function is handy for preprocessing metabolomics/lipidomics data because they are handled as data frames (tibbles) in R. Using mutate(), we can work with the columns of data frames (tibbles) in a simple way. Through mutate(), we can, for example, impute missing entries, transform values, delete/create columns in our metabolomics/lipidomics data sets, and perform many other operations.
However, substituting missing entries column by column would be time-consuming and inefficient. Using across() and where(), all missing entries in numeric columns can be substituted by 0 at once. See the next code block:
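A sketch of the across() and where() combination, again on a hypothetical toy tibble `data`:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data set
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# Replace NAs with 0 in every numeric column at once;
# the character column `sample` is skipped by where(is.numeric)
data.imputed <- data %>%
  mutate(across(where(is.numeric), ~ replace_na(., 0)))
```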
NOTE: Here, an important new shorthand appears, called tilde-dot: the '~' and '.' symbols. They stand in for a longer, explicit function definition such as function(x) replace_na(x, 0).
The '~' takes the place of function(...), and any of the symbols '.', '.x', or '..1' refers to the function's first argument, which here is each column fulfilling the condition is.numeric.
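To make the equivalence concrete, the sketch below writes the same imputation once with an explicit function and once with the tilde-dot shorthand. The toy tibble and the object names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data set
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# Long form: an explicit anonymous function
imputed.long <- data %>%
  mutate(across(where(is.numeric), function(x) replace_na(x, 0)))

# Short form: tilde-dot shorthand for the same function
imputed.short <- data %>%
  mutate(across(where(is.numeric), ~ replace_na(., 0)))

# Both versions produce the same tibble
all.equal(imputed.long, imputed.short)
```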
This shorthand is often used for anonymous functions, which are functions created in the arguments of other functions. These are also known as lambdas. An anonymous function is a function definition that is not bound to a name. An example could be:
This anonymous function adds 2 to every x from the vector c(2,4). In effect, we obtain:
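One way to write such an example in base R is with sapply(), which is an assumption about the form of the call rather than the exact listing:

```r
# The anonymous function function(x) x + 2 is defined inside the sapply() call
# and applied to each element of the vector c(2, 4)
result <- sapply(c(2, 4), function(x) x + 2)
result
# [1] 4 6
```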
We could also substitute NAs with 0 using the following lines of code:
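Two possible alternatives are sketched below. These are illustrations chosen here, not necessarily the variants meant above: replace_na() applied to the whole tibble with a named list, and coalesce() from dplyr. The toy tibble is hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data set
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# Alternative 1: replace_na() on the whole tibble, one value per named column
imputed.1 <- data %>%
  replace_na(list(lipid_A = 0, lipid_B = 0))

# Alternative 2: coalesce() returns the first non-missing value,
# so every NA falls back to 0
imputed.2 <- data %>%
  mutate(across(where(is.numeric), ~ coalesce(., 0)))
```

The first alternative requires naming every column, so it suits small tables; the second scales to many columns.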
Moreover, to simplify our code further, instead of mutate(), we can use mutate_if(). The mutate_if() function works with predicate functions (those that return TRUE or FALSE, e.g., is.numeric). Using mutate_if(), we can select all numeric columns without across() and where(). The code is shorter and easier to read:
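A sketch of the mutate_if() variant on a hypothetical toy tibble:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data set
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# The predicate is.numeric selects the columns;
# the lambda is applied to each selected column
data.imputed <- data %>%
  mutate_if(is.numeric, ~ replace_na(., 0))
```

Note that recent dplyr releases mark mutate_if() as superseded by across(); it still works, but across() is the form recommended for new code.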
All the examples above lead to the same result:
Of course, any constant value other than 0 could also be used in all these cases.
Replacing missing values with 0 can significantly distort the data structure by shifting the mean and median. Generally, in omics data analysis, substituting many missing entries with a constant can harm the data structure, so this approach should be used only if a few missing entries occur in the whole data set. A good example is when NAs are present mainly in single columns corresponding to low-abundant lipids or metabolites, where the concentration possibly could not be determined because it fell below the limit of quantitation (LOQ) or even the limit of detection. In such a case, assuming that the concentration of a feature in some of the samples is <LOQ, we could, for example, use 80% of the lowest concentration in every column to replace missing values. Although this approach can introduce additional bias to the data (we use observed values to impute the unobserved ones), it is often applied to metabolomics and lipidomics data.
We will still use a constant value, but it will be specific to every feature and therefore different for every column. We need to modify our code for this imputation. See the code block below:
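A sketch of the column-wise imputation by 80% of the lowest observed value, on a hypothetical toy tibble; the factor 0.8 follows the example in the text.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data set
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# For each numeric column, replace NAs with 80% of that column's minimum;
# na.rm = TRUE makes min() ignore the missing entries
data.imputed <- data %>%
  mutate(across(where(is.numeric),
                ~ replace_na(., 0.8 * min(., na.rm = TRUE))))
```

With this toy data, the NAs in lipid_A become 0.8 × 1.2 = 0.96 and those in lipid_B become 0.8 × 0.5 = 0.4.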
Note!
Again, we highlight the importance of setting na.rm to TRUE. In the example above, it concerns the min() base R function. Without it, min() returns NA for any column that contains missing values, and the missing values will not be imputed!
These lines of code produce the following output:
If we would like to apply substitution by a minimum column value, a simple modification of the code is necessary:
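A sketch of this modification on a hypothetical toy tibble; the only change from the 80% variant is that the 0.8 multiplier is dropped.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data set
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# For each numeric column, replace NAs with that column's minimum
data.imputed <- data %>%
  mutate(across(where(is.numeric),
                ~ replace_na(., min(., na.rm = TRUE))))
```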
Note!
Again, we highlight the importance of setting na.rm to TRUE. In the example above, it concerns the min() base R function. Without it, min() returns NA for any column that contains missing values, and the missing values will not be imputed!
And the output:
For missing clinical entries or MCAR (MAR) values among numerous lipids, imputing with the median or mean could be a better approach, as it is less likely to disrupt the data structure significantly. The column means (or medians) remain relatively unchanged, as opposed to when missing values are replaced with zeros or a percentage of the lowest concentration. However, it is always important to investigate the cause of missing data and, if possible, remeasure or reprocess the data to prevent the risk of "creating data."
Please see the code block below for replacing missing entries in columns by the mean and median values, respectively:
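A sketch of both variants on a hypothetical toy tibble; the object names data.mean and data.median are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data set
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  lipid_A = c(1.2, NA, 3.4, NA, 2.0),
  lipid_B = c(NA, 0.5, NA, 0.9, NA)
)

# Replace NAs with the column mean
data.mean <- data %>%
  mutate(across(where(is.numeric),
                ~ replace_na(., mean(., na.rm = TRUE))))

# Replace NAs with the column median
data.median <- data %>%
  mutate(across(where(is.numeric),
                ~ replace_na(., median(., na.rm = TRUE))))
```

In this toy data, the observed lipid_A values are 1.2, 3.4, and 2.0, so the mean-imputed NAs become 2.2 while the median-imputed ones become 2.0.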
Note!
Again, we highlight the importance of setting na.rm to TRUE. In the example above, it concerns the mean() and median() base R functions. Without it, both return NA for any column that contains missing values, and the missing values will not be imputed!
The outputs from these lines of code:
All code blocks gathered into one script can be downloaded here: