Basic data imputation in R with dplyr and tidyr (tidyverse)

A part of missing value – data imputation section

The chapter shows basic data imputation in R, including imputation by a selected constant in the entire data set, column-wise imputation by a value close to LOQ, and column-wise imputation by a minimum column value, mean, or median.

The imputation by the percentage of the lowest concentration in a column and by zero value can be used for MNAR values, while mean and median value imputations may be applied to the imputation of missing clinical data or MCAR (MAR).

Besides the mutate() function and its variant mutate_if(), we need two functions from the tidyr library (https://tidyr.tidyverse.org/): drop_na() and replace_na(). The tidyr package also offers the fill() function, but we will not use it here. Both libraries (dplyr, tidyr) belong to the tidyverse collection.

First, we will call the tidyverse collection. We assume that you have already completed the installation as explained in the previous chapters; otherwise, it is necessary to install the package first.

# Basic data imputation methods in R
# Calling tidyverse collection
library(tidyverse)

As you probably realize, the drop_na() function is used to drop rows containing missing entries. Although simple, it is not a good solution for most metabolomics and lipidomics data sets as it could result in losing precious results. Therefore, we do not advise this option. Its application in R is very simple:

# Dropping rows containing missing values
data.without.NA <-
  data.missing %>%
  drop_na()

# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# drop_na()
# Store the new data frame as 'data.without.NA' in the global environment.
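To see how aggressive drop_na() can be, here is a minimal sketch on a small made-up tibble. The object `toy` and its values are hypothetical and only mimic a data set in which every row has at least one missing entry; they are not taken from the example data set.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: each row contains at least one NA
toy <- tibble(
  sample    = c("S1", "S2", "S3"),
  `CE 16:1` = c(1.2, NA, 0.9),
  `CE 18:1` = c(NA, 3.4, NA)
)

# Number of rows that survive drop_na()
rows_left <- toy %>% drop_na() %>% nrow()
print(rows_left)

# Count the NAs per column before deciding how to handle them
na_counts <- toy %>% summarise(across(everything(), ~sum(is.na(.x))))
print(na_counts)
```

Counting NAs per column first is a cheap way to judge whether dropping rows is acceptable or whether imputation is the better route.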

In the case of our example data set, using drop_na(), we delete all rows as every row contains missing values, so the final tibble data.without.NA contains no data. Therefore, we cannot use this method. We will need to rely on a different approach, and in most cases, we will use the replace_na() function. Its construction is very simple and could be presented in the following way:

# The construction of replace_na():
# replace_na(data, replace)
#   data    - a column (vector) or a data frame containing NAs
#   replace - the value used to substitute NAs
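As a small illustration of the two ways replace_na() can be called, the sketch below uses a made-up vector and a made-up tibble (`toy`, with hypothetical columns `a` and `b`). For a vector, the replacement is a single value; for a data frame, it is a named list with one entry per column.

```r
library(tidyr)

# On a vector: 'replace' is a single value
vec_filled <- replace_na(c(1.5, NA, 3), 0)
print(vec_filled)

# On a data frame: 'replace' is a named list, one entry per column
toy <- tibble(a = c(1, NA), b = c(NA, "x"))
toy_filled <- replace_na(toy, list(a = 0, b = "unknown"))
print(toy_filled)
```

Inside mutate(), each column is a vector, so the vector form is the one used throughout this chapter.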

Data imputation using a selected constant

We begin with the simplest example: imputing missing values by a constant. First, we show how to substitute missing entries in one column and then across all columns. In this case, we rely on mutate() and replace_na(). See the example below:

# Data imputation by a constant:
# EXAMPLE 1a - Using '0' value to substitute NAs in one column - `CE 16:1`:
data.imputed.1a <-
  data.missing %>%
  mutate(`CE 16:1` = replace_na(`CE 16:1`, 0))

# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate() `CE 16:1` column and replace missing values with 0.
# Store the new data frame as 'data.imputed.1a' in the global environment.

print(data.imputed.1a)

We combine mutate() with replace_na() from the tidyr library to replace missing entries. The output of EXAMPLE 1a is presented below:

Remember: The code block above is nothing more than an application of the regular mutate() function. This function is handy for the preprocessing of metabolomics/lipidomics data because they are handled via data frames (tibbles) in R. Using mutate(), we can work with the columns of data frames (tibbles) in a simple way. Through mutate(), we can, for example, impute missing entries, transform values, delete or create columns in our metabolomics/lipidomics data sets, and perform many other operations.

However, substituting missing entries column-by-column would be time-consuming and inefficient. Using across() and where(), all missing entries can be substituted by 0 at once. See the next code block:

# EXAMPLE 1b: Using '0' value to substitute NAs in all numeric columns.
data.imputed.1b.v1 <-
  data.missing %>%
  mutate(across(where(is.numeric), ~replace_na(., 0)))

print(data.imputed.1b.v1)

# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate() across columns where is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with 0.
# Store the new data frame as 'data.imputed.1b.v1' in the global environment.

NOTE: Here, an important new shorthand appears, called tilde-dot: the '~' and '.' symbols. They stand in for this longer line of code:

# Without tilde-dot:
data.imputed.1b.v1 <-
  data.missing %>%
  mutate(across(where(is.numeric), function(.){replace_na(., 0)}))

'~' stands for function(...), and any of the symbols '.', '.x', or '..1' refers to the first argument of that function, which here is, in turn, each column for which is.numeric returns TRUE.

# With tilde-dot
data.imputed.1b.v1 <-
  data.missing %>%
  mutate(across(where(is.numeric), ~replace_na(., 0)))
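The equivalence of these notations can be checked directly. The sketch below is only an illustration on a made-up tibble (`toy` and the argument name `col` are hypothetical). It also includes the base R lambda shorthand `\(col)`, which assumes R 4.1.0 or newer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one character and two numeric columns
toy <- tibble(id = c("S1", "S2"), x = c(NA, 2), y = c(5, NA))

# Tilde-dot (formula) shorthand
with_formula <- toy %>%
  mutate(across(where(is.numeric), ~replace_na(.x, 0)))

# Explicit anonymous function
with_function <- toy %>%
  mutate(across(where(is.numeric), function(col) replace_na(col, 0)))

# Base R lambda shorthand (assumes R >= 4.1.0)
with_backslash <- toy %>%
  mutate(across(where(is.numeric), \(col) replace_na(col, 0)))

# Compare the three results
print(all.equal(with_formula, with_function))
print(all.equal(with_function, with_backslash))
```

Which notation to use is largely a matter of taste; the explicit forms make the argument name visible, which some readers find easier to follow.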

This shorthand is often used for anonymous functions (also known as lambdas), which are typically created directly in the arguments of other functions. An anonymous function is a function definition that is not bound to a name. An example could be:

# Anonymous function example:
(function(x) {x+2})(x = c(2,4))

This anonymous function adds 2 to every x from the vector c(2,4). In effect, we obtain:

> # Anonymous function example:
> (function(x) {x+2})(x = c(2,4))
[1] 4 6

We could also substitute NAs with 0 using the following lines of code:

# Data imputation with mutate() and different shorthands:
data.imputed.1b.v2 <-
  data.missing %>%
  mutate(across(where(is.numeric), ~replace_na(.x, 0)))

print(data.imputed.1b.v2)

# or
data.imputed.1b.v3 <-
  data.missing %>%
  mutate(across(where(is.numeric), ~replace_na(..1, 0)))

print(data.imputed.1b.v3)

Moreover, to simplify our code further, instead of mutate(), we can use mutate_if(). The mutate_if() function works with predicate functions (those that return TRUE or FALSE, e.g., is.numeric). Using mutate_if(), we can select all numeric columns without across() and where(), so the code is shorter and easier to read. Note that recent dplyr versions mark mutate_if() as superseded by across(); it still works, but across() is the recommended form for new code:

# Using mutate_if() for the data imputation:
data.imputed.1b.v4 <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., 0))

print(data.imputed.1b.v4)

All these examples from above lead to only one result:

Of course, any constant value other than 0 could also be used in all these cases.
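As a quick sanity check, the sketch below compares mutate_if() with the across()/where() form on a made-up tibble, using a constant other than 0. The object `toy` and the value of `fill_value` are hypothetical and chosen only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
toy <- tibble(id = c("S1", "S2", "S3"), x = c(NA, 2, 4), y = c(5, NA, 1))

# Hypothetical constant used for the imputation
fill_value <- 0.01

via_if <- toy %>%
  mutate_if(is.numeric, ~replace_na(., fill_value))

via_across <- toy %>%
  mutate(across(where(is.numeric), ~replace_na(.x, fill_value)))

# Both approaches should give the same tibble, without any NAs left
print(all.equal(via_if, via_across))
print(anyNA(via_across))
```

In both versions, the character column `id` is left untouched because is.numeric returns FALSE for it.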

Data imputation by a constant for MNAR values (percentage of lowest concentration in a column)

Replacing missing values with 0 can significantly influence data architecture by shifting the mean and median. Generally, in OMICs data analysis, substituting many missing entries with a constant can harm data architecture, so this approach should be used only if a few missing entries occur in the whole data set. A good example is the situation when NAs are present mainly in single columns corresponding to low-abundant lipids or metabolites, where the concentration possibly could not be determined because it fell below the limit of quantitation (LOQ) or even the limit of detection (LOD). In such a case, assuming that the concentration of a feature in some of the samples is <LOQ, we could, for example, use 80% of the lowest concentration in every column to replace missing values. Although this approach can introduce additional bias to the data (we use observed values to impute the unobserved ones), it is often applied to metabolomics and lipidomics data.
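To make the arithmetic concrete, the following sketch computes the value that would be used for each column before any imputation takes place. The tibble `toy`, its values, and the name `fraction` are hypothetical; 0.8 corresponds to the 80% mentioned above.

```r
library(dplyr)

# Hypothetical toy data for two low-abundant lipids
toy <- tibble(
  `CE 16:1` = c(1.0, NA, 0.5, 2.0),
  `CE 18:1` = c(10, 8, NA, NA)
)

# Fraction of the column minimum used as the replacement value
fraction <- 0.8

# One replacement value per column: 0.8 * smallest observed concentration
fill_values <- toy %>%
  summarise(across(where(is.numeric), ~fraction * min(.x, na.rm = TRUE)))

print(fill_values)
```

For this toy data, the replacement values are 0.4 for `CE 16:1` (0.8 × 0.5) and 6.4 for `CE 18:1` (0.8 × 8). Inspecting them first is a simple way to confirm that they look plausible for each feature.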

We will still use a constant value, but it will be specific to every feature and therefore different for every column. We need to modify our code for this imputation. See the code block below:

# EXAMPLE 2a
# Using a percentage of the lowest concentration in every column to replace NAs.
# Applicable if the NAs occur predominantly within single columns with low-abundant lipids/metabolites.
# In such a case, it is possible that the concentration was <LOQ and...
# ...we can assume replacing NAs with 80% of the lowest concentration is reasonable.
data.imputed.2a <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., 0.8*min(., na.rm = TRUE)))
  
# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate_if(). If is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with 0.8*minimum_concentration_in_every_column...
# ...and for these computations, skip missing values: na.rm MUST BE set to TRUE (na.rm = TRUE).
# This last argument is IMPORTANT, as the default setting is FALSE.
# With na.rm = FALSE, min() returns NA for any column containing NAs,
# so the NAs would simply be replaced by NA and nothing would be imputed!
# Store the new data frame as 'data.imputed.2a' in the global environment.

print(data.imputed.2a)
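The effect of na.rm can be seen on a single made-up vector. The sketch below is only an illustration; `x` and `all_missing` are hypothetical objects. It also shows one caveat: for a column that consists only of NAs, min() with na.rm = TRUE returns Inf (with a warning), so such columns are better checked or removed before imputation.

```r
library(tidyr)

# Hypothetical concentrations with one missing value
x <- c(0.5, NA, 2.0)

min(x)                 # NA: the missing value propagates
min(x, na.rm = TRUE)   # 0.5

# Without na.rm = TRUE, the replacement value itself is NA
not_imputed <- replace_na(x, 0.8 * min(x))
print(not_imputed)     # the NA is still there

# With na.rm = TRUE, the NA is replaced by 0.8 * 0.5
imputed <- replace_na(x, 0.8 * min(x, na.rm = TRUE))
print(imputed)

# Caveat: a column made only of NAs has no minimum
all_missing <- c(NA_real_, NA_real_)
empty_min <- suppressWarnings(min(all_missing, na.rm = TRUE))
print(empty_min)       # Inf
```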

Note!

Again, we highlight the importance of setting na.rm to TRUE. In the example above, this concerns the base R function min(): with the default na.rm = FALSE, min() returns NA for every column containing missing values, so the missing values will not be imputed!

EXAMPLE 2a produces the following output:

To substitute missing values with the minimum column value itself, only a small modification of the code is needed:

# EXAMPLE 2b
# Using the lowest concentration in every column to replace NAs.
# Applicable if the NAs occur predominantly within single columns with low-abundant lipids/metabolites.
data.imputed.2b <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., min(., na.rm = TRUE)))
  
# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate_if(). If is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with the minimum concentration of every column...
# ...and for these computations, skip missing values: na.rm MUST BE set to TRUE (na.rm = T).
# Store the new data frame as 'data.imputed.2b' in the global environment.

print(data.imputed.2b)

Note!

As in EXAMPLE 2a, na.rm must be set to TRUE in the min() base R function. Otherwise, the missing values will not be imputed!

The output of EXAMPLE 2b:

Data imputation using mean or median value

For missing clinical entries or MCAR (MAR) values among numerous lipids, imputing with the median or mean could be a better approach, as it is less likely to disrupt the data structure significantly. It keeps the column means (or medians) relatively unchanged, in contrast to replacement with zeros or a percentage of the lowest concentration. However, it is always important to investigate the cause of missing data and, if possible, remeasure or reprocess the data to prevent the risk of "creating data."

Please see the code block below for replacing missing entries in columns by the mean and median values, respectively:

# EXAMPLE 3a
# Using the mean value in every column to replace NAs.
data.imputed.3a <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., mean(., na.rm = TRUE)))
  
# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate_if(). If is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with the mean concentration of every column...
# ...and for these computations, skip missing values: na.rm MUST BE set to TRUE (na.rm = T).
# Store the new data frame as 'data.imputed.3a' in the global environment.

print(data.imputed.3a)

# EXAMPLE 3b
# Using the median value in every column to replace NAs.
data.imputed.3b <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., median(., na.rm = TRUE)))
  
# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate_if(). If is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with the median concentration of every column...
# ...and for these computations, skip missing values: na.rm MUST BE set to TRUE (na.rm = T).
# Store the new data frame as 'data.imputed.3b' in the global environment.

print(data.imputed.3b)
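A short sketch on a made-up vector illustrates what mean and median imputation do to simple summary statistics. The vector `x` is hypothetical and deliberately skewed so that its mean and median differ. Mean imputation leaves the mean unchanged, but it reduces the spread of the data, which is one reason to use it with care.

```r
library(tidyr)

# Hypothetical, skewed concentrations with two missing values
x <- c(1, 2, NA, 3, NA, 14)

x_mean   <- replace_na(x, mean(x, na.rm = TRUE))    # NAs become 5
x_median <- replace_na(x, median(x, na.rm = TRUE))  # NAs become 2.5

# The mean is preserved by mean imputation
print(c(before = mean(x, na.rm = TRUE), after = mean(x_mean)))

# The median is preserved by median imputation
print(c(before = median(x, na.rm = TRUE), after = median(x_median)))

# The standard deviation shrinks after imputation
print(c(before = sd(x, na.rm = TRUE), after = sd(x_mean)))
```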

Note!

Again, we highlight the importance of setting na.rm to TRUE. In the examples above, this concerns the mean() and median() functions: with the default na.rm = FALSE, they return NA for every column containing missing values, so the missing values will not be imputed!

The outputs of EXAMPLE 3a and EXAMPLE 3b:

All code blocks gathered into one script can be downloaded here:
