Replacing NAs via random forest (RF) model (randomForest library)

Part of the missing value – data imputation section

Alongside k-nearest neighbors, random forest (RF) is another approach for imputing entries that are missing completely at random (MCAR) or missing at random (MAR) in lipidomics and metabolomics data frames. Its good performance has been demonstrated by, among others, Wei et al. in their highly cited manuscript.

Wei et al. demonstrated that random forest imputation is an effective method in the case of MCAR and MAR.

Here, we will use the rfImpute() function from the randomForest library to impute the missing values in the data frame we created by removing random numeric entries from the lipidomics data (the data can be downloaded from "Example data sets" in the chapter "Introduction").
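For readers who want to try the workflow without the example file, a data frame with randomly removed numeric entries could be built roughly as follows. This is only a sketch: the toy table, its column names (lipid_A, lipid_B, lipid_C), and the 10% missingness rate are assumptions made for illustration, not the procedure used to prepare the example data set.

```r
# Hypothetical toy table standing in for the lipidomics data
set.seed(111)
toy <- data.frame(
  Label   = factor(rep(c("N", "T"), each = 10)),
  lipid_A = rnorm(20, mean = 10),
  lipid_B = rnorm(20, mean = 5),
  lipid_C = rnorm(20, mean = 2)
)

# Collect the numeric columns into a matrix so single cells can be addressed
numeric.cols <- names(toy)[sapply(toy, is.numeric)]
values <- as.matrix(toy[numeric.cols])

# Pick 10% of the numeric cells at random (assumed rate) and set them to NA
na.cells <- sample(length(values), size = round(0.1 * length(values)))
values[na.cells] <- NA

# Write the values back; the Label column stays complete
toy.missing <- toy
toy.missing[numeric.cols] <- as.data.frame(values)

# Count the NAs per column
colSums(is.na(toy.missing))
```

Only the numeric columns are touched here, so Label stays free of NAs. That matters because rfImpute() does not accept missing values in the response.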

# Installation of the randomForest library:
install.packages('randomForest')

# Activate the library
library(randomForest)

# Read the documentation of the function of interest, rfImpute():
?rfImpute

# read_xlsx() comes from the readxl library, so we activate it first:
library(readxl)

# We read the data into R and recheck (adjust) column types:
data.missing <- read_xlsx(file.choose())

# Print the data set in the console:
print(data.missing)

# Adjust the column `Label` to be a factor:
data.missing$Label <- as.factor(data.missing$Label)

# Since random processes are involved here, we need to set a seed for reproducibility:
set.seed(111)

# Imputation of missing values using random forest:
data.imputed.rf <- rfImpute(Label ~ ., data = data.missing[,-1], iter = 10)

# The first argument: Label ~ .
## We want to predict our Label based on all other columns (~ .).

# Second argument: data = data.missing[,-1]
## The data frame to impute: all columns except the first one, the <chr> column 'Sample Name'.

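# Third argument: iter = 10
## Here, we specify the number of imputation iterations.
## In each iteration, a new random forest is grown and its proximity matrix is used to update the imputed values.
## Growing a forest involves random sampling, which is why we set the seed above.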

# We can print the imputed data frame in the console:
print(data.imputed.rf)

# And proceed to the next step.
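Before moving on, it can help to confirm that no NAs remain and to put the sample names back, since they were dropped by data.missing[,-1]. The sketch below runs on a small made-up table so that it is self-contained. The toy data, the lipid_* column names, and iter = 5 are assumptions for illustration only. With the real data, the same checks would apply to data.imputed.rf and data.missing.

```r
library(randomForest)

set.seed(111)

# Hypothetical toy table with the same layout as data.missing:
# sample names first, then the Label factor, then numeric columns
toy <- data.frame(
  `Sample Name` = paste0("S", 1:30),
  Label   = factor(rep(c("N", "T"), each = 15)),
  lipid_A = rnorm(30, mean = rep(c(10, 12), each = 15)),
  lipid_B = rnorm(30, mean = 5),
  lipid_C = rnorm(30, mean = 2),
  check.names = FALSE
)

# Introduce a few NAs in the numeric columns
toy$lipid_A[c(3, 17)] <- NA
toy$lipid_B[c(8, 21, 29)] <- NA

# Impute as in the main text, leaving out the sample names
toy.imputed <- rfImpute(Label ~ ., data = toy[, -1], iter = 5)

# Check 1: no missing values should remain
anyNA(toy.imputed)

# Check 2: values that were observed should be unchanged
observed <- !is.na(toy$lipid_A)
all.equal(toy.imputed$lipid_A[observed], toy$lipid_A[observed])

# Re-attach the sample names; rfImpute() keeps the row order
toy.final <- cbind(toy["Sample Name"], toy.imputed)
head(toy.final)
```

rfImpute() returns the response (here Label) as the first column, followed by the imputed predictors, so the sample names can simply be bound in front.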
