Omics data visualization in R and Python

Basic data imputation in R with dplyr and tidyr (tidyverse)

A part of missing value – data imputation section


Last updated 3 months ago

The chapter shows basic data imputation in R, including imputation by a selected constant in the entire data set, column-wise imputation by a value close to LOQ, and column-wise imputation by a minimum column value, mean, or median.

The imputation by the percentage of the lowest concentration in a column and by zero value can be used for MNAR values, while mean and median value imputations may be applied to the imputation of missing clinical data or MCAR (MAR).

Besides the mutate() function and its variant mutate_if(), we need two functions from the tidyr library: drop_na() and replace_na(). The tidyr package also offers the fill() function, but we will not use it in this case. Both libraries (dplyr, tidyr) belong to the tidyverse collection.

First, we will call the tidyverse collection. We assume that you have already completed the installation as explained in the previous chapters; otherwise, it is necessary to install the package first.

# Basic data imputation methods in R
# Calling tidyverse collection
library(tidyverse)

As you probably realize, the drop_na() function is used to drop rows containing missing entries. Although simple, it is not a good solution for most metabolomics and lipidomics data sets as it could result in losing precious results. Therefore, we do not advise this option. Its application in R is very simple:

# Dropping rows containing missing values
data.without.NA <-
  data.missing %>%
  drop_na()

# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# drop_na()
# Store the new data frame as 'data.without.NA' in the global environment.
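Before dropping rows, it may be worth checking how many complete rows would remain. The sketch below uses a small, hypothetical tibble called 'toy' (an assumption for illustration, not the example data set of this book) in which every row contains at least one NA:

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: every row has at least one missing entry
toy <- tibble(
  sample    = c("S1", "S2", "S3"),
  `CE 16:1` = c(NA, 2.1, 3.4),
  `CE 18:1` = c(5.0, NA, NA)
)

# Number of rows without any NA, i.e., rows that drop_na() would keep
n.complete <- sum(complete.cases(toy))

# drop_na() keeps only the complete rows
toy.without.NA <- drop_na(toy)

print(n.complete)
print(nrow(toy.without.NA))
```

If the number of complete rows is zero or very small, drop_na() is not a practical option for the data set.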

In the case of our example data set, using drop_na(), we delete all rows as every row contains missing values, so the final tibble data.without.NA contains no data. Therefore, we cannot use this method. We will need to rely on a different approach, and in most cases, we will use the replace_na() function. Its construction is very simple and could be presented in the following way:

# The construction of replace_na():
# replace_na(data, replace)
#   data    - a vector (e.g., a column) or a data frame containing NAs
#   replace - the value used to substitute NAs
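As a minimal sketch of how replace_na() behaves, the block below applies it to a vector and to a data frame. The objects 'x' and 'df' are hypothetical and serve only as an illustration. For a vector, the replacement is a single value; for a data frame, it is a named list with one value per column:

```r
library(tibble)
library(tidyr)

# On a vector: the replacement is a single value
x <- c(1.2, NA, 3.4)
x.filled <- replace_na(x, 0)

# On a data frame: the replacement is a named list, one value per column
df <- tibble(conc = c(1.2, NA), group = c(NA, "control"))
df.filled <- replace_na(df, list(conc = 0, group = "unknown"))

print(x.filled)
print(df.filled)
```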

Data imputation using a selected constant

We begin with imputing missing values by a constant as the simplest example. First, we show how to substitute missing entries in one column and then across all columns. In this case, we rely on mutate() and replace_na(). See the example below:

# Data imputation by a constant:
# EXAMPLE 1a - Using '0' value to substitute NAs in one column - `CE 16:1`:
data.imputed.1a <-
  data.missing %>%
  mutate(`CE 16:1` = replace_na(`CE 16:1`, 0))

# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate() `CE 16:1` column and replace missing values with 0.
# Store the new data frame as 'data.imputed.1a' in the global environment.

print(data.imputed.1a)

We combine mutate() with replace_na() from the tidyr library to replace empty entries. The output of this code is presented below:

Remember: The code block above presents nothing else but the application of the regular mutate() function. This function is handy for the preprocessing of metabolomics/lipidomics data because they are handled via data frames (tibbles) in R. Using mutate(), we can work in a simple way with the columns of data frames (tibbles). Through mutate(), we can, for example, impute missing entries, transform values, delete/create columns in our metabolomics/lipidomics data sets, and perform many other operations.
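The operations listed above can be sketched in a single mutate() call. The tibble 'toy' and the column name 'CE sum' are hypothetical and used only for illustration:

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing concentration
toy <- tibble(
  sample    = c("S1", "S2", "S3"),
  `CE 16:1` = c(NA, 2.0, 4.0),
  `CE 18:1` = c(8.0, 16.0, 32.0)
)

toy.mutated <-
  toy %>%
  mutate(
    `CE 16:1` = replace_na(`CE 16:1`, 0),   # impute missing entries
    `CE sum`  = `CE 16:1` + `CE 18:1`,      # create a new column
    `CE 18:1` = log2(`CE 18:1`),            # transform values
    sample    = NULL                        # delete a column
  )

print(toy.mutated)
```

The arguments of mutate() are evaluated in order, so 'CE sum' is computed from the imputed `CE 16:1` and the untransformed `CE 18:1`.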

However, substituting missing entries column-by-column would be time-consuming and inefficient. Using across() and where(), all missing entries can be substituted by 0 at once. See the next code block:

# EXAMPLE 1b: Using '0' value to substitute NAs in all numeric columns.
data.imputed.1b.v1 <-
  data.missing %>%
  mutate(across(where(is.numeric), ~replace_na(., 0)))

print(data.imputed.1b.v1)

# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate() across columns where is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with 0.
# Store the new data frame as 'data.imputed.1b.v1' in the global environment.

NOTE: Here, a new important shorthand appears called tilde-dot: '~' and '.' symbols. These symbols substitute this long line of code:

# Without tilde-dot:
data.imputed.1b.v1 <-
  data.missing %>%
  mutate(across(where(is.numeric), function(.){replace_na(., 0)}))

'~' stands for function(...), and any of the symbols '.', '.x', or '..1' refers to the first argument of that function, which here is each column for which is.numeric returns TRUE.

# With tilde-dot
data.imputed.1b.v1 <-
  data.missing %>%
  mutate(across(where(is.numeric), ~replace_na(., 0)))

This shorthand is often used in the case of anonymous functions, which are functions created in arguments of other functions. These are also known as lambdas. The anonymous function is a function definition not connected to a name. An example could be:

# Anonymous function example:
(function(x) {x+2})(x = c(2,4))

This anonymous function adds 2 to every x from the vector c(2,4). In effect, we obtain:

> # Anonymous function example:
> (function(x) {x+2})(x = c(2,4))
[1] 4 6
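For comparison, the same anonymous function can be written in three ways. This is only a sketch: the backslash form assumes R version 4.1 or newer, and the formula form is shown here through map_dbl() from the purrr package (part of the tidyverse), which interprets '~' and '.x' in the same spirit as the examples above:

```r
library(purrr)

x <- c(2, 4)

# 1) Classic anonymous function
res.classic <- (function(x) {x + 2})(x)

# 2) Backslash shorthand (requires R >= 4.1)
res.backslash <- (\(x) x + 2)(x)

# 3) Formula shorthand with '~' and '.x', applied to every element of x
res.formula <- map_dbl(x, ~ .x + 2)

print(res.classic)
print(res.backslash)
print(res.formula)
```

All three variants add 2 to every element of x and return the same vector.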

We could also substitute NAs with 0 using the following lines of code:

# Data imputation with mutate() and different shorthands:
data.imputed.1b.v2 <-
  data.missing %>%
  mutate(across(where(is.numeric), ~replace_na(.x, 0)))

print(data.imputed.1b.v2)

# or
data.imputed.1b.v3 <-
  data.missing %>%
  mutate(across(where(is.numeric), ~replace_na(..1, 0)))

print(data.imputed.1b.v3)

Moreover, to simplify our code further, instead of mutate(), we can use mutate_if(). The mutate_if() function works with predicate functions (those that return TRUE or FALSE, e.g., is.numeric). Using mutate_if(), we can select all numeric columns without across() and where(). The code is shorter and easier to read. Note that since dplyr 1.0.0, mutate_if() is marked as superseded by across(); it still works, but across() is the recommended form for new code:

# Using mutate_if() for the data imputation:
data.imputed.1b.v4 <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., 0))

print(data.imputed.1b.v4)

All the examples above lead to the same result:

Of course, any constant value other than 0 could also be used in all these cases.
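As a brief sketch, the block below imputes a non-zero constant and checks that the across() and mutate_if() variants agree. The tibble 'toy' and the value stored in 'const' are hypothetical and chosen only for illustration:

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical toy data
toy <- tibble(
  sample    = c("S1", "S2", "S3"),
  `CE 16:1` = c(NA, 2.1, 3.4),
  `CE 18:1` = c(5.0, NA, 7.5)
)

# Hypothetical constant used instead of 0
const <- 0.01

# Variant with across() and where()
toy.v1 <-
  toy %>%
  mutate(across(where(is.numeric), ~replace_na(., const)))

# Variant with mutate_if()
toy.v4 <-
  toy %>%
  mutate_if(is.numeric, ~replace_na(., const))

# Do both variants give the same tibble?
same.result <- isTRUE(all.equal(toy.v1, toy.v4))
print(same.result)
```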

Data imputation by a constant for MNAR values (percentage of lowest concentration in a column)

Replacing missing values with 0 can significantly influence data architecture by shifting the mean and median. Generally, in OMICs data analysis, substituting many missing entries with a constant can harm data architecture. This approach can be used only if a few missing entries occur in the whole data set. A good example is the situation when NAs are present mainly in single columns corresponding to low-abundant lipids or metabolites, where it is possible that the concentration could not be determined because of reaching the limit of quantitation or even the limit of detection. In such a case, assuming that the concentration of a feature in some of the samples is <LOQ, we could, for example, use 80% of the lowest concentration in every column to replace missing values. Although this approach can introduce additional bias to the data (we use observed variables to impute the unobserved variables), it is often applied to metabolomics and lipidomics data.
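The shift of the mean and median described above can be checked on a small example. The vector 'conc' is hypothetical; the sketch compares the observed values only, zero imputation, and imputation by 80% of the lowest observed concentration:

```r
library(tidyr)

# Hypothetical concentrations of one low-abundant lipid
conc <- c(0.5, 0.6, 0.8, NA, NA, 1.1)

observed    <- conc[!is.na(conc)]
zero.fill   <- replace_na(conc, 0)
lowmin.fill <- replace_na(conc, 0.8 * min(conc, na.rm = TRUE))

# Mean and median for each approach
summary.table <- data.frame(
  method = c("observed only", "zero", "80% of minimum"),
  mean   = c(mean(observed), mean(zero.fill), mean(lowmin.fill)),
  median = c(median(observed), median(zero.fill), median(lowmin.fill))
)

print(summary.table)
```

In this toy case, both imputations lower the mean relative to the observed values, and zero imputation lowers it more strongly.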

We will use a constant value, but it will be specific for every feature and therefore different for every column. We need to modify our code for this imputation. See the code block below:

# EXAMPLE 2a
# Using a percentage of the lowest concentration in every column to replace NAs.
# Applicable if the NAs occur predominantly within single columns with low-abundant lipids/metabolites.
# In such a case, it is possible that the concentration was <LOQ and...
# ...we can assume replacing NAs with 80% of the lowest concentration is reasonable.
data.imputed.2a <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., 0.8 * min(., na.rm = TRUE)))
  
# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate_if(). If is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with 0.8*minimum_concentration_in_every_column...
# ...and for these computations, skip missing values: na.rm MUST BE set to TRUE (na.rm = T).
# This last argument is IMPORTANT, as the default setting is FALSE.
# With na.rm = F, min() returns NA for every column containing NAs,
# so the NAs would be 'replaced' by NA and stay missing.
# Store the new data frame as 'data.imputed.2a' in the global environment.

print(data.imputed.2a)
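The role of na.rm can be checked in isolation on a single vector. The sketch below uses a hypothetical vector 'x' and a hypothetical helper 'impute.low()' (neither is part of the original script). The helper additionally guards against columns containing only NAs, for which min() with na.rm = TRUE would return Inf with a warning:

```r
library(tidyr)

# Hypothetical vector with one missing entry
x <- c(NA, 1.5, 0.9)

min.default <- min(x)                # NA, because na.rm is FALSE by default
min.na.rm   <- min(x, na.rm = TRUE)  # lowest observed value

# Without na.rm = TRUE, the NA is 'replaced' by NA and stays missing
kept.NA <- replace_na(x, 0.8 * min(x))

# With na.rm = TRUE, the NA is replaced by 0.8 * 0.9
imputed <- replace_na(x, 0.8 * min(x, na.rm = TRUE))

# Hypothetical helper: leave a vector untouched if nothing was observed
impute.low <- function(v, fraction = 0.8) {
  if (all(is.na(v))) return(v)
  replace_na(v, fraction * min(v, na.rm = TRUE))
}

print(kept.NA)
print(imputed)
print(impute.low(c(NA_real_, NA_real_)))
```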

Note!

Again, we highlight the importance of setting na.rm to TRUE. In the example above it concerns the min() base R function! Otherwise, the missing values will not be imputed!

These lines of code produce the following output:

If we would like to apply substitution by a minimum column value, a simple modification of the code is necessary:

# EXAMPLE 2b
# Using the lowest concentration in every column to replace NAs.
# Applicable if the NAs occur predominantly within single columns with low-abundant lipids/metabolites.
data.imputed.2b <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., min(., na.rm = TRUE)))
  
# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate_if(). If is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with the minimum concentration of every column...
# ...and for these computations, skip missing values: na.rm MUST BE set to TRUE (na.rm = T).
# Store the new data frame as 'data.imputed.2b' in the global environment.

print(data.imputed.2b)

Note!

Again, we highlight the importance of setting na.rm to TRUE. In the example above it concerns the min() base R function! Otherwise, the missing values will not be imputed!

And the output:

Data imputation using mean or median value

For missing clinical entries or MCAR (MAR) values among numerous lipids, imputing with the median or mean could be a better approach, as it is less likely to disrupt the data structure significantly. The column means (or medians) then remain relatively unchanged, unlike when missing values are replaced with zeros or with a percentage of the lowest concentration. However, it is important to always investigate the cause of missing data and, if possible, remeasure or reprocess the data to prevent the risk of "creating data."

Please see the code block below for replacing missing entries in columns by the mean and median values, respectively:

# EXAMPLE 3a
# Using the mean value in every column to replace NAs.
data.imputed.3a <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., mean(., na.rm = TRUE)))
  
# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate_if(). If is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with the mean concentration of every column...
# ...and for these computations, skip missing values: na.rm MUST BE set to TRUE (na.rm = T).
# Store the new data frame as 'data.imputed.3a' in the global environment.

print(data.imputed.3a)

# EXAMPLE 3b
# Using the median value in every column to replace NAs.
data.imputed.3b <-
  data.missing %>%
  mutate_if(is.numeric, ~replace_na(., median(., na.rm = TRUE)))
  
# Explanation:
# Take 'data.missing' from the global environment, 
# Push it through the pipe to:
# mutate_if(). If is.numeric returns TRUE:
# for all these columns using replace_na() function...
# ...replace NA entries with the median concentration of every column...
# ...and for these computations, skip missing values: na.rm MUST BE set to TRUE (na.rm = T).
# Store the new data frame as 'data.imputed.3b' in the global environment.

print(data.imputed.3b)
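Mean imputation has a side effect that is easy to verify: the column mean stays the same, but the spread of the values shrinks because every imputed entry sits exactly at the mean. A minimal sketch on a hypothetical vector 'conc':

```r
library(tidyr)

# Hypothetical concentrations with two missing entries
conc <- c(2, 4, NA, 6, NA, 8)

# Replace NAs with the mean of the observed values
mean.fill <- replace_na(conc, mean(conc, na.rm = TRUE))

# Mean before and after imputation
mean.before <- mean(conc, na.rm = TRUE)
mean.after  <- mean(mean.fill)

# Standard deviation before and after imputation
sd.before <- sd(conc, na.rm = TRUE)
sd.after  <- sd(mean.fill)

print(c(mean.before = mean.before, mean.after = mean.after))
print(c(sd.before = sd.before, sd.after = sd.after))
```

The reduced standard deviation is one reason to keep the share of imputed entries low when this approach is used.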

Note!

Again, we highlight the importance of setting na.rm to TRUE. In the examples above, it concerns the mean() and median() base R functions! Otherwise, the missing values will not be imputed!

The outputs from these lines of code:

All code blocks gathered into one script can be downloaded here:

Basic data imputation in R.R (6KB): script containing all code blocks for basic data imputation in R using mutate() and replace_na().

tidyr documentation: https://tidyr.tidyverse.org/
Substituting missing concentrations in first column by 0 using mutate() and replace_na().
Substituting missing values with 0 in all columns containing NA entries using mutate_if() and replace_na().
Substituting NAs via 80% of lowest concentration in every column using mutate_if() and replace_na() functions.
Missing data imputation using lowest concentration value in every column with mutate_if() and replace_na().
EXAMPLE 3a: Missing data imputation via mean column concentration using mutate_if() and replace_na() functions.
EXAMPLE 3b: Missing data imputation via median column concentration using mutate_if() and replace_na() functions.