Detecting missing values (DataExplorer R package)

A part of preparing data for statistical analysis and visualization

Many R functions require that their inputs contain no missing values. Before moving on to the analysis itself, we therefore introduce DataExplorer (https://boxuancui.github.io/DataExplorer/), an R package for exploratory data analysis that combines simple code with informative output.

We will use DataExplorer to learn about the number and distribution of missing entries in the example data set we created in R for you.

Download Data set no. 2 from the subchapter Exemplary data sets (chapter: Introduction).

Set the working directory, install the DataExplorer package, and load all necessary libraries (DataExplorer, readxl, and the tidyverse collection). Then read the data from the working directory into R as 'data.missing'. Check that the created object is a tibble, and adjust the column types if necessary:

# Setting working directory (wd).
# Create a folder containing the Excel sheet with Data set no.2, e.g., on D: drive.
# Indicate this folder as a "dir" in setwd().
# Use: setwd("dir"), e.g.:
setwd("D:/Data analysis")

# Data analysis folder must contain: "Lipidomics_missing_values_EXAMPLE.xlsx".

# Installing the DataExplorer package.
install.packages('DataExplorer')
# Remember, a package needs to be installed only once!

# Calling libraries necessary for the inspection
library(DataExplorer)
library(readxl)
library(tidyverse)

# Reading new data into R, checking object type and column types:
data.missing <- read_xlsx("Lipidomics_missing_values_EXAMPLE.xlsx")
# Alternatively, pick the file interactively: read_xlsx(file.choose())
print(data.missing)
data.missing$Label <- as.factor(data.missing$Label)
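
The type check described above can be sketched on a small made-up table. This is only an illustration: `toy` and its values are hypothetical and stand in for 'data.missing', and base R is used so that the snippet runs without the Excel file. For an object returned by read_xlsx(), the tibble check returns TRUE.

```r
# Hypothetical stand-in for 'data.missing' (made-up values).
toy <- data.frame(
  `Sample Name` = c("S1", "S2", "S3"),
  Label = c("N", "T", "N"),
  `LPC 16:0` = c(1.2, NA, 0.8),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Is the object a tibble? Tibbles carry the class "tbl_df".
is_tbl <- inherits(toy, "tbl_df")   # FALSE for this plain data frame

# Column types before the adjustment.
types_before <- sapply(toy, class)
print(types_before)

# Convert the grouping column to a factor and inspect its levels.
toy$Label <- as.factor(toy$Label)
print(levels(toy$Label))
```

With the tidyverse loaded, is_tibble(data.missing) and glimpse(data.missing) give the same information more compactly.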

We can get acquainted with the object we created using the introduce() function. The introduce() function creates a tibble with basic information about our data set, which we will store as 'introduction':

# Getting to know our data using introduce() function from the DataExplorer:
introduction <- introduce(data.missing)
View(introduction)

We obtain the following output:
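
The fields that introduce() reports (rows, columns, discrete_columns, continuous_columns, total_missing_values, complete_rows, and total_observations, among others) can be cross-checked by hand. The sketch below recomputes them in base R for a small made-up table. `toy` and its values are hypothetical, and treating every non-numeric column as discrete is a simplifying assumption.

```r
# Hypothetical toy table (made-up values, not Data set no. 2).
toy <- data.frame(
  Label = factor(c("N", "N", "T", "T")),
  `LPC 16:0` = c(1.2, NA, 3.1, 2.2),
  `PC 32:0` = c(NA, NA, 0.7, 0.9),
  check.names = FALSE
)

rows <- nrow(toy)
columns <- ncol(toy)

# Numeric columns count as continuous, everything else as discrete.
continuous_columns <- sum(sapply(toy, is.numeric))
discrete_columns <- columns - continuous_columns

# Every cell of the table is one observation.
total_observations <- rows * columns
total_missing_values <- sum(is.na(toy))

# A row is complete only if none of its cells is NA.
complete_rows <- sum(complete.cases(toy))

# Share of missing observations, in percent.
pct_missing <- 100 * total_missing_values / total_observations
print(pct_missing)
```

Comparing these numbers with introduce(toy) is a quick way to confirm how each field is defined.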

The same information can be plotted with the plot_intro() function, which is applied directly to the data set:

# Plotting information on the data set via plot_intro():
plot_intro(data.missing)

We obtain the following chart:

Looking at the chart, we immediately see that we will mostly handle continuous columns (98.4%, here containing lipid concentrations); the only discrete column, the Label factor, accounts for the remaining 1.6%. Another important piece of information is that none of the rows of the tibble is complete, meaning that every measured sample contains missing values. In total, 29.5% of the observations are missing, which is a substantial share. We will try to impute them. Finally, we can take a look at the profiles of missing values using the plot_missing() function:

# Inspecting profiles of missing values using plot_missing() from DataExplorer.
# We will inspect the missing values' profiles class by class.
# This avoids a huge, unreadable output.
# For this purpose, we will use select() with helper starts_with('lipid_class_name').

# EXAMPLE 1: Plotting missing values profiles for LPC, PC, PC O-, and PC P-.
data.missing %>% 
  select(`Sample Name`, 
         `Label`, 
         starts_with('LPC') | starts_with('PC')) %>%
  plot_missing()
  
# Explanation:
# Take 'data.missing' from the global environment.
# Push through the pipe to select(),
# select() columns: `Sample Name`, `Label`, and columns whose names start with the strings
# 'LPC' OR 'PC' (the prefix 'PC' also covers PC O- and PC P-).
# Push the selected columns to the plot_missing() function from DataExplorer library.

# EXAMPLE 2: Plotting missing values profiles for MG, DG, and TG.
data.missing %>% 
  select(starts_with('MG') | starts_with('DG') | starts_with('TG')) %>%
  plot_missing()
  
# Explanation:
# Take 'data.missing' from the global environment.
# Push through the pipe to select(),
# select() columns whose names start with strings:
# 'MG' OR 'DG' OR 'TG'.
# Push the selected columns to the plot_missing() function from DataExplorer library.

# EXAMPLE 3: Plotting missing values profiles for Cer and SM.
data.missing %>% 
  select(starts_with('Cer') | starts_with('SM')) %>%
  plot_missing()

# Explanation:
# Take 'data.missing' from the global environment.
# Push through the pipe to select(),
# select() columns whose names start with strings:
# 'Cer' OR 'SM'.
# Push the selected columns to the plot_missing() function from DataExplorer library.

# EXAMPLE 4: Plotting missing values profiles for CE. 
# NOTE: we keep a space after CE because starts_with() ignores case by default,
# so 'CE' alone would also select the Cer columns.
data.missing %>% 
  select(starts_with('CE ')) %>%
  plot_missing()
  
# Explanation:
# Take 'data.missing' from the global environment.
# Push through the pipe to select(),
# select() columns whose names start with the string:
# 'CE<space>'.
# Push the selected columns to the plot_missing() function from DataExplorer library.
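
One detail of EXAMPLE 4 is easy to miss: starts_with() ignores case by default, so the prefix 'CE' without the trailing space would also pick up the Cer columns. The base-R sketch below mimics that matching with startsWith() on upper-cased names. The column names and the helper prefix_match() are hypothetical.

```r
# Hypothetical column names in the style of the lipid data set.
cols <- c("CE 16:0", "CE 18:1", "Cer 34:1;O2", "SM 34:1;O2")

# Case-insensitive prefix match, mimicking the default of starts_with().
prefix_match <- function(prefix, x) {
  startsWith(toupper(x), toupper(prefix))
}

without_space <- cols[prefix_match("CE", cols)]
with_space <- cols[prefix_match("CE ", cols)]

print(without_space)  # CE and Cer columns
print(with_space)     # CE columns only
```

In dplyr itself, starts_with('CE', ignore.case = FALSE) would be an alternative to the trailing space.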

The output from EXAMPLE 1:

It is worth highlighting that these plots have only diagnostic value and are usually not presented in manuscripts. The DataExplorer output also suggests whether a column should be kept and its missing values replaced (Band: Good or OK), considered for removal (Band: Bad), or removed because it contains little or no data (Band: Remove).
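
To make the bands concrete, the sketch below assigns them by hand in base R. It assumes the default thresholds of plot_missing() (up to 5% missing: Good, up to 40%: OK, up to 80%: Bad, above that: Remove); check ?plot_missing for the values in your installed version. The table `toy` and its values are made up.

```r
# Hypothetical toy table with 0%, 25%, 75%, and 100% missing values.
toy <- data.frame(
  A = c(1.0, 2.0, 3.0, 4.0),
  B = c(1.0, NA, 3.0, 4.0),
  C = c(NA, NA, NA, 4.0),
  D = c(NA_real_, NA, NA, NA)
)

# Fraction of missing values in each column.
frac_missing <- colMeans(is.na(toy))

# Assumed default thresholds: Good <= 5%, OK <= 40%, Bad <= 80%, Remove <= 100%.
bands <- cut(
  frac_missing,
  breaks = c(-Inf, 0.05, 0.4, 0.8, 1),
  labels = c("Good", "OK", "Bad", "Remove")
)

result <- data.frame(
  column = names(frac_missing),
  pct_missing = 100 * frac_missing,
  band = bands,
  row.names = NULL
)
print(result)
```

The same table makes it easy to list the columns that fall into a given band before deciding what to filter out.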

The script containing functions from DataExplorer for the initial inspection of the data set can be downloaded here:

In the upcoming steps, we will demonstrate how to filter out columns with a high percentage of missing values and how to impute NAs in the remaining columns.
