
Detecting missing values (DataExplorer R package)

A part of preparing data for statistical analysis and visualization


Last updated 3 months ago

Many R functions require inputs that contain no missing values. Before we proceed to the data analysis, we will therefore introduce DataExplorer, a useful R package for exploratory data analysis (documentation link at the end of this page). DataExplorer combines simple code with informative outputs.

We will apply DataExplorer to learn about the number and distribution of missing entries in the example data set we prepared for you.

Download Data set no. 2 from the subchapter Example data sets (chapter: Introduction).

Set the working directory, install the DataExplorer library, and load all necessary libraries (DataExplorer, readxl, tidyverse collection). Then, read the data from the working directory into R as 'data.missing'. Check that the created object is a tibble, and adjust the column types if necessary:

# Setting working directory (wd).
# Create a folder containing the Excel sheet with Data set no.2, e.g., on D: drive.
# Indicate this folder as a "dir" in setwd().
# Use: setwd("dir"), e.g.:
setwd("D:/Data analysis")

# Data analysis folder must contain: "Lipidomics_missing_values_EXAMPLE.xlsx".

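# Installing the DataExplorer library.
# Remember, the package needs to be installed only once, so we install it only if it is missing:
if (!requireNamespace("DataExplorer", quietly = TRUE)) {
  install.packages("DataExplorer")
}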

# Calling libraries necessary for the inspection
library(DataExplorer)
library(readxl)
library(tidyverse)

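# Reading new data into R from the working directory, checking object type and column types:
data.missing <- read_xlsx("Lipidomics_missing_values_EXAMPLE.xlsx")
# Alternatively, pick the file interactively: read_xlsx(file.choose())
print(data.missing)
# Label is the grouping column, so we store it as a factor:
data.missing$Label <- as.factor(data.missing$Label)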
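
The check of the object and column types described above can be sketched on a small stand-in table. The tibble 'toy' and its values are hypothetical and only mimic the layout of 'data.missing'; is_tibble() comes from the tibble package, which is part of the tidyverse collection.

```r
library(tibble)

# Hypothetical stand-in for 'data.missing' (three samples, two lipids).
toy <- tibble(
  `Sample Name` = c("S1", "S2", "S3"),
  Label         = c("N", "T", "T"),
  `PC 34:1`     = c(1.2, NA, 0.8),
  `SM 34:1;O2`  = c(NA, 2.1, 1.9)
)

# Confirm that the object is a tibble (read_xlsx() returns one as well).
is_tibble(toy)

# Class of every column; Label is still a character vector at this point.
col_classes <- sapply(toy, class)
print(col_classes)

# Convert the grouping column to a factor and inspect its levels.
toy$Label <- as.factor(toy$Label)
levels(toy$Label)
```

If other columns turn out to have an unexpected type (for example, concentrations read as text), they can be adjusted in the same way with as.numeric().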

We can get to know the object we created using the introduce() function. The introduce() function creates a one-row table with basic information about our data set (numbers of rows and columns, column types, missing values, complete rows), and we will store it as 'introduction':

# Getting to know our data using introduce() function from the DataExplorer:
introduction <- introduce(data.missing)
View(introduction)
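
To make the summary easier to interpret, the same kind of numbers can be cross-checked by hand in base R. This is only a sketch on a hypothetical data frame 'toy'; the variable names are chosen for illustration and are assumed to correspond to the fields reported by introduce().

```r
# Hypothetical toy data standing in for 'data.missing'.
toy <- data.frame(
  Label   = factor(c("N", "T", "T", "N")),
  lipid_a = c(1.2, NA, 0.8, 1.1),
  lipid_b = c(NA, 2.1, NA, 1.7),
  lipid_c = c(0.4, 0.5, 0.6, 0.7)
)

n_rows        <- nrow(toy)
n_cols        <- ncol(toy)
n_continuous  <- sum(sapply(toy, is.numeric))  # numeric columns
n_discrete    <- n_cols - n_continuous         # factor or character columns
total_missing <- sum(is.na(toy))               # number of NA cells
complete_rows <- sum(complete.cases(toy))      # rows without any NA

# Share of missing cells among all cells (rows x columns), in percent.
pct_missing <- 100 * total_missing / (n_rows * n_cols)

print(c(rows = n_rows, columns = n_cols, discrete = n_discrete,
        continuous = n_continuous, missing = total_missing,
        complete_rows = complete_rows, pct_missing = pct_missing))
```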

We obtain the following output:

The same information can be presented as a plot via the plot_intro() function, which takes the data set itself as input:

# Plotting information on the data set via plot_intro():
plot_intro(data.missing)

We obtain the following chart:

The chart shows that we will mostly handle continuous columns (98.4%, containing lipid concentrations in this case), while discrete columns make up only 1.6% and are represented by the Label factor column. Importantly, none of the rows of the tibble is complete, meaning that every measured sample contains missing values. In total, 29.5% of all observations are missing, which is a substantial share, so we will try to impute them.

Finally, we can take a look at the profiles of missing values using the plot_missing() function:

# Inspecting profiles of missing values using plot_missing() from DataExplorer.
# We will inspect missing values' profiles class by class. 
# It will allow us to avoid huge, unreadable output.
# For this purpose, we will use select() with helper starts_with('lipid_class_name').

# EXAMPLE 1: Plotting missing values profiles for LPC, PC, PC O-, and PC P-.
data.missing %>% 
  select(`Sample Name`, 
         `Label`, 
         starts_with('LPC') | starts_with('PC')) %>%
  plot_missing()
  
# Explanation:
# Take 'data.missing' from the global environment.
# Push through the pipe to select(),
# select() columns: `Sample Name`, `Label`, and columns whose names start with strings:
# 'LPC' OR 'PC'.
# Push the selected columns to the plot_missing() function from DataExplorer library.

# EXAMPLE 2: Plotting missing values profiles for MG, DG, and TG.
data.missing %>% 
  select(starts_with('MG') | starts_with('DG') | starts_with('TG')) %>%
  plot_missing()
  
# Explanation:
# Take 'data.missing' from the global environment.
# Push through the pipe to select(),
# select() columns whose names start with strings:
# 'MG' OR 'DG' OR 'TG'.
# Push the selected columns to the plot_missing() function from DataExplorer library.

# EXAMPLE 3: Plotting missing values profiles for Cer and SM.
data.missing %>% 
  select(starts_with('Cer') | starts_with('SM')) %>%
  plot_missing()

# Explanation:
# Take 'data.missing' from the global environment.
# Push through the pipe to select(),
# select() columns whose names start with strings:
# 'Cer' OR 'SM'.
# Push the selected columns to the plot_missing() function from DataExplorer library.

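# EXAMPLE 4: Plotting missing values profiles for CE.
# NOTE: starts_with() ignores letter case by default, so 'CE' alone would also match 'Cer' columns.
# We keep a space after CE so that Cer columns are not selected.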
data.missing %>% 
  select(starts_with('CE ')) %>%
  plot_missing()
  
# Explanation:
# Take 'data.missing' from the global environment.
# Push through the pipe to select(),
# select() columns whose names start with strings:
# 'CE<space>'.
# Push the selected columns to the plot_missing() function from DataExplorer library.
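
The note in EXAMPLE 4 can be illustrated with a short sketch. The tibble 'toy' and its column names are hypothetical; the sketch only assumes that starts_with() matches prefixes case-insensitively unless ignore.case = FALSE is set, which is its documented default.

```r
library(dplyr)

# Hypothetical columns: one CE species, one Cer species, one SM species.
toy <- tibble(
  `CE 16:0`     = c(1.0, NA),
  `Cer 34:1;O2` = c(NA, 0.3),
  `SM 34:1;O2`  = c(0.7, 0.9)
)

# 'CE' without a space: case is ignored, so the Cer column is matched too.
ce_loose <- names(select(toy, starts_with("CE")))

# 'CE ' with a trailing space: only the CE column is matched.
ce_space <- names(select(toy, starts_with("CE ")))

# Alternative: keep 'CE' but switch to case-sensitive matching.
ce_strict <- names(select(toy, starts_with("CE", ignore.case = FALSE)))

print(ce_loose)
print(ce_space)
print(ce_strict)
```

Both the trailing space and ignore.case = FALSE exclude the Cer columns here; the trailing space relies on the naming convention of the data set, while ignore.case = FALSE relies on consistent capitalization.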

The output from EXAMPLE 1:

It is worth highlighting that these plots have only diagnostic value and are usually not presented in manuscripts. The output from DataExplorer also suggests how to treat each column based on its share of missing values: keep it and replace the missing values (Band: Good or OK), consider removing it (Band: Bad), or remove it because it contains little or no data (Band: Remove). With the default settings of plot_missing(), the bands end at 5% (Good), 40% (OK), 80% (Bad), and 100% (Remove) of missing values.
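
The banding idea can be sketched in base R as follows. The data frame 'toy' is hypothetical, and the cut-offs (5%, 40%, 80%, 100%) are assumed to match the default 'group' argument of plot_missing(); how DataExplorer treats values lying exactly on a boundary is not verified here. DataExplorer also provides profile_missing(), which returns the per-column missing counts as a table.

```r
# Hypothetical toy data: four lipids with increasing shares of missing values.
toy <- data.frame(
  lipid_a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),        # no missing values
  lipid_b = c(1, NA, NA, 4, 5, 6, 7, 8, 9, 10),      # 20% missing
  lipid_c = c(NA, NA, NA, NA, NA, NA, 7, 8, 9, 10),  # 60% missing
  lipid_d = rep(NA_real_, 10)                        # 100% missing
)

# Fraction of missing values per column.
pct_missing <- colMeans(is.na(toy))

# Assign each column to a band (assumed cut-offs, see the text above).
band <- cut(
  pct_missing,
  breaks = c(-Inf, 0.05, 0.4, 0.8, 1),
  labels = c("Good", "OK", "Bad", "Remove")
)

summary_tbl <- data.frame(
  feature     = names(pct_missing),
  pct_missing = unname(pct_missing),
  band        = band
)
print(summary_tbl)
```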

The script containing functions from DataExplorer for the initial inspection of the data set can be downloaded here:

First data inspection with DataExplorer - detection of missing values.R (3KB). Missing values inspection using the DataExplorer library.

In the upcoming steps, we will demonstrate how to filter out columns with a high percentage of missing values and how to impute NAs in the remaining columns.

DataExplorer documentation: https://boxuancui.github.io/DataExplorer/

Figure captions:
  • A tibble with information about our 'data.missing' object generated by introduce() function from DataExplorer package.
  • Chart with summary of the 'data.missing' tibble.
  • Missing values profiles for LPC, PC, PC P-, and PC O- were plotted using plot_missing() from the DataExplorer library (example of the obtained plot).