Omics data visualization in R and Python

Replacing NAs via k-nearest neighbor (kNN) model (VIM library)

Part of the missing values – data imputation section


Last updated 2 months ago

Another way to replace missing entries, such as those missing completely at random (MCAR) or missing at random (MAR), is to estimate them with a model. In metabolomics and lipidomics, missing observations are frequently replaced using, e.g., the k-nearest neighbor (kNN) model. As an example, take a look at the following manuscript:

  • M. Kaleta et al. Patients with Neurodegenerative Proteinopathies Exhibit Altered Tryptophan Metabolism in the Serum and Cerebrospinal Fluid. ACS Chemical Neuroscience (2024). DOI: https://doi.org/10.1021/acschemneuro.3c00611

Some studies also report that kNN is suitable for MNAR data, e.g., N. Frölich et al. for shotgun lipidomics data.

The kNN model estimates each missing value from the most similar (neighboring) samples (data points). kNN imputation can be easily implemented using the VIM package (Visualization and Imputation of Missing Values), which is available at CRAN.
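To make the idea of "neighboring samples" more concrete, here is a minimal base R sketch that imputes a single missing cell. It is a simplified illustration and not the algorithm implemented in VIM: the toy values, the helper name impute_cell_knn(), and the use of the Euclidean distance are assumptions made for this example only (VIM computes its own distance and offers many more options).

```r
# Toy data (hypothetical values): 6 samples x 3 lipids, one missing entry.
toy <- data.frame(
  lipid_a = c(1.0, 1.1, 0.9, 5.0, 5.2, 1.05),
  lipid_b = c(2.0, 2.1, 1.9, 8.0, 8.1, 2.05),
  lipid_c = c(3.0, 3.2, 2.8, 9.0, 9.5, NA)
)

# Simplified kNN imputation of one missing cell (illustration only).
# Assumes that all other columns are complete for the rows involved.
impute_cell_knn <- function(data, row, col, k = 3) {
  predictors <- setdiff(names(data), col)
  # Candidate donors: all other rows in which the target column is observed.
  donors <- setdiff(which(!is.na(data[[col]])), row)
  # Euclidean distance between the incomplete sample and each donor,
  # computed on the remaining columns.
  target <- unlist(data[row, predictors])
  dists <- apply(data[donors, predictors, drop = FALSE], 1,
                 function(x) sqrt(sum((x - target)^2)))
  # Keep the k closest donors.
  nearest <- donors[order(dists)][seq_len(min(k, length(donors)))]
  # Aggregate the neighbors' observed values with the median.
  median(data[[col]][nearest])
}

# Samples 1-3 are the closest to sample 6, so the median of their
# lipid_c values (3.0, 3.2, 2.8) is used, i.e., 3.0.
toy$lipid_c[6] <- impute_cell_knn(toy, row = 6, col = "lipid_c", k = 3)
print(toy)
```

In practice, the columns should be on comparable scales before distances are computed; otherwise, the variable with the largest range dominates the choice of neighbors.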

First, we will install this library (this operation is performed only once), then load it, and read the documentation of the function of interest, which in this case is kNN():

# Installing VIM package:
install.packages('VIM')

# Calling VIM package:
library(VIM)

# Reading the documentation of the kNN() function:
?kNN

Applying the kNN() function is quite straightforward, which makes it one of the most frequently used methods in OMICs analysis. We will set the number of neighbors (k) to 10 and switch the imp_var argument to FALSE, as we do not need the additional indicator columns showing which lipid concentrations in our tibble were imputed:

# Imputing missing values through the kNN model with the VIM package
# (as_tibble() requires the tibble/tidyverse package to be loaded):
data.imputed.knn <- as_tibble(kNN(data.missing, k = 10, imp_var = FALSE))

# kNN() returns a data frame; as_tibble() converts it into a tibble.
print(data.imputed.knn)

We obtain the following final output:

Figure: Replacing missing values in the 'data.missing' tibble via kNN model (VIM package).

Linked resources:

  • M. Kaleta et al. (2024): https://doi.org/10.1021/acschemneuro.3c00611
  • N. Frölich et al. Imputation of missing values in lipidomic datasets (Analytical Science Journals). The authors report that kNN is suitable for imputing MNAR in shotgun lipidomics data.
  • VIM: Visualization and Imputation of Missing Values. VIM package at CRAN.
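After an imputation like the one above, it can be useful to confirm that no NAs remain and, if needed, to see which entries were filled in. The sketch below uses a small hypothetical data frame (toy) instead of 'data.missing' and keeps imp_var = TRUE, so that kNN() appends logical indicator columns with the "_imp" suffix. The toy values and object names are assumptions for illustration only.

```r
library(VIM)

# Hypothetical toy data with two missing concentrations.
toy <- data.frame(
  lipid_a = c(1.0, 1.1, 0.9, 5.0, 5.2, 1.05, 4.9, 5.1),
  lipid_b = c(2.0, NA, 1.9, 8.0, 8.1, 2.05, 7.9, 8.2),
  lipid_c = c(3.0, 3.2, 2.8, 9.0, NA, 3.1, 9.2, 9.4)
)

# imp_var = TRUE appends logical columns (suffix "_imp")
# that mark the imputed entries.
toy.flagged <- kNN(toy, k = 3, imp_var = TRUE)

# No missing values should remain in the original columns.
sum(is.na(toy.flagged[names(toy)]))

# Count the imputed entries per indicator column.
flag.cols <- grep("_imp$", names(toy.flagged), value = TRUE)
colSums(toy.flagged[flag.cols])
```

Keeping the indicator columns is optional, but they make it easy to highlight or exclude imputed values in later plots and statistics.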