Omics data visualization in R and Python

Detecting missing values


Last updated 4 months ago

Required packages

The required packages for this section are pandas, matplotlib and seaborn. Reading .xlsx files with pandas additionally requires openpyxl. All four can be installed by running the following command in the Command Prompt (Windows) or the Terminal (macOS/Linux):

pip install pandas matplotlib seaborn openpyxl

Loading the data

Place the downloaded Lipidomics_missing_values_EXAMPLE.xlsx file in the same folder as your JupyterLab notebook. Then run the following code in Jupyter:

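import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the example file; decimal="," handles numbers written with a decimal comma
df = pd.read_excel("Lipidomics_missing_values_EXAMPLE.xlsx", decimal=",")
# Use the sample names as the row index so that only measurements remain as columns
df.set_index("Sample Name", inplace=True)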
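Before looking for gaps, it can help to confirm that the table loaded as expected. The following is a minimal sketch on a small hypothetical table (`toy`, with made-up lipid names and values) rather than the Excel file itself. It lists columns that did not load as numbers: such columns may contain text entries that `isnull()` does not count as missing. The placeholder "n.d." is an assumption for illustration; the example file is not known to contain it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the loaded lipidomics matrix
toy = pd.DataFrame(
    {
        "Sample Name": ["S1", "S2", "S3"],
        "Lipid A": [1.2, np.nan, 3.4],
        "Lipid B": ["0.5", "n.d.", "2.1"],  # text column with an assumed placeholder
    }
).set_index("Sample Name")

print(toy.shape)   # (number of samples, number of species)
print(toy.dtypes)  # non-numeric dtypes hint at text entries

# Columns that were not read as numbers
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric columns:", non_numeric)

# Convert every column to numbers; entries that cannot be parsed become NaN
cleaned = toy.apply(pd.to_numeric, errors="coerce")

missing_before = int(toy.isnull().sum().sum())
missing_after = int(cleaned.isnull().sum().sum())
print("Missing before/after coercion:", missing_before, missing_after)
```

If the two counts differ, some missing values were hidden behind text entries, and the coerced table is the safer input for the steps that follow.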

We can generate a heatmap of the missing values across the whole table (white cells indicate missing values):

plt.figure(figsize=(24, 30))  # Modify the width and height as needed
sns.heatmap(df.isnull(), cbar=False)
plt.savefig("missing_values_heatmap.png", dpi=200, bbox_inches='tight') 
plt.show()
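A heatmap gives a visual impression, and a few summary numbers can complement it. The sketch below computes the total number of missing cells, the overall percentage, and the number of samples with no gaps at all. It runs on a hypothetical toy table (`toy`, made-up values); the same calls should apply to `df` from above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: 4 samples (rows) x 3 lipid species (columns)
toy = pd.DataFrame(
    {
        "Sample Name": ["S1", "S2", "S3", "S4"],
        "Lipid A": [1.2, np.nan, 3.4, 0.8],
        "Lipid B": [np.nan, np.nan, 2.1, 1.9],
        "Lipid C": [5.0, 4.1, 3.3, 2.7],
    }
).set_index("Sample Name")

# Total number of missing cells in the table
total_missing = int(toy.isnull().sum().sum())

# Share of missing cells among all cells, in percent
overall_pct = 100 * total_missing / toy.size

# Number of samples (rows) without any missing value
complete_samples = int(toy.notnull().all(axis=1).sum())

print(f"Missing cells: {total_missing} of {toy.size} ({overall_pct:.1f}%)")
print(f"Complete samples: {complete_samples} of {len(toy)}")
```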

We can also plot the percentage of missing values in each sample:

# Calculate percentage of missing values per row (sample)
missing_percentage_per_sample = df.isnull().mean(axis=1) * 100

# Bar plot for missing values per sample
plt.figure(figsize=(30, 6))
missing_percentage_per_sample.sort_values(ascending=False).plot(kind='bar')
plt.title("Percentage of Missing Values per sample")
plt.xlabel("Samples")
plt.ylabel("Percentage Missing")
plt.tight_layout()
plt.show()
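With many samples the bar labels can be hard to read, so it may be useful to list the samples with the most gaps by name. A minimal sketch on a hypothetical toy table follows; the 50% cut-off (`threshold`) is an arbitrary value chosen for illustration, not a recommendation from this guide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: 4 samples (rows) x 3 lipid species (columns)
toy = pd.DataFrame(
    {
        "Sample Name": ["S1", "S2", "S3", "S4"],
        "Lipid A": [1.2, np.nan, 3.4, 0.8],
        "Lipid B": [np.nan, np.nan, 2.1, 1.9],
        "Lipid C": [5.0, 4.1, 3.3, 2.7],
    }
).set_index("Sample Name")

# Percentage of missing values per row (sample)
missing_pct = toy.isnull().mean(axis=1) * 100

# Arbitrary illustrative cut-off, in percent
threshold = 50

# Names of the samples at or above the cut-off
flagged = missing_pct[missing_pct >= threshold].index.tolist()

print(missing_pct.sort_values(ascending=False).round(1))
print("Samples at or above threshold:", flagged)
```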

And for the lipid species (columns):

# Calculate percentage of missing values per column (lipid species)
missing_percentage_per_species = df.isnull().mean(axis=0) * 100

# Bar plot for missing values per species
plt.figure(figsize=(30, 6))
missing_percentage_per_species.sort_values(ascending=False).plot(kind='bar')
plt.title("Percentage of Missing Values per species")
plt.xlabel("Species")
plt.ylabel("Percentage Missing")
plt.tight_layout()
plt.show()
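When the table has many lipid species, a sorted table of counts and percentages per column can be easier to scan than a wide bar chart. The following sketch builds such a table on a hypothetical toy data set; the column names `n_missing` and `pct_missing` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: 4 samples (rows) x 3 lipid species (columns)
toy = pd.DataFrame(
    {
        "Sample Name": ["S1", "S2", "S3", "S4"],
        "Lipid A": [1.2, np.nan, 3.4, 0.8],
        "Lipid B": [np.nan, np.nan, 2.1, 1.9],
        "Lipid C": [5.0, 4.1, 3.3, 2.7],
    }
).set_index("Sample Name")

# One row per species: number and percentage of missing values
summary = pd.DataFrame(
    {
        "n_missing": toy.isnull().sum(),
        "pct_missing": toy.isnull().mean() * 100,
    }
).sort_values("pct_missing", ascending=False)

print(summary)
```

Sorting in descending order puts the species with the most gaps first, which makes candidates for closer inspection easy to spot.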
Example data file: Lipidomics_missing_values_EXAMPLE.xlsx (248 KB)

Figure: Heatmap indicating missing values in the data table; white cells indicate missing data.