
Data imputation

For installing and loading the necessary packages and the example data set, we refer to the previous section in this chapter ("Detecting missing values"), and we assume that species with high missingness have already been filtered out (described in "Filtering out columns containing mostly NAs"). In addition, for KNN imputation, we need to install scikit-learn:

pip install -U scikit-learn

The simplest form of imputation is filling in the missing values with zero, which we apply here to the filtered DataFrame df_filtered obtained in the previous section, "Filtering out columns containing mostly NAs":

# Replace all missing values (NaN) with 0
df_imputed = df_filtered.fillna(0)
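A quick sanity check confirms that no missing values remain after this step (a minimal sketch operating on the df_imputed DataFrame created above):

# Count the remaining missing values across the entire DataFrame
print(df_imputed.isna().sum().sum())  # expected output: 0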

Data imputation for missing values occurring predominantly among low-abundant features

Replacing missing values with 0 can significantly distort the data architecture by shifting the means and medians. In general, substituting many missing entries in OMICs data with a constant is harmful, so this approach should be used only if a few missing entries occur in the whole data set. A good example is the situation when NAs are present mainly in single columns corresponding to low-abundant lipids or metabolites, where the concentration may not have been determined because it fell below the limit of quantitation (LOQ) or even the limit of detection (LOD). In such a case, assuming that the concentration of a feature in some of the samples is <LOQ, we could, for example, replace the missing values in every column with 80% of the lowest concentration measured in that column.

We will still use a constant value, but one that is specific to every feature and therefore different for every column. See the code block below:

# Define a function for imputation
def impute_with_80_percent_min(series):
    if pd.api.types.is_numeric_dtype(series):  # Check if the column is numeric
        min_value = series.min(skipna=True)   # Calculate the minimum value, ignoring NaN
        imputed_value = 0.8 * min_value       # Compute 80% of the minimum
        return series.fillna(imputed_value)  # Fill missing values with the computed value
    else:
        return series  # Return non-numeric columns unchanged

# Apply the imputation function to each column
df_imputed = df_filtered.apply(impute_with_80_percent_min, axis=0)
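To see how strongly this kind of constant imputation shifts a feature's distribution, you can compare the column means before and after imputation. This is a minimal sketch, assuming the df_filtered and df_imputed DataFrames from the code above:

# Select the numeric feature columns
numeric_cols = df_filtered.select_dtypes(include=['number']).columns

# Compare the per-column means before and after imputation
comparison = pd.DataFrame({
    'mean_before': df_filtered[numeric_cols].mean(skipna=True),
    'mean_after': df_imputed[numeric_cols].mean(),
})
print(comparison.head())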

Data imputation using mean or median value

If missing values occur randomly across the entire data set (e.g., due to issues with data alignment in the data processing software, retention time shifts, etc.), substitution by the column mean or median may be a better choice than imputation by a constant, such as 0, the minimum column value, or a percentage of the minimum column value. This way, the means and medians in the data set will not be significantly shifted. Please see the code block below for replacing missing entries in the columns by the median and mean values, respectively:

# Function to impute missing values with the median of each numeric column
def impute_with_median(series):
    if pd.api.types.is_numeric_dtype(series):  # Check if the column is numeric
        median_value = series.median(skipna=True)  # Calculate the median, ignoring NaN
        return series.fillna(median_value)         # Fill missing values with the median
    else:
        return series  # Return non-numeric columns unchanged

# Function to impute missing values with the mean of each numeric column
def impute_with_mean(series):
    if pd.api.types.is_numeric_dtype(series):  # Check if the column is numeric
        mean_value = series.mean(skipna=True)  # Calculate the mean, ignoring NaN
        return series.fillna(mean_value)       # Fill missing values with the mean
    else:
        return series  # Return non-numeric columns unchanged

# Apply the imputation functions to each column
df_imputed_median = df_filtered.apply(impute_with_median, axis=0)

df_imputed_mean = df_filtered.apply(impute_with_mean, axis=0)
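As a quick check of the claim above, mean imputation leaves the column means unchanged (a minimal sketch; numpy is assumed to be available alongside pandas):

import numpy as np

# Column means ignoring NaNs should equal the means after mean imputation
numeric_cols = df_filtered.select_dtypes(include=['number']).columns
print(np.allclose(df_filtered[numeric_cols].mean(skipna=True),
                  df_imputed_mean[numeric_cols].mean()))  # expected output: True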

Replacing NAs via K-nearest neighbour (KNN)

Another method of replacing missing entries is to estimate them using a model. This is frequently done with the K-nearest neighbour (KNN) algorithm, although other models can also be applied for this purpose. The KNN model estimates a missing value based on the similarity between the affected sample and its neighbouring samples (data points).

from sklearn.impute import KNNImputer
# Select only numeric columns for KNN imputation
numeric_columns = df_filtered.select_dtypes(include=['number'])

# Initialize the KNNImputer
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')  # You can adjust `n_neighbors`

# Perform KNN imputation
imputed_array = knn_imputer.fit_transform(numeric_columns)

# Convert the imputed array back to a DataFrame
df_imputed_knn = pd.DataFrame(imputed_array, columns=numeric_columns.columns, index=numeric_columns.index)

# Combine with non-numeric columns, if any
non_numeric_columns = df_filtered.select_dtypes(exclude=['number'])
df_imputed_knn = pd.concat([non_numeric_columns, df_imputed_knn], axis=1)
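Note that pd.concat above places the non-numeric columns in front, so the column order may differ from df_filtered. If you prefer to restore the original order, one extra line suffices:

# Restore the original column order of df_filtered
df_imputed_knn = df_imputed_knn[df_filtered.columns]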