Missing values – Introduction

A part of missing values section

The data should be checked for missing values (‘NA’, ‘NaN’, empty entries in columns). Empty entries are a common issue in mass spectrometry-derived datasets. While statistical methods robust to missing values exist, imputation remains the prevailing practice in lipidomics and metabolomics. Missing values can complicate data transformation, hypothesis testing, the computation of descriptive statistics, and multivariate analyses, and many R or Python functions require complete input data. Additionally, a common practice is to apply a logarithmic transformation to right-skewed data (e.g., skewed by high outliers). This renders the distribution more symmetric, approximating normality. Log-transformation becomes problematic when the 'NA' values are simply substituted by '0', because log(0) is undefined.
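As a minimal sketch of the two points above, the snippet below counts missing entries per column and shows what happens when zero-filled data are log-transformed. The toy table and its column names are made up for illustration and are not taken from any dataset in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical concentration table: rows are samples, columns are lipids.
df = pd.DataFrame({
    "PC 34:1": [120.5, np.nan, 98.2, 110.0],
    "TG 52:2": [15.1, 14.8, np.nan, np.nan],
})

# Number of missing entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Substituting NA by 0 and log-transforming produces -inf, not a usable value.
with np.errstate(divide="ignore"):
    log_zero_filled = np.log10(df.fillna(0))
print(log_zero_filled)
```

Every cell that was filled with zero becomes `-inf` after the transformation, which most downstream statistical functions cannot handle.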

If many missing values are detected and they are distributed randomly throughout the data frame, one should consider inspecting the raw data, the acquisition method, and the processing method, and repeating the analysis and/or data processing if necessary.

There are many reasons for missing values in omics data sets. Missing entries can originate from uncontrolled biological effects or from improper sample storage, collection strategy, preservation, preparation, or normalization of sample amount. Additionally, technical issues related to mass spectrometer settings or measurement methods may arise, such as analyte ionization problems, ionization suppression from other analytes or the matrix, inefficient ion transfer to the mass analyzer, or instrument tuning and calibration problems. In LC-based approaches, improper chromatographic settings or performance issues can lead to shifts in analyte retention times, chromatographic peak deformations affecting quantitation, and co-elutions impacting lipidome (metabolome) coverage and quantitation. Appropriate settings of the data processing software are also crucial, including the signal-to-noise threshold, correct peak picking, mathematical corrections, alignment, etc. Finally, incomplete patient medical records may contribute to missing values within clinical variables.

In the literature, three types of missing values are currently distinguished: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). This classification is based on two types of variables: observed variables (i.e., measurements available in the dataset) and unobserved variables (i.e., measurements missing from the dataset). MCAR values are the effect of a purely random event unrelated to observed or unobserved variables, e.g., vials with samples were damaged and the material was lost, or a power loss resulted in missing measurements at the end of the sequence. A classic, often-mentioned example of MAR values (dependent on observed data) is two co-eluting analytes and their faulty deconvolution, i.e., a mathematical attempt to resolve overlapped peaks into their components. MNAR values are predominantly related to lipids or metabolites with abundance below the limit of detection (< LOD) of the analytical method used.

If specific samples contain a significant percentage of missing values, they can be filtered out from the entire data set. Further, it is common practice to remove lipids or metabolites for which more than a defined fraction of measurements is missing, e.g., 35-50%. A key consideration is that when different types of missing values, such as MCAR (MAR) and MNAR, are present in a dataset, a single imputation method often performs well for one type but not for both. Moreover, MNAR values are often the most challenging to impute, and using observed data to replace them can introduce additional bias.
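The filtering step described above can be sketched as follows. The function name `filter_by_missingness`, the toy data, and the 50% sample threshold are assumptions chosen for illustration; the 35% feature threshold is the lower end of the range mentioned in the text.

```python
import numpy as np
import pandas as pd

def filter_by_missingness(df, max_sample_missing=0.5, max_feature_missing=0.35):
    """Drop samples (rows), then features (columns), above a missing-value fraction."""
    # Fraction of missing entries in each sample (row-wise mean of the NA mask).
    sample_missing = df.isna().mean(axis=1)
    kept = df.loc[sample_missing <= max_sample_missing]
    # Fraction of missing entries in each feature, computed on the remaining samples.
    feature_missing = kept.isna().mean(axis=0)
    return kept.loc[:, feature_missing <= max_feature_missing]

# Hypothetical table: rows are samples, columns are lipids.
data = pd.DataFrame(
    {
        "Cer 42:1": [1.2, 1.1, np.nan, 1.4, 1.3],
        "LPC 18:0": [np.nan, np.nan, np.nan, 8.9, np.nan],
        "SM 34:1": [30.2, 29.8, np.nan, 31.0, 28.7],
    },
    index=["S1", "S2", "S3", "S4", "S5"],
)

filtered = filter_by_missingness(data)
print(filtered)
```

Samples are filtered before features here so that a single failed injection does not inflate the missing fraction of every lipid. The opposite order is equally defensible and may give a different result.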

Different strategies exist for dealing with missing values in lipidomics and metabolomics. Many reviews and research articles are dedicated to missing value imputation in proteomics, lipidomics, and metabolomics. We recommend selected ones in the literature section at the end of this chapter. Among the most often mentioned methods are:

replacing missing values with a constant, e.g., zero value, column mean, median, a percentage of lowest concentration in a column,
kNN-based approaches,
random forest-based approaches.

For columns containing low-abundant lipids or metabolites, zero outputs are often replaced by a feature-specific constant, such as a concentration close to the limit of quantitation or a percentage of the lowest concentration of that feature (e.g., 80% of the lowest concentration in a column). One should be aware that this method risks introducing bias, e.g., by influencing the mean (median) concentration and the standard deviation. This approach was used, e.g., in the following studies:

Wolrab et al. Lipidomic profiling of human serum enables detection of pancreatic cancer
Idkowiak et al. Robust and high-throughput lipidomic quantitation of human blood samples using flow injection analysis with tandem mass spectrometry for clinical use
Peterka & Maccelli et al. HILIC/MS quantitation of low-abundant phospholipids and sphingolipids in human plasma and serum: Dysregulation in pancreatic cancer
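The constant-replacement strategy described above (a fraction of the lowest observed concentration per feature) can be sketched in a few lines. This is an illustrative example, not the code used in the cited studies; the function name `impute_fraction_of_min` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def impute_fraction_of_min(df, fraction=0.8):
    """Fill missing entries in each column with `fraction` times that column's lowest observed value."""
    # Column-wise minimum over observed (non-missing) values only.
    fill_values = df.min(axis=0, skipna=True) * fraction
    # fillna with a Series matches the fill values to columns by label.
    return df.fillna(fill_values)

# Hypothetical table: rows are samples, columns are low-abundant lipids.
data = pd.DataFrame({
    "PI 38:4": [2.0, np.nan, 3.5, 2.5],
    "PS 36:1": [np.nan, 0.5, 0.9, np.nan],
})

imputed = impute_fraction_of_min(data)
print(imputed)
```

If the processing software reports zeros rather than empty cells, they would first need to be converted to missing values, for example with `data.replace(0, np.nan)`. Otherwise zero itself becomes the column minimum.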

As reported by several studies, random forest and k-nearest neighbor (kNN) models were consistently optimal for MCAR and MAR in lipidomics and metabolomics datasets. Frölich et al. found that kNN is also an optimal approach for MNAR in shotgun lipidomics data. QRILC, i.e., quantile regression imputation of left-censored data, was also proposed for MNAR in studies published by Wei et al. and is mentioned by Frölich et al. in their manuscript.
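For orientation, a kNN-based imputation could look like the sketch below, which assumes scikit-learn is installed and uses its `KNNImputer` class. The toy data and the choice of `n_neighbors=2` are illustrative assumptions, not the settings evaluated in the cited studies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical table: rows are samples, columns are lipids.
data = pd.DataFrame(
    {
        "PC 34:1": [100.0, 102.0, np.nan, 150.0, 148.0],
        "PC 36:2": [50.0, 51.0, 50.5, 80.0, np.nan],
        "SM 34:1": [30.0, 31.0, 30.5, 45.0, 44.0],
    },
    index=["S1", "S2", "S3", "S4", "S5"],
)

# Each missing value is replaced by the mean of that feature in the
# n_neighbors most similar samples that have the feature observed.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(data), index=data.index, columns=data.columns
)
print(imputed)
```

Because the neighbors are found by a Euclidean-type distance, features measured on very different scales can dominate the result, so scaling or log-transforming the data beforehand is worth considering. The number of neighbors is a tuning parameter.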

For missing values within numeric clinical variables, substitution by mean or median value within a group can be used.
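A group-wise median substitution can be sketched with pandas as shown below. The column names `group` and `bmi` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical clinical table with one numeric variable containing gaps.
clinical = pd.DataFrame({
    "group": ["control", "control", "control", "patient", "patient", "patient"],
    "bmi": [22.0, np.nan, 24.0, 30.0, 27.0, np.nan],
})

# Median of the observed values within each group, aligned to the original rows.
group_median = clinical.groupby("group")["bmi"].transform("median")

# Keep observed values; fill the gaps with the median of the matching group.
clinical["bmi_imputed"] = clinical["bmi"].fillna(group_median)
print(clinical)
```

Writing the result to a new column keeps the original values available, so imputed entries remain identifiable later in the analysis.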

PCA and statistical tests robust to missing values (advanced)

The manuscript and GitBook present statistical and machine learning solutions in R and Python, which in most cases require complete lipidomics (metabolomics) datasets, i.e., data that have undergone missing value imputation. As mentioned above, imputation is also the usual practice for lipidomics and metabolomics datasets. Alternatively, some libraries handle missing values by removing rows with empty entries, such as the rstatix library, which we recommend for hypothesis testing.

It is worth noting that PCA and statistical tests robust to missing values exist, e.g., sequential projection pursuit PCA (sppPCA), which defines principal components in the presence of missing values, or IMD-ANOVA, which combines a G-test evaluating the independence of missing data (IMD) with an analysis of variance (ANOVA). The value of these methods stems from their ability to avoid the pitfalls of data imputation. Both methods are presented in the following outstanding manuscripts:

It should be noted that the mixOmics library, containing methods such as (s)PLS, (s)PLS-DA, and (s)PCA, uses the NIPALS algorithm to handle missing values. We present this solution in the chapter dedicated to multivariate statistical analysis, specifically in the case of PCA and PLS analysis. For more information, please visit the FAQ of mixOmics:

FAQ of mixOmics: the 'Pre-Processing' section explains how the library deals with missing entries.
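To give an intuition for how a NIPALS-type algorithm can work around missing entries, the sketch below estimates only the first principal component and simply leaves missing cells out of every sum. It is a simplified illustration written for this page, not the mixOmics implementation. The function name `nipals_first_component` is hypothetical, deflation for further components is omitted, and every row and column is assumed to contain at least one observed value.

```python
import numpy as np

def nipals_first_component(X, max_iter=500, tol=1e-9):
    """Estimate the first principal component of X, skipping missing (NaN) entries."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    weights = observed.astype(float)
    # Center each column on its observed values; set missing cells to zero
    # so that they contribute nothing to the sums below.
    centered = np.where(observed, X - np.nanmean(X, axis=0), 0.0)
    # Start from the column with the largest sum of squares.
    scores = centered[:, np.argmax((centered ** 2).sum(axis=0))].copy()
    for _ in range(max_iter):
        # Regress each column on the scores, using only rows where it is observed.
        loadings = (centered.T @ scores) / (weights.T @ scores ** 2)
        loadings /= np.linalg.norm(loadings)
        # Regress each row on the loadings, using only columns where it is observed.
        new_scores = (centered @ loadings) / (weights @ loadings ** 2)
        converged = np.linalg.norm(new_scores - scores) <= tol * np.linalg.norm(new_scores)
        scores = new_scores
        if converged:
            break
    return scores, loadings

# Hypothetical example: simulated data with two entries removed.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X[0, 1] = np.nan
X[5, 3] = np.nan
scores, loadings = nipals_first_component(X)
print(loadings)
```

For complete data, this iteration converges to the same first component as a standard PCA, up to sign. With missing entries, each score and loading is estimated from the available values only, so no imputation step is needed beforehand.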