Data imputation
To install and load the necessary packages and the example dataset, see the previous section in this chapter ("Detecting Missing Values"). We assume that species with high missingness have already been filtered out, as described in "Filtering out columns containing mostly NAs". In addition, KNN imputation requires scikit-learn, which can be installed, for example, with pip install scikit-learn.
The simplest form of imputation is to fill in the missing values with zero, which we apply here to the filtered DataFrame obtained in the previous section "Filtering out columns containing mostly NAs".
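A minimal sketch of zero imputation with pandas is shown here. The toy DataFrame, its column names, and the variable names df_filtered and df_zero are hypothetical stand-ins for the filtered DataFrame from the previous section, not the actual example dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the filtered DataFrame
# (rows = samples, columns = lipid species; values are made up).
df_filtered = pd.DataFrame(
    {
        "PC 34:1": [12.1, np.nan, 10.4, 11.8],
        "SM 36:2": [0.8, 0.6, np.nan, np.nan],
    },
    index=["S1", "S2", "S3", "S4"],
)

# Replace every missing entry with 0; fillna returns a new DataFrame
# and leaves df_filtered unchanged.
df_zero = df_filtered.fillna(0)

# Confirm that no missing values remain.
print(df_zero.isna().sum())
```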
Data imputation for missing values occurring predominantly among low-abundant features
Replacing missing values with 0 can significantly distort the structure of the data by shifting the mean and median. In general, in omics data analysis, substituting many missing entries with a single constant is harmful, so this approach should be used only when few entries are missing in the whole data set. A good example is the situation in which NAs occur mainly in individual columns corresponding to low-abundant lipids or metabolites, where the concentration possibly could not be determined because it fell below the limit of quantitation (LOQ) or even the limit of detection. In such a case, assuming that the concentration of the feature in those samples is <LOQ, we could, for example, replace the missing values in each column with 80% of the lowest concentration observed in that column.
The replacement value is still a constant, but it is specific to each feature and therefore differs from column to column. See the code block below:
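One way this could be sketched in pandas is to compute the minimum of each column and pass 80% of it to fillna, which matches the values to columns by name. The toy data and the names df_filtered, col_min, and df_loq are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the filtered DataFrame.
df_filtered = pd.DataFrame(
    {
        "PC 34:1": [12.1, np.nan, 10.4, 11.8],
        "SM 36:2": [0.8, 0.6, np.nan, np.nan],
    },
    index=["S1", "S2", "S3", "S4"],
)

# Lowest observed concentration per column (NaN is skipped by default).
col_min = df_filtered.min()

# Fill each column's missing entries with 80% of that column's minimum.
# Passing a Series to fillna aligns the fill values on column labels.
df_loq = df_filtered.fillna(0.8 * col_min)

print(df_loq)
```

The factor 0.8 follows the 80% example in the text; any other fraction of the column minimum can be substituted in the same way.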
Data imputation using mean or median value
If missing values occur randomly across the entire data set (e.g., due to issues with data alignment in the data processing software, retention time shifts, etc.), substitution by the column mean or median may be a better choice than imputation by a constant such as 0, the minimum column value, or a percentage of the minimum column value. This way, the means and medians in the data set are not significantly shifted. Please see the code block below for replacing missing entries in columns by the mean and median values, respectively:
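A minimal sketch of both variants, assuming that all columns of the filtered DataFrame are numeric. The toy data and the names df_filtered, df_mean, and df_median are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the filtered DataFrame.
df_filtered = pd.DataFrame(
    {
        "PC 34:1": [12.1, np.nan, 10.4, 11.8],
        "SM 36:2": [0.8, 0.6, np.nan, np.nan],
    },
    index=["S1", "S2", "S3", "S4"],
)

# Column-wise mean imputation: each NaN is replaced by the mean
# of the observed values in its own column.
df_mean = df_filtered.fillna(df_filtered.mean())

# Column-wise median imputation, analogous to the above.
df_median = df_filtered.fillna(df_filtered.median())

print(df_mean)
print(df_median)
```

If the DataFrame also holds non-numeric columns (for example, sample labels), the means and medians would need to be computed on the numeric columns only, e.g. with numeric_only=True.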
Replacing NAs via K-nearest neighbour (KNN)
Another way to replace missing entries is to estimate them using a model. This is frequently done with the K-nearest neighbour (KNN) model, although other models can also be applied for this purpose. The KNN model estimates each missing value from the values of the most similar neighbouring samples (data points).
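KNN imputation could be sketched with scikit-learn's KNNImputer as follows. The toy data, the variable names, and the choice of n_neighbors=2 are assumptions for illustration; a suitable number of neighbours depends on the data set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data standing in for the filtered DataFrame
# (rows = samples, columns = lipid species; values are made up).
df_filtered = pd.DataFrame(
    {
        "PC 34:1": [12.1, np.nan, 10.4, 11.8, 12.5],
        "SM 36:2": [0.8, 0.6, np.nan, 0.7, 0.9],
        "TG 52:2": [30.2, 28.9, 27.5, np.nan, 31.0],
    },
    index=["S1", "S2", "S3", "S4", "S5"],
)

# Each missing value is estimated as the average of that feature in the
# n_neighbors most similar samples; similarity is a Euclidean distance
# computed over the features observed in both samples.
imputer = KNNImputer(n_neighbors=2)  # n_neighbors=2 is an arbitrary choice
imputed = imputer.fit_transform(df_filtered)

# fit_transform returns a NumPy array, so restore the row and column labels.
df_knn = pd.DataFrame(
    imputed, columns=df_filtered.columns, index=df_filtered.index
)

print(df_knn)
```

Because the neighbours are found via distances between samples, features with large absolute values dominate the similarity, so scaling the columns before imputation may be worth considering.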