Filtering out columns containing mostly NAs

If a column (lipid species) in a dataset contains a high proportion of missing values, it is often removed from the dataset. Retaining such a column would require imputing a large share of its values before any computation, which introduces uncertainty. The cutoff for the percentage of missing values at which a column is removed is typically set by the statistician conducting the analysis and is often subjective. In metabolomics and lipidomics, cutoffs between 35% and 65% missing values (most commonly 50%) are frequently used to decide which columns to exclude.

Based on the analysis of missing observations in the previous section ("Detecting Missing Values"), we determined that no species (columns) in our dataset have 50% or more missing values, so the common 50% cutoff would remove nothing. For this demonstration we therefore filter out all columns with 35% or more missing values.
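To make the choice of cutoff less abstract, the following minimal sketch counts how many columns several candidate thresholds would remove. It uses a small toy DataFrame with made-up values rather than the example dataset; the names toy, missing_fraction and dropped_counts are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): rows are samples, columns are lipid species.
toy = pd.DataFrame({
    "PC 34:1": [1.2, 0.9, 1.1, 1.0, 1.3],            # 0% missing
    "PC 34:3": [np.nan, 0.4, np.nan, 0.5, 0.6],      # 40% missing
    "DG 30:0": [np.nan, np.nan, np.nan, 0.2, 0.3],   # 60% missing
})

# Fraction of missing values in each column
missing_fraction = toy.isna().mean()

# Count the columns that each candidate threshold would remove
dropped_counts = {}
for candidate in (0.35, 0.50, 0.65):
    dropped_counts[candidate] = int((missing_fraction >= candidate).sum())
    print(f"threshold {candidate:.2f}: {dropped_counts[candidate]} column(s) dropped")
```

Running the same loop on the real DataFrame (replacing toy with df) would show how sensitive the number of retained species is to the chosen cutoff.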

To install and load the necessary packages and the example dataset, see the previous section ("Detecting Missing Values"). We can then remove the species with 35% or more missing values as follows:

# Define the threshold
threshold = 0.35  # 35% missing values

# Fraction of missing values in each column (species), computed once
missing_fraction = df.isnull().mean()

# Identify columns to drop
columns_to_drop = df.columns[missing_fraction >= threshold]

# Keep only the columns below the threshold
df_filtered = df.loc[:, missing_fraction < threshold]

# Print the column names that were dropped
print(f"Columns dropped due to missing values ({threshold:.0%} or more):")
print(columns_to_drop.tolist())

Output:

Columns dropped due to missing values (35% or more):
['DG 30:0', 'Cer 40:1;O2', 'PC 32:1', 'PC 34:3', 'LPC 18:2', 'SM 32:1;O2']

This gives us a new DataFrame, df_filtered, which contains only the species with less than 35% missing values across the samples. Six species were dropped.
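If the same filtering needs to be repeated with different cutoffs, it can be wrapped in a small helper. The sketch below is one possible way to do this, not the only one; drop_sparse_columns is a hypothetical function name, and the toy data are made up so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(data, threshold):
    """Return the filtered DataFrame and the list of dropped column names.

    A column is dropped when its fraction of missing values is
    greater than or equal to `threshold`.
    """
    missing_fraction = data.isna().mean()
    dropped = missing_fraction[missing_fraction >= threshold].index.tolist()
    return data.drop(columns=dropped), dropped

# Toy data (hypothetical values): rows are samples, columns are lipid species.
toy = pd.DataFrame({
    "PC 34:1": [1.0, 1.1, 0.9, 1.2],             # 0% missing
    "SM 32:1;O2": [np.nan, 0.3, np.nan, 0.4],    # 50% missing
    "Cer 40:1;O2": [0.7, np.nan, 0.8, 0.6],      # 25% missing
})

toy_filtered, toy_dropped = drop_sparse_columns(toy, threshold=0.35)

# Sanity checks: every kept column is below the threshold,
# and kept plus dropped columns add up to the original count.
print(toy_dropped)                                   # ['SM 32:1;O2']
print(bool((toy_filtered.isna().mean() < 0.35).all()))
print(toy_filtered.shape[1] + len(toy_dropped) == toy.shape[1])
```

Returning the dropped names alongside the filtered DataFrame keeps a record of which species were excluded, which is useful when reporting the preprocessing steps.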

