Filtering out columns containing mostly NAs

Part of the missing values section

As you already know, if a column of a tibble contains mostly missing entries, it can be removed from the data set. Otherwise, we would need to 'create' a significant part of the data by imputation before performing any computations. The percentage of missing values that qualifies a column for removal is usually an arbitrary decision of the statistician performing the data analysis. In metabolomics and lipidomics, columns with 50% or more missing values are usually removed from data sets. A simple way to remove columns with a significant number of missing observations is to use the mutate() function and set these columns to NULL (if only a small number of them needs to be filtered out). Based on the analysis of missing observations made in the previous chapter ('Detecting Missing Values with the DataExplorer R package'), we determined that our data set contains no columns with 50% or more missing values. Therefore, we will now assume that we are interested in filtering out all columns containing 35% or more missing values.

# Six columns contain 35% or more missing entries (detected via DataExplorer).
# These columns are: 
# `PC 32:1`, `PC 34:3`, `LPC 18:2`, `DG 30:0`, `Cer 40:1;O2`, `SM 32:1;O2`.
# We can remove them using the mutate() function:
data.missing.filtered <-
  data.missing %>%
  mutate(`PC 32:1`= NULL,
         `PC 34:3` = NULL,
         `LPC 18:2` = NULL,
         `DG 30:0` = NULL,
         `Cer 40:1;O2` = NULL,
         `SM 32:1;O2` = NULL)
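
As a side note, the same manual removal can be written a bit more compactly with dplyr's select() and negation. This is only an equivalent sketch of the step above (it assumes dplyr/tidyverse is already loaded, as in the rest of this chapter):

# Alternative sketch: remove the same six columns using select() with negation:
data.missing.filtered <-
  data.missing %>%
  select(-c(`PC 32:1`, `PC 34:3`, `LPC 18:2`,
            `DG 30:0`, `Cer 40:1;O2`, `SM 32:1;O2`))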

However, we will show you a more elegant way here, relying on the apply() family of functions (sapply() in this case), and demonstrate how a logical vector can be used to filter a data frame. First, we create a condition. We need to compute the number of NAs in every column, which can be achieved with the is.na() and sum() functions. Next, we check whether a selected column contains fewer NAs than 35% of all samples measured (columns with more than 79 missing observations should be detected). Such a comparison in R returns TRUE or FALSE. To avoid testing this condition for every column by hand, we use the sapply() function, which performs this operation for us. The sapply() function returns a vector or a matrix. If we test the TRUE/FALSE condition on each column of data.missing, we obtain a logical vector of length 129 (where 129 is the number of columns tested). Using this logical vector with 129 entries, we can filter our data frame. The described procedure is shown here:

# Filtering out columns containing mostly missing observations.
# Assumption: filter out columns with 35% or more missing entries. 
# First, let's try to count NAs in a single column. We use:
sum(is.na(data.missing$`LPC 18:2`))

# Explanation:
# is.na() function detects missing values in `LPC 18:2` column of data.missing tibble,
# The sum() function sums them up. 
# In total, we have 91 missing entries.

# Now, let's turn this function into a condition:
sum(is.na(data.missing$`LPC 18:2`)) < 0.35 * nrow(data.missing)

# This results in FALSE, meaning that the column contains 35% or more NAs. 
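# (Optional) You can also check the fraction of NAs in this column directly:
mean(is.na(data.missing$`LPC 18:2`))
# The returned fraction is above 0.35, which is why the condition above is FALSE.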

# To test this condition on every column, we need the sapply() function.
# sapply() will apply our condition to every column in 'data.missing'.
logi.vector <- sapply(data.missing, function(x) sum(is.na(x)) < 0.35*nrow(data.missing))

# Explanation:
# Take 'data.missing' from the global environment.
# Apply function sum(is.na(x)) < 0.35*nrow(data.missing) to all columns via sapply().
# Store the output as 'logi.vector'.
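
Before filtering, you can optionally inspect the logical vector itself, e.g., by tabulating its entries and listing the columns marked for removal (a quick check, not a required part of the workflow):

# Optional check: how many columns will be kept (TRUE) and removed (FALSE)?
table(logi.vector)
# Names of the columns that will be dropped:
names(data.missing)[!logi.vector]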

The output vector contains 129 entries, each TRUE or FALSE. Now, we can use this logical vector to filter the data.missing tibble. We will need to access all columns of our tibble. The simplest way to do so is by using:

data.missing[rows,columns]

In this way, we can select specific rows or columns by their numbers. See the examples below:

# For example...
data.missing[1:2,] 
# ...allows selecting the first two rows of our 'data.missing' tibble.
Selection of first two rows of 'data.missing' tibble via data.missing[1:2,].
# In turn...
data.missing[,1:2] 
# ...allows selecting the first two columns of 'data.missing' tibble.
Selection of first two columns of 'data.missing' tibble via data.missing[,1:2].

To access all columns of data.missing, we could simply use data.missing[,1:129]. Now, we apply logi.vector to all columns of our tibble. A TRUE entry keeps the corresponding column in the tibble, while a FALSE entry removes it:

# Filtering 'data.missing' tibble using 'logi.vector':
data.missing.filtered <- data.missing[,logi.vector]
print(data.missing.filtered)
The 'data.missing' tibble after filtering out 6 columns containing >35% NA.
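
As an optional sanity check (not part of the original script), you can compare the number of columns before and after filtering; six columns should have disappeared:

# Optional sanity check: 129 columns before filtering, 123 after (6 removed).
ncol(data.missing)
ncol(data.missing.filtered)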

To sum up, the whole script could be shortened to this form:

# 1. Create a filtering condition (x stands for a single column; it is applied inside sapply() below):
#    sum(is.na(x)) < 0.35 * nrow(data.missing)

# 2. Test it on every column of 'data.missing' using sapply() and store the output:
logi.vector <- sapply(data.missing, function(x) sum(is.na(x)) < 0.35*nrow(data.missing))

# 3. Apply the 'logi.vector' to all columns and store the filtered tibble:
data.missing.filtered <- data.missing[,logi.vector]
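
If you would like to verify the result of the shortened script, an optional check can confirm that no remaining column reaches the 35% threshold:

# Optional verification: every remaining column has fewer than 35% missing entries.
all(sapply(data.missing.filtered, function(x) sum(is.na(x)) < 0.35 * nrow(data.missing.filtered)))
# This should return TRUE.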

After this step, your tibble is ready for data imputation.

Filtering out columns with predominant NAs using the tidyverse collection

An alternative to the base R sapply() / logical vector approach is filtering out columns with the select() and where() functions. Cleaning the data frame can also be achieved easily with the two simple lines of code below, which rely on a handy tidyverse solution. Please test the code block below and compare both outputs (the one from above and this one).

# First, the threshold for selection is defined.
# For example, keep columns with at least 65% non-NA values:
threshold <- 0.65 * nrow(data.missing)

# Next, we select columns with enough non-missing values:
data.missing.filtered.2 <- data.missing %>% select(where(~ sum(!is.na(.)) >= threshold))

# Step-by-step analysis of the code:
# Take the data.missing data frame,
# Pipe it to select() function,
# The more complex part: where(~ sum(!is.na(.)) >= threshold):
## is.na(.), aka "is NA", checks every value of a column for NA (the output is a logical vector)
## !is.na(.), aka "is not NA", flags the non-missing values (the output is a logical vector, too)
## sum(!is.na(.)) counts the non-missing values by summing up this logical vector
## sum(!is.na(.)) >= threshold forms the condition for selection
## We compare the count against the threshold (e.g., 65% non-missing values)
## where() is a selection helper -- it selects the columns for which this function returns TRUE.

# Together, this keeps only columns with the required proportion of non-NA values.
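
To compare this output with the base R result from above (as suggested earlier), an optional check can confirm that both approaches keep the same columns; this assumes both filtered objects still exist in your session:

# Optional comparison of both filtering approaches:
identical(colnames(data.missing.filtered), colnames(data.missing.filtered.2))
# Or compare the full tibbles:
all.equal(data.missing.filtered, data.missing.filtered.2)
# Both calls should return TRUE if the two approaches removed the same six columns.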
