Filtering out columns containing mostly NAs
A part of missing values section
As you already know, if a column of a tibble contains mostly missing entries, it can be removed from the data set. Otherwise, we would need to 'create' by imputation a significant part of the data before performing any computations. The assumption on the percentage of missing values in a column qualifying it to be filtered out is usually an arbitrary decision of a statistician performing the data analysis. In metabolomics and lipidomics, the columns with 50% or more missing values are usually removed from data sets. A simple method to remove columns with a significant amount of missing observations would be to use mutate()
function and set these columns to NULL (if a small number of them is to be filtered). Based on the analysis of missing observations made in the previous chapter ('Detecting Missing Values with DataExplorer /R package'), we have determined that in our dataset, there are no columns with 50% or more missing values. Therefore, now we can assume that we would be interested in filtering out all columns that contain 35% or more missing values.
However, we will show you here a more elegant way relying on apply()
family of functions (sapply()
in this case) and how the logical vector can be used to filter a data frame. Firstly, we create a condition. We need to compute the number of NAs in every column. It can be achieved through is.na()
and sum()
functions. Next, we will compare if a selected column contains more or less NAs than 35% of all samples measured (columns with more than 79 missing observations should be detected). Such a comparison in R will return a TRUE or FALSE result. To avoid testing this condition for every column, we will use sapply()
function that performs this operation for us. The sapply()
function returns a vector or matrix. If we test the TRUE/FALSE condition on each column of data.missing
, we produce a logical vector having dimensions 1x129 (where 129 is the number of all columns tested). Using the logical vector with 129 entries, we can filter our data frame. Described procedure is shown here:
The output vector contains 129 columns with TRUE or FALSE entries. Now, we can use this logical vector for filtering the data. missing
tibble. We will need to access all columns of our tibble. The simplest method to do so is simply by using:
data.missing[rows,columns]
In this way, we can select specific rows or columns using their number. See the example below:
To access all columns of data.missing
, we could simply use data.missing[,1:129]
. Now, we will apply the logi.vector
to all columns of our tibble. All TRUE entries will keep the column in the tibble, while the FALSE entries will remove the column:
To sum up, the whole script could be shortened to this form:
After this step, your tibble is ready for the data imputation.
Filtering out columns with predominant NAs using tidyverse collection
An alternative to the base R sapply()
/ logi.vector
is the filtering out columns through the select(where(...))
functions. Cleaning the data frame can also easily be achieved using the two simple lines of code below, which rely on a handy tidyverse solution. Please test the code block below and compare both outputs (from above and this one).
Last updated