> For the complete documentation index, see [llms.txt](https://laboratory-of-lipid-metabolism-a.gitbook.io/omics-data-visualization-in-r-and-python/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://laboratory-of-lipid-metabolism-a.gitbook.io/omics-data-visualization-in-r-and-python/missing-values-handling-in-r/filtering-out-columns-containing-mostly-nas.md).

# Filtering out columns containing mostly NAs

As you already know, if a column of a tibble contains mostly missing entries, it can be removed from the data set. Otherwise, we would need to 'create' by imputation a significant part of the data before performing any computations. The assumption on the percentage of missing values in a column qualifying it to be filtered out is usually an arbitrary decision of a statistician performing the data analysis. In metabolomics and lipidomics, the columns with 50% or more missing values are usually removed from data sets. A simple method to remove columns with a significant amount of missing observations would be to use `mutate()` function and set these columns to NULL (if a small number of them is to be filtered). Based on the analysis of missing observations made in the previous chapter ('Detecting Missing Values with DataExplorer /R package'), we have determined that in our dataset, there are no columns with 50% or more missing values. Therefore, now we can assume that we would be interested in filtering out all columns that contain 35% or more missing values.

```r
# 6 columns contain more than 35% missing entries (Detection via DataExplorer).
# These columns are: 
#`PC 32:1`, `PC 34:3`, `LPC 18:2`, `DG 30:0`, `Cer 40:1;O2`, `SM 32:1;O2`.
# We can remove them using mutate() function:
data.missing.filtered <-
  data.missing %>%
  mutate(`PC 32:1`= NULL,
         `PC 34:3` = NULL,
         `LPC 18:2` = NULL,
         `DG 30:0` = NULL,
         `Cer 40:1;O2` = NULL,
         `SM 32:1;O2` = NULL)
```

However, we will show you here a more elegant way relying on `apply()` family of functions (`sapply()` in this case) and how the logical vector can be used to filter a data frame. Firstly, we create a condition. We need to compute the number of NAs in every column. It can be achieved through `is.na()` and `sum()` functions. Next, we will compare if a selected column contains more or less NAs than 35% of all samples measured (columns with more than 79 missing observations should be detected). Such a comparison in R will return a TRUE or FALSE result. To avoid testing this condition for every column, we will use `sapply()` function that performs this operation for us. The `sapply()` function returns a vector or matrix. If we test the TRUE/FALSE condition on each column of `data.missing`, we produce a logical vector having dimensions 1x129 (where 129 is the number of all columns tested). Using the logical vector with 129 entries, we can filter our data frame. Described procedure is shown here:

```r
# Filtering out columns containing mostly missing observations.
# Assumption: filter out columns with 35% or more missing entries. 
# First, let's try to count NAs in a single column. We use:
sum(is.na(data.missing$`LPC 18:2`))

# Explanation:
# is.na() function detects missing values in `LPC 18:2` column of data.missing tibble,
# The sum() function sums them up. 
# In total, we have 91 missing entries.

# Now, let's turn this function into a condition:
sum(is.na(data.missing$`LPC 18:2`)) < 0.35 * nrow(data.missing)

# This results in 'FALSE', meaning that the column contains >35% NAs. 

# To test this condition on every column, we need sapply() function.
# sapply() will apply our condition to every column in 'data.missing'.
logi.vector <- sapply(data.missing, function(x) sum(is.na(x)) < 0.35*nrow(data.missing))

# Explanation:
# Take 'data.missing' from the global environment.
# Apply function sum(is.na(x)) < 0.35*nrow(data.missing) to all columns via sapply().
# Store the output as 'logi.vector'.
```

The output vector contains 129 columns with TRUE or FALSE entries. Now, we can use this logical vector for filtering the `data. missing` tibble. We will need to access all columns of our tibble. The simplest method to do so is simply by using:&#x20;

**data.missing\[rows,columns]**

In this way, we can select specific rows or columns using their number. See the example below:

```r
# For example...
data.missing[1:2,] 
# ...allows selecting the first two rows of our 'data.missing' tibble.
```

<figure><img src="/files/kwCF6FL8Ej1FSWh2T5eR" alt=""><figcaption><p>Selection of first two rows of 'data.missing' tibble via data.missing[1:2,].</p></figcaption></figure>

```r
# In turn...
data.missing[,1:2] 
# ...allows selecting the first two columns of 'data.missing' tibble.
```

<figure><img src="/files/G5MJzYwIAFWN7S4tnOzP" alt=""><figcaption><p>Selection of first two columns of 'data.missing' tibble via data.missing[,1:2].</p></figcaption></figure>

To access all columns of `data.missing`, we could simply use `data.missing[,1:129]`. Now, we will apply the `logi.vector` to all columns of our tibble. All TRUE entries will keep the column in the tibble, while the FALSE entries will remove the column:

```r
# Filtering 'data.missing' tibble using 'logi.vector':
data.missing.filtered <- data.missing[,logi.vector]
print(data.missing.filtered)
```

<figure><img src="/files/n1irMPRgITiFLWV37xtF" alt=""><figcaption><p>The 'data.missing' tibble after filtering out 6 columns containing >35% NA.</p></figcaption></figure>

To sum up, the whole script could be shortened to this form:

```r
# 1. Create a filtering condition: 
sum(is.na(x)) < 0.35*nrow(data.missing)

# 2. Test it on every column of 'data.missing' using sapply() and store the output:
logi.vector <- sapply(data.missing, function(x) sum(is.na(x)) < 0.35*nrow(data.missing))

# 3. Apply the 'logi.vector' to all columns and store the filtered tibble:
data.missing.filtered <- data.missing[,logi.vector]
```

After this step, your tibble is ready for the data imputation.

## Filtering out columns with predominant NAs using tidyverse collection

An alternative to the base R `sapply()` / `logi.vector` is the filtering out columns through the `select(where(...))` functions. Cleaning the data frame can also easily be achieved using the two simple lines of code below, which rely on a handy *tidyverse* solution. Please test the code block below and compare both outputs (from above and this one).

```r
# First, the threshold for selection is defined.
# For example, keep columns with at least 65% non-NA values:
threshold <- 0.65 * nrow(data.missing)

# Next, we select columns with enough non-missing values:
data.missing.filtered.2 <- data.missing %>% select(where(~ sum(!is.na(.)) >= threshold))

# Step-by-step analysis of the code:
# Take the data.missing data frame,
# Pipe it to select() function,
# The more complex part: where(~ sum(!is.na(.)) >= threshold):
## is.na(.), aka "is NA", checks for NAs across columns (output is a logi vector)
## !is.na(.) aka "is not NA",  counts non-missing values (output is a logi vector, too)
## sum(!is.na(.)) sums up all the non-missing values
## sum(!is.na(.)) >= threshold) forms a condition for selection
## We compare the sum against the threshold (e.g., 65% non-missing)
## where() is just a selection helper -- selects variables with a function.

# Together, this keeps only columns with the required proportion of non-NA values.
```


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://laboratory-of-lipid-metabolism-a.gitbook.io/omics-data-visualization-in-r-and-python/missing-values-handling-in-r/filtering-out-columns-containing-mostly-nas.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
