Data wrangling syntaxes useful in OMICs mining

Useful tricks and features in OMICs mining

In the previous subchapters, we showed how to modify data format and build pipelines. Now, it is time to move to data-wrangling verbs provided by the dplyr library (tidyverse collection). Data wrangling holds significance in the field of OMICs analysis as information seldom arrives in a format perfectly tailored for specific analyses.

Using single-word commands, you can select or rearrange columns and rows and then group and summarize them. For example, while computing descriptive statistics for a lipidomics or metabolomics data set, we can:

1) Select columns for comparisons: a factor column (our biological group) and numeric columns (with lipid or metabolite concentrations). This way, we separate interesting bits of our data from the entire data frame, which frequently contains additional information, e.g., sample names, batch numbers, clinical parameters, patient code, etc.

2) If necessary, we can filter the data by rows, e.g., to remove for computations QC samples, blanks, system standards, or even remove an entire group of samples that we do not want to analyze/compare.

3) We can arrange the samples in a specific way to examine the content after filtrations and selections.

4) We can rearrange classic wide tibbles into long ones.

5) In long tibble, we can group entries for each lipid and metabolite for further analysis.

6) Finally, we can summarize our data for each biological group separately.

You can create a new column if necessary, e.g., perform a log transformation and store the values in a new column or compute a ratio of two values.

Using pipes, you can pass data from one function to the other.

Here, we will show you how to use the following functions:

select()
filter()
mutate()
across() & where()
group_by()
arrange()
slice()

Some of these functions were used already in the previous examples. To use all of them, we call again tidyverse collection:

# Calling library
library(tidyverse)

select()

The function select() pulls a selected column from a data frame. Columns can be selected by their name or column numbers, as in the example below:

# Selecting columns by name from the 'data' and storing as 'data.selected':
data.selected.1 <-
  data %>% 
  select(`Sample Name`,
         `Label`, 
         `LPC 16:0`,
         `LPC 18:0`,
         `LPC 18:1`,
         `LPC 18:2`,
         `SM 32:1;O2`,
         `SM 39:1;O2`,
         `SM 41:1;O2`)
         
# In these lines, we:         
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to select(),
# 3. Keep only columns selected using column names,
# 4. Store the output as 'data.selected.1'.
                           
print(data.selected.1)
         
# Selecting columns by column number from the 'data' and storing as 'data.selected':
data.selected.2 <-
  data %>% 
  select(1:3, 65, 125:129)
  
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to select(),
# 3. Keep only columns selected by column number,
# 4. Store the output as 'data.selected.2'.

print(data.selected.2)

We obtain two following outputs:

More examples and information can be found here:

filter()

The function filter() is used to keep rows that fulfill a condition provided as an argument to this function. First, you have to learn what relational and logical operators can be used to formulate a condition:

# Relational operators used to formulate conditions:
# 1. == equal to
# 2. != is not equal to
# 3. > is greater than
# 4. < is less than
# 5. >= is greater than and equal to
# 6. <= is less than and equal to

# Logical operators to formulate conditions:
# 1. & AND 
# 2. | OR
# 3. ! NOT

# Additionally, it is good to know the arithmetic operators:
# 1. + addition 
# 2. - subtraction
# 3. ^ exponent 
# 4. * multiplication
# 5. / division
# 6. %% modulus

# Assignment operators (reminder):
# 1. = e.g. data = read_xlsx(file.choose())
# 2. <- e.g. data <- read_xlsx(file.choose()) - leftwards assignment (at the beginning)
# 3. -> e.g. read_xlsx(file.choose()) -> data - rightwards assignment (at the end)

Now, we will use relational and logical operators to create exemplary filters. See the code block below:

# Filtering data with filter() function based on a condition:

# EXAMPLE 1
data.filtered.1 <-
  data %>% 
  filter(Label == 'N' | Label == 'T')
  
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to filter(),
# 3. Keep only rows containing in the Label column 'N' OR 'T',
# 4. Store the output as 'data.filtered.1'.
  
# To check the effect of the filtration run:
summary(data.filtered.1)

# As you see, even though we filtered all 'PAN' out, the PAN factor is remembered.
# To completely remove the 'PAN' factor from the data frame you need to run additionally:
data.filtered.1 <- data.filtered.1 %>% droplevels()
summary(data.filtered.1)
# The droplevels() function from base R drops unused factors in data frames.

# EXAMPLE 2
data.filtered.2 <- 
  data %>% 
  filter(`CE 16:1` > 1000)
  
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to filter(),
# 3. Keep only rows containing the concentration of CE 16:1 over 1000,
# 4. Store the output as 'data.filtered.2'.

# Printing new tibble obtained after filtration of 'data':
print(data.filtered.2)

# EXAMPLE 3
data.filtered.3 <- 
  data %>% 
  filter(`CE 16:1` >= median(`CE 16:1`, na.rm = TRUE))
  
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to filter(),
# 3. Keep only rows with a concentration of CE 16:1 higher than a median in this column,
# 4. Store the output as 'data.filtered.3'.
# 5. If we set the argument na.rm to TRUE, missing values will be removed.

median(data$`CE 16:1`)
# Median is equal to 538.9805. All values above or equal will be retained.
# Recheck of the filtration process:
summary(data.filtered.3)

# EXAMPLE 4
data.filtered.4 <-
  data %>% 
  filter(`CE 16:1` > median(`CE 16:1`) & Label != 'N')

# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to filter(),
# 3. Keep rows with a concentration of CE 16:1 higher than 1000 and `Label` is not N,
# 4. Store the output as 'data.filtered.4'.
# Recheck of the filtration process:
summary(data.filtered.4)

The output of each of these lines is summarized below:

For more examples and explanations, go to:

mutate()

The mutate() function is used to create, modify, and delete columns in a data frame. Examples of its application are in the code block below:

# Applications of mutate() function
# EXAMPLE 1a - creating a new column
data.mutated.1a <- 
  data %>%
  select(`Sample Name`, `Label`, `LPC 18:1`, `Cer 41:1;O2`, `SM 41:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  mutate(`LPC 18:1 [pmol/mL]` = 1000 * `LPC 18:1`)

# In these lines of code, we:
# 1. Take the 'data' from the global environment,
# 2. Push it through the pipe, 
# 3. Select certain columns from the entire data set,
# 4. Push it through the pipe,
# 5. Filter to keep only rows for which column `Label` contains 'N' OR 'T',
# 6. Push it through the pipe,
# 7. We use mutate() to create one column: conc. of LPC 18:1 expressed in [pmol/mL].

print(data.mutated.1a)

# EXAMPLE 1b - creating new columns (more than one)
data.mutated.1b <- 
  data %>%
  select(`Sample Name`, `Label`, `LPC 18:1`, `Cer 41:1;O2`, `SM 41:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  mutate(`Log10(LPC 18:1)` = log10(`LPC 18:1`),
         `Log2(Cer 41:1;O2)` = log(`Cer 41:1;O2`,2),
         `Cer 41:1;O2 to SM 41:1;O2` = `Cer 41:1;O2`/`SM 41:1;O2`)

# In these lines of code, we:
# 1. Take the 'data' from the global environment,
# 2. Push it through the pipe, 
# 3. Select certain columns from the entire data set,
# 4. Push it through the pipe,
# 5. Filter to keep only rows for which column `Label` contains 'N' OR 'T',
# 6. Push it through the pipe,
# 7. We use mutate() to create three new columns:
#   1) `Log10(LPC 18:1)`, which is log10 of all concentrations of LPC 18:1,
#   2) `Log2(Cer 41:1;O2)`, which is log2 of all concentrations of Cer 41:1;O2,
#   3) `Cer 41:1;O2 to SM 41:1;O2`, which is ratio of Cer 41:1;O2 to SM 41:1;O2 conc.

print(data.mutated.1b)

And the modified tibble:

# EXAMPLE 2a - modifying content of columns:
data.mutated.2a <- 
  data %>%
  select(`Sample Name`, `Label`, `LPC 18:1`, `Cer 41:1;O2`, `SM 41:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  mutate(`LPC 18:1` = 1000 * `LPC 18:1`) 

# In these lines of code we:
# 1. Take the 'data' from the global environment,
# 2. Push it through the pipe, 
# 3. Select certain columns from the entire data set,
# 4. Push it through the pipe,
# 5. Filter to keep only rows for which column `Label` contains 'N' OR 'T',
# 6. Push it through the pipe,
# 7. We use mutate() to express the concentration of LPC 18:1 in [pmol/mL].
# Here, we modify the `LPC 18:1` by multiplying the concentration in the column by 1000, 
# No new column was created.

And the output:

The mutate() function can be used for more complex modifications, including so-called 'reordering factor levels'. 'Reordering factor levels' sounds difficult, but it is a simple operation. Suppose you have three biological groups annotated as 'Healthy', 'Cancer', and 'Therapy'. By default, R will rely on the alphabetic order of factors. It means that when you would plot box plots or bar plots, they would automatically appear in the alphabetic order in the chart: the first group is 'Cancer', then 'Healthy', and finally, 'Therapy'. However, the alphabetic order is not how we want to arrange the groups in the plot. This is where we need to relevel factors and specify the 'Healthy' group should be the first one, the 'Cancer' should be the second one, and the 'Therapy' should be the last one. Except for the dplyr package, we will need the fct_relevel() function from the forcats package for such an operation. As a reminder, the forcats library contains tools for working with factors. Except for plotting, the levels of factors are also essential for machine learning (defining the target group), as you will see in the next chapters. Let's relevel the factors in our data set using mutate() and fct_relevel(). We will specify PDAC patients (T) as a target group (the primary group). Here is the code with explanations:

# EXAMPLE 2b
# NOTE!
# Remember that the `Label` column type has to be adjusted to factor!
# Otherwise, the output will be NULL. 
# Adjust the column type if necessary:
data$Label <- as.factor(data$Label)

# Using mutate() and fct_relevel() to relevel factors in 'data' (modifying with mutate).
# First, check the order of factors using base R function levels():
levels(data$Label)

# The output in the R console:
[1] "N"   "PAN" "T"

# As you see, the target group is now set to 'N'. The order is alphabetic.
# If you would like to plot box plots, they would appear in the order from above.
# Now, let's use mutate() to relevel factors and set 'T' as a target factor:
data <- 
  data %>%
  mutate(Label = fct_relevel(Label, "T"))
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() the `Label` column, namely, relevel the factors and set 'T' as the target.
# 4. Store it all as 'data' in the global environment.

# Now, let's check again the levels of factors:
levels(data$Label)

# The output in the R console:
[1] "T"   "N"   "PAN"

# Using fct_relevel(), one can also specify the order of other factors.
# Otherwise, as you see above, the rest will be arranged alphabetically.
# To relevel all factors according to your preferences, run:
data <- 
  data %>%
  mutate(Label = fct_relevel(Label, "T", "PAN", "N"))

# Recheck the new levels of factors:
levels(data$Label)

# And the final output in the R console:
[1] "T"   "PAN" "N"

across() and where()

We have been working with single columns so far. However, the mutate() function can be used to perform modifications of multiple columns in one line of code. Here, we will need to introduce two additional functions: across() and where(). By using across() (dplyr library), we can simplify performing the same transformation on multiple columns. In turn, where() (tidyselect library) is a selection helper. It selects variables for which a condition would return TRUE, e.g., is.charater output is TRUE. See the first example below:

# EXAMPLE 2c
# Using mutate to modify all <dbl> columns into <int>:
data.mutated.2c <- 
  data %>%
  mutate(across(where(is.numeric), as.integer))
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() across all columns where is.numeric returns T, and change them into <int>,
# 4. Store the modified tibble in the global environment as 'data.mutated.2c'. 

print(data.mutated.2c)

This line of code produces the following output:

It is also possible to introduce more than one condition as an argument for the across() function:

# EXAMPLE 2d
# Using mutate to modify selected <dbl> columns into <int>:
# Two conditions under across()
data.mutated.2d <- 
  data %>%
  mutate(across(!c(`CE 16:1`, `CE 16:0`, `CE 18:3`) & where(is.numeric), as.integer))
  
  # Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() across all columns
#    1) EXCEPT FOR `CE 16:1`, `CE 16:0`, and `CE 18:3` 
#        - provided as vector -- c() 
#        - '!' means IS.NOT  
#    AND
#    
#    2) is.numeric returns T
#    ...and change them into <int>,
# 4. Store the modified tibble in the global environment as 'data.mutated.2d'.
  
print(data.mutated.2d)

And the output:

Using mutate(), across(), and where(), you can also perform log-transformation or scaling:

# EXAMPLE 2e
# Using mutate to log-transform all numeric columns:
data.mutated.2e <- 
  data %>%
  mutate(across(where(is.numeric), log10))
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() across all columns where is.numeric returns T, and change them into log10,
# 4. Store the modified tibble in the global environment as 'data.mutated.2e'.  
  
print(data.mutated.2e)

...and output:

To perform Pareto-scaling, we must deliver a Pareto-scaling function to the mutate() function. We can also define such a function in R. See the following example below:

# EXAMPLE 2f
# Using mutate to Pareto-scale all numeric columns (advanced):
# First, a Pareto-scaling function must be created. 
# As you are starting with R, use the function below that we created for you:
Pareto.scaling <- function(x) {(x-mean(x))/sqrt(sd(x))}

# It enables performing Pareto-scaling for every element in a column of a data frame.

# Pareto-scaling is widely used in lipidomics and metabolomics.
# We are going to Pareto-scale the log-transformed data stored as 'data.mutated.2e'.
data.mutated.2f <- 
  data.mutated.2e %>%
  mutate(across(where(is.numeric), Pareto.scaling))
  
# Explanation:
# 1. Take 'data.mutated.2e' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() across all columns where is.numeric returns T, and change them into Pareto-scaled values,
# 4. Store the modified tibble in the global environment as 'data.mutated.2f'. 

print(data.mutated.2f)

The final output constitutes a data frame with log-transformed and Pareto-scaled data. Using such a data set, for example, a PCA analysis could be performed. Take a look at the final data set:

Finally, we will show you that you can also delete columns using mutate(). Simply, set the column name to NULL, to eliminate it from the data set:

# EXAMPLE 3
# Using mutate to remove columns from a data frame (tibble):
data.mutated.3 <- 
  data %>%
  mutate(`CE 16:1` = NULL,
         `CE 16:0` = NULL,
         `CE 18:3` = NULL,
         `CE 18:2` = NULL)
         
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate(): remove columns: `CE 16:1`, `CE 16:0`, `CE 18:3`, `CE 18:2`,
# 4. Store the modified tibble in the global environment as 'data.mutated.3'. 

print(data.mutated.3)

And the output:

Additional examples and explanations are also available on the tidyverse collection website:

group_by()

The group_by() function groups data by one or more factors. The group_by() function could be applied to group data for computing statistical parameters, e.g., mean, median, or hypothesis testing per a defined group. Examples are shown below and in the next subchapters.

# Application of group_by() function, e.g. computing mean & SD
mean.sd <- 
  data %>%
  select(`Sample Name`, 
         `Label`, 
         `LPC 18:2`, 
         `Cer 39:1;O2`, 
         `Cer 41:1;O2`, 
         `SM 41:1;O2`) %>%
  pivot_longer(cols = `LPC 18:2`:`SM 41:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(`Label`, `Lipids`) %>%
  summarise(mean = mean(Concentrations),
            sd = sd(Concentrations))

# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. select() columns: `Sample Name`, `Label`, `LPC 18:2`, `Cer 39:1;O2`, `Cer 41:1;O2`, `SM 41:1;O2`,
# 4. Push it through the pipe,
# 5. Create a long data frame with pivot_longer(),
# 6. Push it through the pipe, 
# 7. Group the entries in the new long data frame by `Label` and `Lipids` for computing mean & SD,
# 8. Push it through the pipe,
# 9. Use summarise() from dplyr package to compute mean and standard deviation for every lipid in every group,
# 10. Store the computed data as 'mean.sd' in the global environment.

print(mean.sd)

And the output:

Here, as we wanted to know the mean value and standard deviation per group and for every lipid separately, we applied group_by() using Label and Lipids columns from a new long tibble.

More examples and explanations can also be found here:

arrange()

The arrange() function is useful if one would like to inspect particular columns or rows. The function allows reordering rows by one selected column or even multiple columns. Lipid or metabolite concentrations can be arranged in descending or ascending order.

More information and examples can also be found here:

See examples below:

# Using arrange() to order rows by column values:
# EXAMPLE 1 - by increasing concentrations
data.arranged.increasing <-
  data %>%
  arrange(`CE 16:1`)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. arrange() rows by increasing concentration of CE 16:1,
# 4. Store data in the global environment as 'data.arranged.increasing'. 

print(data.arranged.increasing, n = 20)

The output from EXAMPLE 1:

# EXAMPLE 2 - by decreasing concentrations
data.arranged.decreasing <-
  data %>%
  arrange(desc(`CE 16:1`))


# OR

data.arranged.decreasing <-
  data %>%
  arrange(-`CE 16:1`)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. arrange() rows by decreasing concentration of CE 16:1,
# 4. Store data in the global environment as 'data.arranged.decreasing'. 

print(data.arranged.decreasing, n = 20)

The output from EXAMPLE 2:

# EXAMPLE 3 - reordering rows by multiple variables:
data.arranged.multiple <-
  data %>%
  arrange(`Label`, -`CE 16:1`)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. arrange() rows:
#   1) First - by `Label`, 
#   2) By the decreasing concentration of CE 16:1,
# 4. Store data in the global environment as 'data.arranged.multiple'.
  
print(data.arranged.multiple, n = 20)

The output from the EXAMPLE 3:

Set n for example to 200 in the print() function to recheck the complete output from EXAMPLE 3 and the order of Label.

slice()

Using wrangling verbs from above, you know now how to select, remove, or duplicate columns of a tibble. The slice() function enables performing all these operations but on rows. Different variants of slice() enable separating the first rows via slice_head(), last rows via slice_tail(), or random rows by slice_sample() of a tibble. Using slice_max() or slice_min(), rows can also be selected based on the maximum or minimum value of a variable.

The slice function is a perfect tool to separate, e.g. lipids or metabolites with the highest fold change, or with the statistical significance, or variable importance according to a model, etc.

Below you will find examples of applications of slice() function:

# Using slice() function 
# EXAMPLE 1: slice head, 10 first entries: 
data.sliced.1 <-
  data %>%
  slice_head(n=10)

# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() first 10 rows,
# 4. Store data in the global environment as 'data.sliced.1'.

print(data.sliced.1)

The output:

# EXAMPLE 2: slice tail, 10 last rows:
data.sliced.2 <-
  data %>%
  slice_tail(n=10)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() last 10 rows,
# 4. Store data in the global environment as 'data.sliced.2'.

print(data.sliced.2)

And the output:

# EXAMPLE 3: slice 10 random rows:
set.seed(111)
data.sliced.3 <-
  data %>%
  slice_sample(n=10)
  
# Explanation:
# First - set.seed to 111. Simply run line: set.seed(111). 
# This will allow you to recreate exactly the same outputs as presented here.
# Use set.seed() always if a random process is performed.
# This will allow others to recreate all your steps.
# Next,
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() 10 random rows,
# 4. Store data in the global environment as 'data.sliced.3'.

print(data.sliced.3)

The output from these lines of code:

# EXAMPLE 4: slice 10 rows with 10 highest concentrations of CE 16:1
data.sliced.4 <-
  data %>%
  slice_max(`CE 16:1`, n=10)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() 10 rows corresponding to samples with 10 highest CE 16:1 concentrations,
# 4. Store data in the global environment as 'data.sliced.4'.

print(data.sliced.4)

This code results in the following output:

# EXAMPLE 5: slice 10 rows with 10 lowest concentrations of CE 16:1
data.sliced.5 <-
  data %>%
  slice_min(`CE 16:1`, n=10)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() 10 rows corresponding to samples with 10 lowest CE 16:1 concentrations,
# 4. Store data in the global environment as 'data.sliced.5'.

print(data.sliced.5)

The output from EXAMPLE 5:

You can also slice rows selected by you based on the row numbers, as in the example below:

# EXAMPLE 6: slice selected rows:
data.sliced.6 <-
  data %>%
  slice(2:5, 7, 10:12)

# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() rows no. from 2 to 5, no. 7, and no. from 10 to 12.
# 4. Store data in the global environment as 'data.sliced.6'.

print(data.sliced.6)