Data wrangling syntaxes useful in OMICs mining

Useful tricks and features in OMICs mining

In the previous subchapters, we showed how to modify the data format and build pipelines. Now it is time to move on to the data-wrangling verbs provided by the dplyr library (part of the tidyverse collection). Data wrangling matters in OMICs analysis because data seldom arrive in a format that is ready for a specific analysis.

Using single-word commands, you can select or rearrange columns and rows and then group and summarize them. For example, while computing descriptive statistics for a lipidomics or metabolomics data set, we can:

1) Select columns for comparisons: a factor column (our biological group) and numeric columns (with lipid or metabolite concentrations). This way, we separate interesting bits of our data from the entire data frame, which frequently contains additional information, e.g., sample names, batch numbers, clinical parameters, patient code, etc.

2) If necessary, we can filter the data by rows, e.g., to exclude QC samples, blanks, or system standards from the computations, or to remove an entire group of samples that we do not want to analyze or compare.

3) We can arrange the samples in a specific way to examine the content after filtrations and selections.

4) We can rearrange classic wide tibbles into long ones.

5) In the long tibble, we can group entries for each lipid or metabolite for further analysis.

6) Finally, we can summarize our data for each biological group separately.

You can create a new column if necessary, e.g., perform a log transformation and store the values in a new column or compute a ratio of two values.

Using pipes, you can pass data from one function to the other.
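The six steps listed above can be sketched as one pipeline. This is only an illustration on a small, hypothetical tibble: the sample names, the 'QC' label, the Batch column, the values, and the 'Concentrations' column name are all assumptions made for this sketch.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (all values invented for illustration)
data <- tibble(
  `Sample Name` = c("S1", "S2", "S3", "S4", "QC1", "QC2"),
  Label = factor(c("N", "N", "T", "T", "QC", "QC")),
  Batch = c(1, 1, 2, 2, 1, 2),
  `CE 16:1` = c(400, 600, 1000, 1400, 800, 820),
  `LPC 18:1` = c(10, 12, 6, 8, 9, 9.2)
)

data.summary <- data %>%
  select(Label, `CE 16:1`, `LPC 18:1`) %>%      # 1) keep the factor and lipid columns
  filter(Label != "QC") %>%                     # 2) remove QC samples
  droplevels() %>%                              #    drop the unused 'QC' level
  arrange(Label, desc(`CE 16:1`)) %>%           # 3) order rows for inspection
  pivot_longer(cols = -Label,                   # 4) wide tibble -> long tibble
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(Label, Lipids) %>%                   # 5) group per label and lipid
  summarise(mean = mean(Concentrations),        # 6) summarize each group
            .groups = "drop")

print(data.summary)
```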

Here, we will show you how to use the following functions:

  • select()

  • filter()

  • mutate()

  • across() & where()

  • group_by()

  • arrange()

  • slice()

Some of these functions were already used in the previous examples. To use all of them, we load the tidyverse collection again:
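A minimal sketch of this step is shown here. The tibble called 'data' is a hypothetical stand-in for the lipidomics data set used in this chapter, with invented values. It only serves to make the call runnable.

```r
# Load the tidyverse collection (attaches dplyr, tidyr, forcats, tibble, and others)
library(tidyverse)

# Hypothetical toy tibble (invented values, not the original data set)
data <- tibble(
  `Sample Name` = paste0("S", 1:6),
  Label = factor(c("N", "N", "T", "T", "PAN", "PAN")),
  `CE 16:1` = c(420, 1250, 980, 1600, 510, 300),
  `LPC 18:1` = c(12.1, 9.8, 15.3, 7.4, 11.0, 10.2)
)

# Quick look at column names and types
glimpse(data)
```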

select()

The function select() extracts selected columns from a data frame. Columns can be selected by their names or by their column numbers, as in the example below:
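The following sketch illustrates both ways of selecting columns. The toy tibble and its values are assumptions. The object names 'data.selected.1' and 'data.selected.2' follow the output captions of this section.

```r
library(dplyr)

# Hypothetical toy tibble (invented values)
data <- tibble(
  `Sample Name` = c("S1", "S2", "S3"),
  Label = factor(c("N", "T", "PAN")),
  `CE 16:1` = c(420, 1250, 510),
  `LPC 18:1` = c(12.1, 9.8, 11.0)
)

# Selection by name; names with spaces or colons need backticks
data.selected.1 <- data %>%
  select(Label, `CE 16:1`, `LPC 18:1`)

# Selection by column number (columns 2 to 4)
data.selected.2 <- data %>%
  select(2:4)

print(data.selected.1)
print(data.selected.2)
```

With this toy tibble, both calls return the same three columns.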

We obtain the following two outputs:

A/ Output 'data.selected.1' and columns selected by name. B/ Output 'data.selected.2' and columns selected by column number.

More examples and information can be found here:

Using select() function from the tidyverse collection.

filter()

The function filter() is used to keep rows that fulfill a condition provided as an argument to this function. First, you have to learn what relational and logical operators can be used to formulate a condition:
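As a brief illustration, the relational and logical operators of base R can be tried on plain vectors first. The vectors 'x' and 'label' below are hypothetical and exist only for this sketch.

```r
# Hypothetical vectors: concentrations and group labels
x <- c(300, 538, 1200)
label <- c("N", "T", "PAN")

# Relational operators return one TRUE/FALSE per element
gt  <- x > 538     # greater than
ge  <- x >= 538    # greater than or equal to
lt  <- x < 538     # less than
le  <- x <= 538    # less than or equal to
eq  <- x == 538    # equal to
neq <- x != 538    # not equal to
isin <- label %in% c("N", "T")   # membership in a set

# Logical operators combine or negate conditions
both   <- (label != "N") & (x > 538)   # AND: both conditions must hold
either <- (label == "N") | (x > 538)   # OR: at least one condition must hold
negate <- !(label == "N")              # NOT: inverts a condition

print(both)
print(either)
```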

Now, we will use relational and logical operators to create exemplary filters. See the code block below:
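The four filters described in this section could look roughly like this. The toy tibble is an assumption, so the row counts and the median differ from the values quoted in the captions. The object names are hypothetical as well.

```r
library(dplyr)

# Hypothetical toy tibble (invented values)
data <- tibble(
  Label = factor(c("N", "N", "T", "T", "PAN", "PAN")),
  `CE 16:1` = c(420, 1250, 980, 1600, 510, 300)
)

# EXAMPLE 1: keep only 'N' and 'T' rows, then drop the unused 'PAN' level
data.filtered.1 <- data %>%
  filter(Label %in% c("N", "T")) %>%
  droplevels()

# EXAMPLE 2: keep rows with CE 16:1 above 1000
data.filtered.2 <- data %>%
  filter(`CE 16:1` > 1000)

# EXAMPLE 3: keep rows greater than or equal to the column median
data.filtered.3 <- data %>%
  filter(`CE 16:1` >= median(`CE 16:1`))

# EXAMPLE 4: two conditions at once (Label is not 'N' AND value above the median)
data.filtered.4 <- data %>%
  filter(Label != "N" & `CE 16:1` > median(`CE 16:1`))

levels(data.filtered.1$Label)
```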

The output of each of these lines is summarized below:

EXAMPLE 1. Effect of data filtration: keeping rows containing 'N' or 'T' labels only (upper part). Dropping unused factor 'PAN' after filtration with droplevels().
EXAMPLE 2. 31 patients left after the filtration CE 16:1 > 1000.
EXAMPLE 3 (A). Effect of data filtration based on the CE 16:1 column median. Only values greater than or equal to 538.9805 were kept.
EXAMPLE 4 (B). Effect of simultaneous data filtration based on Label (is not 'N') and values greater than the median of the CE 16:1 column.

For more examples and explanations, go to:

Using filter() function from the tidyverse collection.

mutate()

The mutate() function is used to create, modify, and delete columns in a data frame. Examples of its application are in the code block below:
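A sketch of the mutate() examples captioned in this section follows. The toy values are invented. The factor of 1000 assumes that the original concentrations are given in nmol/mL, which the source does not state. The new column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy tibble (invented values)
data <- tibble(
  Label = factor(c("N", "T", "PAN")),
  `LPC 18:1` = c(12.1, 9.8, 15.3),
  `Cer 41:1;O2` = c(2, 4, 8),
  `SM 41:1;O2` = c(10, 20, 16)
)

# EXAMPLE 1a: new column with LPC 18:1 in pmol/mL
# (assumption: the input is in nmol/mL, hence the factor 1000)
data.1a <- data %>%
  mutate(`LPC 18:1 [pmol/mL]` = `LPC 18:1` * 1000)

# EXAMPLE 1b: several new columns in one call
data.1b <- data %>%
  mutate(
    `log10 LPC 18:1` = log10(`LPC 18:1`),
    `log2 Cer 41:1;O2` = log2(`Cer 41:1;O2`),
    `Cer/SM 41:1;O2` = `Cer 41:1;O2` / `SM 41:1;O2`
  )

# EXAMPLE 2a: reusing an existing column name overwrites its content
data.2a <- data %>%
  mutate(`LPC 18:1` = `LPC 18:1` * 1000)
```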

EXAMPLE 1a: Creating new column with mutate(): expressing LPC 18:1 in pmol/mL.

And the modified tibble:

EXAMPLE 1b: Creating new columns with mutate(): log10 of LPC 18:1 concentrations, log2 of Cer 41:1;O2 concentrations, and a ratio of Cer 41:1;O2 to SM 41:1;O2.

And the output:

EXAMPLE 2a: Modifying content of LPC 18:1 using mutate().

The mutate() function can be used for more complex modifications, including so-called 'reordering factor levels'. This sounds difficult, but it is a simple operation. Suppose you have three biological groups annotated as 'Healthy', 'Cancer', and 'Therapy'. By default, R orders factor levels alphabetically. As a result, box plots or bar plots show the groups in alphabetical order: first 'Cancer', then 'Healthy', and finally 'Therapy'. However, this is not how we want to arrange the groups in the plot. We need to relevel the factor and specify that 'Healthy' comes first, 'Cancer' second, and 'Therapy' last.

Besides the dplyr package, we will need the fct_relevel() function from the forcats package for this operation. As a reminder, the forcats library contains tools for working with factors. Apart from plotting, the order of factor levels is also essential for machine learning (defining the target group), as you will see in the next chapters. Let's relevel the factors in our data set using mutate() and fct_relevel(). We will specify PDAC patients (T) as the target group (the primary group). Here is the code with explanations:
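A minimal sketch of releveling with mutate() and fct_relevel() is given here, using a hypothetical toy tibble with the labels 'N', 'T', and 'PAN'. The object names are assumptions.

```r
library(dplyr)
library(forcats)

# Hypothetical toy tibble (invented values)
data <- tibble(
  Label = factor(c("N", "T", "PAN", "T", "N")),
  `CE 16:1` = c(420, 1250, 510, 980, 1600)
)

# Default (alphabetical) order of levels
levels.before <- levels(data$Label)

# Move 'T' to the first position; the remaining levels keep their order
data.releveled <- data %>%
  mutate(Label = fct_relevel(Label, "T"))

levels.after <- levels(data.releveled$Label)

print(levels.before)
print(levels.after)
```

Listing several levels in fct_relevel(), e.g., fct_relevel(Label, "T", "PAN", "N"), would set the complete order explicitly.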

across() and where()

We have been working with single columns so far. However, the mutate() function can also modify multiple columns in one line of code. For this, we need to introduce two additional functions: across() and where(). The across() function (dplyr library) applies the same transformation to multiple columns. In turn, where() (tidyselect library) is a selection helper. It selects the variables for which a condition returns TRUE, e.g., columns for which is.character() returns TRUE. See the first example below:
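As an illustration, the sketch below converts every character column into a factor. The toy tibble is hypothetical, and the transformation shown (as.factor) is an assumed example of what such a line could do.

```r
library(dplyr)

# Hypothetical toy tibble with two character columns
data <- tibble(
  `Sample Name` = c("S1", "S2", "S3"),
  Label = c("N", "T", "PAN"),
  `CE 16:1` = c(420, 1250, 510)
)

# Apply as.factor() to all columns for which is.character() returns TRUE
data.factors <- data %>%
  mutate(across(where(is.character), as.factor))

glimpse(data.factors)
```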

This line of code produces the following output:

Using mutate() with across() to transform multiple columns.

It is also possible to introduce more than one condition as an argument for the across() function:
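One way to combine two conditions is to join selection helpers with '&'. In the sketch below, the Age and Batch columns and the rule "numeric and no colon in the column name" are assumptions chosen only to show the mechanism.

```r
library(dplyr)

# Hypothetical toy tibble: Age and Batch are stored as doubles
data <- tibble(
  Label = factor(c("N", "T", "PAN")),
  Age = c(61, 54, 70),
  Batch = c(1, 2, 1),
  `CE 16:1` = c(420.5, 1250.1, 510.7)
)

# Condition 1: the column is numeric
# Condition 2: the column name does not contain ':' (i.e., it is not a lipid)
data.int <- data %>%
  mutate(across(where(is.numeric) & !contains(":"), as.integer))

glimpse(data.int)
```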

And the output:

Using mutate() to change selected columns into <int> with two conditions under across().

Using mutate(), across(), and where(), you can also perform log-transformation or scaling:
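A log-transformation of all numeric columns might be sketched as follows. The toy values are invented, and log10() is chosen to match the caption of this example.

```r
library(dplyr)

# Hypothetical toy tibble (invented values)
data <- tibble(
  Label = factor(c("N", "T", "PAN")),
  `CE 16:1` = c(100, 1000, 10),
  `LPC 18:1` = c(1, 10, 100)
)

# log10-transform every numeric column; the factor column stays untouched
data.log <- data %>%
  mutate(across(where(is.numeric), log10))

print(data.log)
```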

...and output:

Log-transformation of data using mutate(), across(), and where().

To perform Pareto-scaling, we must pass a Pareto-scaling function to mutate(). We can define such a function ourselves in R. See the following example:
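The sketch below uses the usual definition of Pareto scaling: each variable is mean-centered and divided by the square root of its standard deviation. The function name Pareto.scaling() follows the caption in this section. Its body and the toy data are assumptions, not the original code.

```r
library(dplyr)

# Pareto scaling: mean-center, then divide by the square root of the standard deviation
Pareto.scaling <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sqrt(sd(x, na.rm = TRUE))
}

# Hypothetical toy tibble (invented values)
data <- tibble(
  Label = factor(c("N", "T", "PAN")),
  `CE 16:1` = c(10, 100, 1000),
  `LPC 18:1` = c(1, 10, 100)
)

# First log-transform, then Pareto-scale all numeric columns
data.scaled <- data %>%
  mutate(across(where(is.numeric), log10)) %>%
  mutate(across(where(is.numeric), Pareto.scaling))

print(data.scaled)
```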

The final output is a data frame with log-transformed and Pareto-scaled data. Such a data set could be used, for example, as input for PCA. Take a look at the final data set:

Log-transformed and Pareto-scaled data obtained via mutate(), across(), where(), log10(), and a user-defined Pareto.scaling() function.

Finally, we will show you that you can also delete columns using mutate(). Simply set the column to NULL to remove it from the data set:
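A short sketch of removing columns this way is shown here. The toy tibble and the choice of columns to drop are hypothetical.

```r
library(dplyr)

# Hypothetical toy tibble (invented values)
data <- tibble(
  `Sample Name` = c("S1", "S2", "S3"),
  Batch = c(1, 2, 1),
  Label = factor(c("N", "T", "PAN")),
  `CE 16:1` = c(420, 1250, 510)
)

# Setting a column to NULL inside mutate() removes it
data.reduced <- data %>%
  mutate(`Sample Name` = NULL, Batch = NULL)

names(data.reduced)
```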

And the output:

Removing selected columns from a data frame using mutate().

Additional examples and explanations are also available on the tidyverse collection website:

Using mutate() function from the tidyverse collection.
Using mutate() family of functions from the tidyverse collection for multiple columns.

group_by()

The group_by() function groups data by one or more factors. It is typically applied before computing statistical parameters (e.g., mean or median) or running hypothesis tests separately for each defined group. Examples are shown below and in the next subchapters.
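Grouping a long tibble by Label and Lipids could be sketched as below. The toy values and the 'Concentrations' column name are assumptions. 'Label' and 'Lipids' follow the names used in this section.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide toy tibble (invented values)
data <- tibble(
  Label = factor(c("N", "N", "T", "T")),
  `CE 16:1` = c(400, 600, 1000, 1400),
  `LPC 18:1` = c(10, 12, 6, 8)
)

# Wide -> long: one row per sample and lipid
data.long <- data %>%
  pivot_longer(cols = -Label,
               names_to = "Lipids",
               values_to = "Concentrations")

# Mean and standard deviation per biological group and lipid
data.stats <- data.long %>%
  group_by(Label, Lipids) %>%
  summarise(mean = mean(Concentrations),
            sd = sd(Concentrations),
            .groups = "drop")

print(data.stats)
```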

And the output:

Grouping entries in the new long tibble by Label and Lipids for computing mean and standard deviation.

Here, as we wanted to know the mean value and standard deviation per group and for every lipid separately, we applied group_by() using Label and Lipids columns from a new long tibble.

More examples and explanations can also be found here:

Using group_by() function from the tidyverse collection.

arrange()

The arrange() function is useful if one would like to inspect particular columns or rows. The function allows reordering rows by one selected column or even multiple columns. Lipid or metabolite concentrations can be arranged in descending or ascending order.

More information and examples can also be found here:

Using arrange() function from the dplyr package.

See examples below:
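The three arrange() examples captioned in this section might look like this on a hypothetical toy tibble. The values and object names are invented for illustration.

```r
library(dplyr)

# Hypothetical toy tibble (invented values)
data <- tibble(
  Label = factor(c("N", "T", "PAN", "T", "N", "PAN")),
  `CE 16:1` = c(420, 1250, 510, 980, 1600, 300)
)

# EXAMPLE 1: increasing concentration of CE 16:1
data.arranged.1 <- data %>%
  arrange(`CE 16:1`)

# EXAMPLE 2: decreasing concentration of CE 16:1
data.arranged.2 <- data %>%
  arrange(desc(`CE 16:1`))

# EXAMPLE 3: first by Label (in the order of factor levels),
# then by decreasing concentration within each Label
data.arranged.3 <- data %>%
  arrange(Label, desc(`CE 16:1`))

# n controls how many rows of a tibble are printed
print(data.arranged.3, n = 200)
```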

The output from EXAMPLE 1:

Reordering rows by increasing concentration of CE 16:1 using arrange().

The output from EXAMPLE 2:

Reordering rows by decreasing concentration of CE 16:1 using arrange().

The output from EXAMPLE 3:

Reordering rows first by Label and then by decreasing concentration of CE 16:1 using arrange().

To recheck the complete output from EXAMPLE 3 and the order of Label, set n in the print() function to a value that covers all rows, for example, 200.

slice()

The wrangling verbs above showed how to select, remove, or duplicate columns of a tibble. The slice() function performs the same operations on rows. Its variants return the first rows (slice_head()), the last rows (slice_tail()), or random rows (slice_sample()) of a tibble. Using slice_max() or slice_min(), rows can also be selected based on the highest or lowest values of a variable.

The slice functions are well suited for extracting, e.g., the lipids or metabolites with the highest fold change, the strongest statistical significance, or the highest variable importance according to a model.

Below you will find examples of applications of slice() function:
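The five slice variants (EXAMPLES 1 to 5) can be sketched on a hypothetical tibble of 30 samples. The values and object names are invented. The seed 111 follows the caption of EXAMPLE 3.

```r
library(dplyr)

# Hypothetical toy tibble with 30 rows (invented values)
data <- tibble(
  `Sample Name` = paste0("S", 1:30),
  `CE 16:1` = seq(100, 3000, by = 100)
)

# EXAMPLE 1: first 10 rows
data.head <- data %>% slice_head(n = 10)

# EXAMPLE 2: last 10 rows
data.tail <- data %>% slice_tail(n = 10)

# EXAMPLE 3: 10 random rows; the seed makes the draw reproducible
set.seed(111)
data.random <- data %>% slice_sample(n = 10)

# EXAMPLE 4: 10 rows with the highest CE 16:1 concentrations
data.max <- data %>% slice_max(`CE 16:1`, n = 10)

# EXAMPLE 5: 10 rows with the lowest CE 16:1 concentrations
data.min <- data %>% slice_min(`CE 16:1`, n = 10)
```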

The output:

EXAMPLE 1: First 10 rows of 'data' tibble sliced using slice_head().

And the output:

EXAMPLE 2: Last 10 rows of 'data' tibble sliced using slice_tail().

The output from these lines of code:

EXAMPLE 3: Random 10 rows of 'data' tibble sliced using slice_sample(). Seed set to 111 allows recreating the same output.

This code results in the following output:

EXAMPLE 4: 10 rows of 'data' tibble containing the highest concentration of CE 16:1 sliced using slice_max().

The output from EXAMPLE 5:

EXAMPLE 5: 10 rows of 'data' tibble containing the lowest concentration of CE 16:1 sliced using slice_min().

You can also slice rows selected by you based on the row numbers, as in the example below:
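A minimal sketch of slicing by row numbers follows. The toy tibble and the chosen row numbers are hypothetical.

```r
library(dplyr)

# Hypothetical toy tibble with 10 rows (invented values)
data <- tibble(
  `Sample Name` = paste0("S", 1:10),
  `CE 16:1` = seq(100, 1000, by = 100)
)

# EXAMPLE 6: keep rows 1, 5, and 10
data.rows <- data %>% slice(c(1, 5, 10))

# Negative numbers drop rows instead: here rows 1 to 3 are removed
data.dropped <- data %>% slice(-(1:3))

print(data.rows)
```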

And the output in the R console:

EXAMPLE 6: Selected rows of 'data' tibble sliced using slice().

You will also find more examples and explanations here:

Using slice() function from the tidyverse collection.

All code blocks collected in one R script can be downloaded here:

Data wrangling verbs from dplyr - R script.
