Omics data visualization in R and Python

Data wrangling syntaxes useful in OMICs mining


In the previous subchapters, we showed how to change data formats and build pipelines. Now it is time to move on to the data-wrangling verbs provided by the dplyr library (part of the tidyverse collection). Data wrangling is essential in OMICs analysis because data seldom arrive in a format perfectly tailored to a specific analysis.

Using single-word commands, you can select or rearrange columns and rows and then group and summarize them. For example, while computing descriptive statistics for a lipidomics or metabolomics data set, we can:

1) Select columns for comparisons: a factor column (our biological group) and numeric columns (with lipid or metabolite concentrations). This way, we separate the interesting part of our data from the entire data frame, which frequently contains additional information, e.g., sample names, batch numbers, clinical parameters, patient codes, etc.

2) If necessary, we can filter the data by rows, e.g., to exclude QC samples, blanks, and system standards from the computations, or even to remove an entire group of samples that we do not want to analyze or compare.

3) We can arrange the samples in a specific way to examine the content after filtering and selection.

4) We can rearrange classic wide tibbles into long ones.

5) In the long tibble, we can group entries for each lipid and metabolite for further analysis.

6) Finally, we can summarize our data for each biological group separately.

You can also create a new column if necessary, e.g., log-transform values and store them in a new column, or compute a ratio of two values.

Using pipes, you can pass data from one function to the next.
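
As a preview, the whole workflow above can be sketched as a single pipeline. The tibble and column names below are hypothetical and only illustrate the idea:

```r
library(tidyverse)

# Hypothetical wide tibble: one factor column and two lipid columns
data <- tibble(
  Label      = factor(c("N", "N", "T", "T", "QC")),
  `LPC 16:0` = c(10, 12, 20, 22, 15),
  `LPC 18:0` = c(5, 6, 9, 10, 7)
)

summary.stats <-
  data %>%
  select(Label, `LPC 16:0`, `LPC 18:0`) %>%   # 1) keep factor + numeric columns
  filter(Label != "QC") %>%                   # 2) drop QC samples
  droplevels() %>%                            #    forget the unused 'QC' level
  pivot_longer(cols = -Label,                 # 4) wide -> long tibble
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(Label, Lipids) %>%                 # 5) group per biological group and lipid
  summarise(mean = mean(Concentrations),      # 6) summarize each group
            sd = sd(Concentrations),
            .groups = "drop")

print(summary.stats)
```

Each of the verbs used here is explained in detail in the sections that follow.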

Here, we will show you how to use the following functions:

  • select()

  • filter()

  • mutate()

  • across() & where()

  • group_by()

  • arrange()

  • slice()

Some of these functions were already used in the previous examples. To use all of them, we load the tidyverse collection again:

# Calling library
library(tidyverse)

select()

The select() function keeps only the selected columns of a data frame. Columns can be selected by name or by column number, as in the examples below:

# Selecting columns by name from the 'data' and storing as 'data.selected':
data.selected.1 <-
  data %>% 
  select(`Sample Name`,
         `Label`, 
         `LPC 16:0`,
         `LPC 18:0`,
         `LPC 18:1`,
         `LPC 18:2`,
         `SM 32:1;O2`,
         `SM 39:1;O2`,
         `SM 41:1;O2`)
         
# In these lines, we:         
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to select(),
# 3. Keep only columns selected using column names,
# 4. Store the output as 'data.selected.1'.
                           
print(data.selected.1)
         
# Selecting columns by column number from the 'data' and storing as 'data.selected':
data.selected.2 <-
  data %>% 
  select(1:3, 65, 125:129)
  
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to select(),
# 3. Keep only columns selected by column number,
# 4. Store the output as 'data.selected.2'.

print(data.selected.2)

We obtain the following two outputs:

More examples and information can be found in the dplyr documentation for select().

filter()

The filter() function keeps rows that fulfill a condition provided as its argument. First, you need to know which relational and logical operators can be used to formulate such a condition:

# Relational operators used to formulate conditions:
# 1. == equal to
# 2. != is not equal to
# 3. > is greater than
# 4. < is less than
# 5. >= is greater than or equal to
# 6. <= is less than or equal to

# Logical operators to formulate conditions:
# 1. & AND 
# 2. | OR
# 3. ! NOT

# Additionally, it is good to know the arithmetic operators:
# 1. + addition 
# 2. - subtraction
# 3. ^ exponent 
# 4. * multiplication
# 5. / division
# 6. %% modulus

# Assignment operators (reminder):
# 1. = e.g. data = read_xlsx(file.choose())
# 2. <- e.g. data <- read_xlsx(file.choose()) - leftwards assignment (at the beginning)
# 3. -> e.g. read_xlsx(file.choose()) -> data - rightwards assignment (at the end)

Now, we will use relational and logical operators to create some example filters. See the code block below:

# Filtering data with filter() function based on a condition:

# EXAMPLE 1
data.filtered.1 <-
  data %>% 
  filter(Label == 'N' | Label == 'T')
  
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to filter(),
# 3. Keep only rows containing in the Label column 'N' OR 'T',
# 4. Store the output as 'data.filtered.1'.
  
# To check the effect of the filtration run:
summary(data.filtered.1)

# As you see, even though we filtered all 'PAN' samples out, the 'PAN' factor level is remembered.
# To completely remove the 'PAN' level from the data frame, you additionally need to run:
data.filtered.1 <- data.filtered.1 %>% droplevels()
summary(data.filtered.1)
# The droplevels() function from base R drops unused factor levels in data frames.

# EXAMPLE 2
data.filtered.2 <- 
  data %>% 
  filter(`CE 16:1` > 1000)
  
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to filter(),
# 3. Keep only rows containing the concentration of CE 16:1 over 1000,
# 4. Store the output as 'data.filtered.2'.

# Printing new tibble obtained after filtration of 'data':
print(data.filtered.2)

# EXAMPLE 3
data.filtered.3 <- 
  data %>% 
  filter(`CE 16:1` >= median(`CE 16:1`, na.rm = TRUE))
  
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to filter(),
# 3. Keep only rows with a concentration of CE 16:1 greater than or equal to the median of this column,
# 4. Store the output as 'data.filtered.3'.
# 5. Setting the argument na.rm to TRUE removes missing values before the median is computed.

median(data$`CE 16:1`, na.rm = TRUE)
# Median is equal to 538.9805. All values above or equal will be retained.
# Recheck of the filtration process:
summary(data.filtered.3)

# EXAMPLE 4
data.filtered.4 <-
  data %>% 
  filter(`CE 16:1` > median(`CE 16:1`) & Label != 'N')

# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to filter(),
# 3. Keep rows with a concentration of CE 16:1 higher than the median AND `Label` not equal to 'N',
# 4. Store the output as 'data.filtered.4'.
# Recheck of the filtration process:
summary(data.filtered.4)

The output of each of these lines is summarized below:

For more examples and explanations, see the dplyr documentation for filter().

mutate()

The mutate() function is used to create, modify, and delete columns in a data frame. Examples of its application are in the code block below:

# Applications of mutate() function
# EXAMPLE 1a - creating a new column
data.mutated.1a <- 
  data %>%
  select(`Sample Name`, `Label`, `LPC 18:1`, `Cer 41:1;O2`, `SM 41:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  mutate(`LPC 18:1 [pmol/mL]` = 1000 * `LPC 18:1`)

# In these lines of code, we:
# 1. Take the 'data' from the global environment,
# 2. Push it through the pipe, 
# 3. Select certain columns from the entire data set,
# 4. Push it through the pipe,
# 5. Filter to keep only rows for which column `Label` contains 'N' OR 'T',
# 6. Push it through the pipe,
# 7. We use mutate() to create one column: conc. of LPC 18:1 expressed in [pmol/mL].

print(data.mutated.1a)
# EXAMPLE 1b - creating new columns (more than one)
data.mutated.1b <- 
  data %>%
  select(`Sample Name`, `Label`, `LPC 18:1`, `Cer 41:1;O2`, `SM 41:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  mutate(`Log10(LPC 18:1)` = log10(`LPC 18:1`),
         `Log2(Cer 41:1;O2)` = log(`Cer 41:1;O2`,2),
         `Cer 41:1;O2 to SM 41:1;O2` = `Cer 41:1;O2`/`SM 41:1;O2`)

# In these lines of code, we:
# 1. Take the 'data' from the global environment,
# 2. Push it through the pipe, 
# 3. Select certain columns from the entire data set,
# 4. Push it through the pipe,
# 5. Filter to keep only rows for which column `Label` contains 'N' OR 'T',
# 6. Push it through the pipe,
# 7. We use mutate() to create three new columns:
#   1) `Log10(LPC 18:1)`, which is log10 of all concentrations of LPC 18:1,
#   2) `Log2(Cer 41:1;O2)`, which is log2 of all concentrations of Cer 41:1;O2,
#   3) `Cer 41:1;O2 to SM 41:1;O2`, which is ratio of Cer 41:1;O2 to SM 41:1;O2 conc.

print(data.mutated.1b)

And the modified tibble:

# EXAMPLE 2a - modifying content of columns:
data.mutated.2a <- 
  data %>%
  select(`Sample Name`, `Label`, `LPC 18:1`, `Cer 41:1;O2`, `SM 41:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  mutate(`LPC 18:1` = 1000 * `LPC 18:1`) 

# In these lines of code we:
# 1. Take the 'data' from the global environment,
# 2. Push it through the pipe, 
# 3. Select certain columns from the entire data set,
# 4. Push it through the pipe,
# 5. Filter to keep only rows for which column `Label` contains 'N' OR 'T',
# 6. Push it through the pipe,
# 7. We use mutate() to express the concentration of LPC 18:1 in [pmol/mL].
# Here, we modify `LPC 18:1` by multiplying the concentrations in the column by 1000.
# No new column is created.

And the output:

The mutate() function can also be used for more complex modifications, including so-called 'reordering of factor levels'. This sounds difficult, but it is a simple operation. Suppose you have three biological groups annotated as 'Healthy', 'Cancer', and 'Therapy'. By default, R relies on the alphabetical order of factor levels. This means that box plots or bar plots would automatically appear in the chart in alphabetical order: first 'Cancer', then 'Healthy', and finally 'Therapy'. However, the alphabetical order is often not how we want to arrange the groups in a plot. This is where we relevel the factors and specify that 'Healthy' should come first, 'Cancer' second, and 'Therapy' last.

In addition to the dplyr package, we will need the fct_relevel() function from the forcats package for this operation. As a reminder, the forcats library contains tools for working with factors. Apart from plotting, factor levels are also essential in machine learning (defining the target group), as you will see in the next chapters.

Let's relevel the factors in our data set using mutate() and fct_relevel(). We will specify PDAC patients (T) as the target (primary) group. Here is the code with explanations:

# EXAMPLE 2b
# NOTE!
# Remember that the `Label` column type has to be adjusted to factor!
# Otherwise, levels() will return NULL.
# Adjust the column type if necessary:
data$Label <- as.factor(data$Label)

# Using mutate() and fct_relevel() to relevel factors in 'data' (modifying with mutate).
# First, check the order of factors using base R function levels():
levels(data$Label)

# The output in the R console:
# [1] "N"   "PAN" "T"

# As you see, the first (reference) level is 'N'. The order is alphabetical.
# If you plotted box plots now, the groups would appear in the order shown above.
# Now, let's use mutate() to relevel factors and set 'T' as a target factor:
data <- 
  data %>%
  mutate(Label = fct_relevel(Label, "T"))
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() the `Label` column, namely, relevel the factors and set 'T' as the target.
# 4. Store it all as 'data' in the global environment.

# Now, let's check again the levels of factors:
levels(data$Label)

# The output in the R console:
# [1] "T"   "N"   "PAN"

# Using fct_relevel(), one can also specify the order of other factors.
# Otherwise, as you see above, the rest will be arranged alphabetically.
# To relevel all factors according to your preferences, run:
data <- 
  data %>%
  mutate(Label = fct_relevel(Label, "T", "PAN", "N"))

# Recheck the new levels of factors:
levels(data$Label)

# And the final output in the R console:
# [1] "T"   "PAN" "N"

across() and where()

So far, we have been working with single columns. However, the mutate() function can also modify multiple columns in one line of code. Here, we need to introduce two additional functions: across() and where(). Using across() (dplyr library), we can apply the same transformation to multiple columns at once. In turn, where() (tidyselect library) is a selection helper: it selects the variables for which a condition returns TRUE, e.g., those for which is.character() returns TRUE. See the first example below:

# EXAMPLE 2c
# Using mutate to modify all <dbl> columns into <int>:
data.mutated.2c <- 
  data %>%
  mutate(across(where(is.numeric), as.integer))
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() across all columns where is.numeric returns T, and change them into <int>,
# 4. Store the modified tibble in the global environment as 'data.mutated.2c'. 

print(data.mutated.2c)

This line of code produces the following output:

It is also possible to provide more than one condition as an argument to the across() function:

# EXAMPLE 2d
# Using mutate to modify selected <dbl> columns into <int>:
# Two conditions under across()
data.mutated.2d <- 
  data %>%
  mutate(across(!c(`CE 16:1`, `CE 16:0`, `CE 18:3`) & where(is.numeric), as.integer))
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() across all columns
#    1) EXCEPT FOR `CE 16:1`, `CE 16:0`, and `CE 18:3`
#        - provided as a vector via c()
#        - '!' means NOT
#    AND
#    
#    2) is.numeric returns T
#    ...and change them into <int>,
# 4. Store the modified tibble in the global environment as 'data.mutated.2d'.
  
print(data.mutated.2d)

And the output:

Using mutate(), across(), and where(), you can also perform log-transformation or scaling:

# EXAMPLE 2e
# Using mutate to log-transform all numeric columns:
data.mutated.2e <- 
  data %>%
  mutate(across(where(is.numeric), log10))
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() across all columns where is.numeric returns T, applying log10() to them,
# 4. Store the modified tibble in the global environment as 'data.mutated.2e'.  
  
print(data.mutated.2e)

...and output:

To perform Pareto scaling, we must supply a Pareto-scaling function to mutate(). We can define such a function ourselves in R. See the example below:

# EXAMPLE 2f
# Using mutate to Pareto-scale all numeric columns (advanced):
# First, a Pareto-scaling function must be created. 
# As you are starting with R, use the function below that we created for you:
Pareto.scaling <- function(x) {(x-mean(x))/sqrt(sd(x))}

# It enables performing Pareto-scaling for every element in a column of a data frame.

# Pareto-scaling is widely used in lipidomics and metabolomics.
# We are going to Pareto-scale the log-transformed data stored as 'data.mutated.2e'.
data.mutated.2f <- 
  data.mutated.2e %>%
  mutate(across(where(is.numeric), Pareto.scaling))
  
# Explanation:
# 1. Take 'data.mutated.2e' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate() across all columns where is.numeric returns T, and change them into Pareto-scaled values,
# 4. Store the modified tibble in the global environment as 'data.mutated.2f'. 

print(data.mutated.2f)

The final output is a data frame with log-transformed and Pareto-scaled data. Such a data set could be used, for example, to perform a PCA analysis. Take a look at the final data set:
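
Note that the Pareto.scaling() function above returns NA for an entire column if that column contains any missing value. If NAs may still be present in your data, an NA-tolerant variant (our suggestion, not part of the original script) could look like this:

```r
# NA-tolerant Pareto scaling (assumption: NAs should simply be ignored):
# mean and SD are computed on the non-missing values only.
Pareto.scaling.na <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sqrt(sd(x, na.rm = TRUE))
}

# Quick check on a small vector with one NA:
x <- c(1, 2, 3, NA)
Pareto.scaling.na(x)
```

The missing entries stay NA in the output, while all other values are scaled as before.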

Finally, we will show you that you can also delete columns using mutate(). Simply set a column to NULL to eliminate it from the data set:

# EXAMPLE 3
# Using mutate to remove columns from a data frame (tibble):
data.mutated.3 <- 
  data %>%
  mutate(`CE 16:1` = NULL,
         `CE 16:0` = NULL,
         `CE 18:3` = NULL,
         `CE 18:2` = NULL)
         
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe to mutate(),
# 3. mutate(): remove columns: `CE 16:1`, `CE 16:0`, `CE 18:3`, `CE 18:2`,
# 4. Store the modified tibble in the global environment as 'data.mutated.3'. 

print(data.mutated.3)

And the output:

Additional examples and explanations are also available on the tidyverse collection website.

group_by()

The group_by() function groups data by one or more factors. It can be applied, e.g., to compute statistical parameters (mean, median) or to run hypothesis tests per defined group. Examples are shown below and in the next subchapters.

# Application of group_by() function, e.g. computing mean & SD
mean.sd <- 
  data %>%
  select(`Sample Name`, 
         `Label`, 
         `LPC 18:2`, 
         `Cer 39:1;O2`, 
         `Cer 41:1;O2`, 
         `SM 41:1;O2`) %>%
  pivot_longer(cols = `LPC 18:2`:`SM 41:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(`Label`, `Lipids`) %>%
  summarise(mean = mean(Concentrations),
            sd = sd(Concentrations))

# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. select() columns: `Sample Name`, `Label`, `LPC 18:2`, `Cer 39:1;O2`, `Cer 41:1;O2`, `SM 41:1;O2`,
# 4. Push it through the pipe,
# 5. Create a long data frame with pivot_longer(),
# 6. Push it through the pipe, 
# 7. Group the entries in the new long data frame by `Label` and `Lipids` for computing mean & SD,
# 8. Push it through the pipe,
# 9. Use summarise() from dplyr package to compute mean and standard deviation for every lipid in every group,
# 10. Store the computed data as 'mean.sd' in the global environment.

print(mean.sd)

And the output:

Here, as we wanted to know the mean and standard deviation for every lipid in every group separately, we applied group_by() on the Label and Lipids columns of the new long tibble.

More examples and explanations can also be found in the dplyr documentation for group_by().

arrange()

The arrange() function is useful when one would like to inspect particular columns or rows. It reorders rows by one selected column or even by multiple columns; lipid or metabolite concentrations can be arranged in descending or ascending order.

More information and examples can be found in the dplyr documentation for arrange().

See examples below:

# Using arrange() to order rows by column values:
# EXAMPLE 1 - by increasing concentrations
data.arranged.increasing <-
  data %>%
  arrange(`CE 16:1`)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. arrange() rows by increasing concentration of CE 16:1,
# 4. Store data in the global environment as 'data.arranged.increasing'. 

print(data.arranged.increasing, n = 20)

The output from EXAMPLE 1:

# EXAMPLE 2 - by decreasing concentrations
data.arranged.decreasing <-
  data %>%
  arrange(desc(`CE 16:1`))


# OR

data.arranged.decreasing <-
  data %>%
  arrange(-`CE 16:1`)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. arrange() rows by decreasing concentration of CE 16:1,
# 4. Store data in the global environment as 'data.arranged.decreasing'. 

print(data.arranged.decreasing, n = 20)

The output from EXAMPLE 2:

# EXAMPLE 3 - reordering rows by multiple variables:
data.arranged.multiple <-
  data %>%
  arrange(`Label`, -`CE 16:1`)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. arrange() rows:
#   1) First - by `Label`, 
#   2) By the decreasing concentration of CE 16:1,
# 4. Store data in the global environment as 'data.arranged.multiple'.
  
print(data.arranged.multiple, n = 20)

The output from the EXAMPLE 3:

Set n to, for example, 200 in the print() function to inspect the complete output of EXAMPLE 3 and verify the order of Label.

slice()

Using the wrangling verbs above, you now know how to select, remove, or rearrange the columns of a tibble. The slice() function enables analogous operations on rows. Different variants of slice() extract the first rows via slice_head(), the last rows via slice_tail(), or random rows via slice_sample(). Using slice_max() or slice_min(), rows can also be selected based on the maximum or minimum values of a variable.

The slice() family is a perfect tool to extract, e.g., the lipids or metabolites with the highest fold change, the strongest statistical significance, or the highest variable importance according to a model.
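
For instance, slice_max() can pull out the lipids with the highest fold change between two groups. Here is a sketch on synthetic data; the table of group means and all column names are hypothetical:

```r
library(tidyverse)

# Hypothetical table of per-lipid group means (synthetic numbers)
group.means <- tibble(
  Lipids = c("LPC 16:0", "LPC 18:0", "SM 41:1;O2", "Cer 41:1;O2"),
  mean_N = c(10, 5, 8, 2),
  mean_T = c(30, 6, 4, 10)
)

top.fold.change <-
  group.means %>%
  mutate(log2FC = log2(mean_T / mean_N)) %>%   # fold change T vs. N, log2 scale
  slice_max(abs(log2FC), n = 2)                # keep the 2 most changed lipids

print(top.fold.change)
```

Taking abs(log2FC) keeps both strongly up- and down-regulated lipids in the result.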

Below you will find examples of applications of slice() function:

# Using slice() function 
# EXAMPLE 1: slice head, 10 first entries: 
data.sliced.1 <-
  data %>%
  slice_head(n=10)

# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() first 10 rows,
# 4. Store data in the global environment as 'data.sliced.1'.

print(data.sliced.1)

The output:

# EXAMPLE 2: slice tail, 10 last rows:
data.sliced.2 <-
  data %>%
  slice_tail(n=10)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() last 10 rows,
# 4. Store data in the global environment as 'data.sliced.2'.

print(data.sliced.2)

And the output:

# EXAMPLE 3: slice 10 random rows:
set.seed(111)
data.sliced.3 <-
  data %>%
  slice_sample(n=10)
  
# Explanation:
# First, set the seed to 111 by running: set.seed(111).
# This allows you to recreate exactly the same output as presented here.
# Always use set.seed() when a random process is involved.
# This will allow others to recreate all your steps.
# Next,
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() 10 random rows,
# 4. Store data in the global environment as 'data.sliced.3'.

print(data.sliced.3)

The output from these lines of code:

# EXAMPLE 4: slice 10 rows with 10 highest concentrations of CE 16:1
data.sliced.4 <-
  data %>%
  slice_max(`CE 16:1`, n=10)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() 10 rows corresponding to samples with 10 highest CE 16:1 concentrations,
# 4. Store data in the global environment as 'data.sliced.4'.

print(data.sliced.4)

This code results in the following output:

# EXAMPLE 5: slice 10 rows with 10 lowest concentrations of CE 16:1
data.sliced.5 <-
  data %>%
  slice_min(`CE 16:1`, n=10)
  
# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() 10 rows corresponding to samples with 10 lowest CE 16:1 concentrations,
# 4. Store data in the global environment as 'data.sliced.5'.

print(data.sliced.5)

The output from EXAMPLE 5:

You can also slice rows selected by you based on the row numbers, as in the example below:

# EXAMPLE 6: slice selected rows:
data.sliced.6 <-
  data %>%
  slice(2:5, 7, 10:12)

# Explanation:
# 1. Take 'data' from the global environment,
# 2. Push it through the pipe,
# 3. slice() rows no. from 2 to 5, no. 7, and no. from 10 to 12.
# 4. Store data in the global environment as 'data.sliced.6'.

print(data.sliced.6)

And the output in the R console:

You will also find more examples and explanations in the dplyr documentation for slice().

All code blocks collected in one R script can be downloaded here:



Attached R script (15 KB): 'Using select, filter, mutate, group_by, arrange, and slice functions.R' — data wrangling verbs from dplyr.
A/ Output 'data.selected.1' and columns selected by name. B/ Output 'data.selected.2' and columns selected by column number.
EXAMPLE 1. Effect of data filtration: keeping rows containing 'N' or 'T' labels only (upper part). Dropping unused factor 'PAN' after filtration with droplevels().
EXAMPLE 2. 31 patients left after the filtration CE 16:1 > 1000.
EXAMPLE 3 (A). Effect of data filtration based on the CE 16:1 column median. Only values greater than or equal to 538.9805 were kept. EXAMPLE 4 (B). Effect of simultaneous data filtration based on Label (is not 'N') and values greater than the median of the CE 16:1 column.
EXAMPLE 1a: Creating new column with mutate(): expressing LPC 18:1 in pmol/mL.
EXAMPLE 1b: Creating new columns with mutate(): log10 of LPC 18:1 concentrations, log2 of Cer 41:1;O2 concentrations, and a ratio of Cer 41:1;O2 to SM 41:1;O2.
EXAMPLE 2a: Modifying content of LPC 18:1 using mutate().
Using mutate() with across() to transform multiple columns.
Using mutate() to change selected columns into <int> with two conditions under across().
Log-transformation of data using mutate(), across(), and where().
Log-transformed and Pareto-scaled data obtained via mutate(), across(), where(), log10(), and the custom Pareto.scaling() function.
Removing selected columns from a data frame using mutate().
Grouping entries in the new long tibble by Label and Lipids for computing mean and standard deviation.
Reordering rows by increasing concentration of CE 16:1 using arrange().
Reordering rows by decreasing concentration of CE 16:1 using arrange().
Reordering rows first by Label and then by decreasing concentration of CE 16:1 using arrange().
EXAMPLE 1: First 10 rows of 'data' tibble sliced using slice_head().
EXAMPLE 2: Last 10 rows of 'data' tibble sliced using slice_tail().
EXAMPLE 3: Random 10 rows of 'data' tibble sliced using slice_sample(). Seed set to 111 allows recreating the same output.
EXAMPLE 4: 10 rows of 'data' tibble containing the highest concentration of CE 16:1 sliced using slice_max().
EXAMPLE 5: 10 rows of 'data' tibble containing the lowest concentration of CE 16:1 sliced using slice_min().
EXAMPLE 6: Selected rows of 'data' tibble sliced using slice().