Omics data visualization in R and Python
Application of pipe (%>%) functions


Before we start working with the contents of tibbles (sorting, arranging, slicing, selecting, filtering, mutating, etc.), we need to start using 'pipes'. Pipes are handy functions that help organize data preparation and analysis, including manipulating the contents of data frames, data transformation and normalization, computation, and plotting. Different computational operations on data can be chained together with pipes, creating a pipeline. A pipeline passes the output of one function directly to the next function for further processing. This builds a chain of actions that turns an R object into the desired output through an elegant, easy-to-read, well-organized block of code. We will stress again that pipes are functions and process one primary R object at a time. Pipes come from the magrittr package, which is installed together with the dplyr package as part of the tidyverse collection. Therefore, to use pipes, we will load the tidyverse collection. In code, the magrittr pipe is written as %>%.

More information about pipes can be found in the links gathered at the end of this subchapter.

If a function used in the pipeline has long arguments, it is good to split them and put each argument on its own line.

If we want to keep the output of a pipeline as an R object, we can assign it at the beginning, either in the same line as the first step of the pipeline or in a separate line; the tidyverse style guide also allows assigning the object at the end of a pipeline.

However, in this Gitbook, we will mostly rely on the first option. Below, you will find examples of pipelines adhering to these rules:

# Using pipes for the first time
# Step 1: call library
library(tidyverse)

# First exemplary pipe (short pipe - command in one line)
# Arranging CE 16:1 concentrations in descending order (from high to low)
data %>% arrange(desc(`CE 16:1`))

# Explanation of the pipe:
# 1) Take the object 'data' from the global environment, 
# 2) Push it through the pipe (from left to right) into arrange(), 
# 3) Sort the rows by the `CE 16:1` column, 
# 4) in descending order (from high to low).


# Alternatively, you could use:
data %>%
  arrange(desc(`CE 16:1`))

# or without pipe (pipe is not necessary for short operations)
arrange(data, desc(`CE 16:1`))

Now, take a look at the following block of code presenting long pipelines:

# Long pipeline - style guide for pipes (PIPELINE 1) 
data %>% 
  select(`Label`, `SM 39:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  group_by(Label) %>%
  summarize_if(is.numeric, mean)
  
# Explanation:
# 1) Take 'data' tibble from the global environment,
# 2) Push through the pipe,
# 3) Select columns: `Label` & `SM 39:1;O2` from 'data',
# 4) Push through the pipe,
# 5) Filter entries using `Label` column: keep entries containing `Label` 'N' OR 'T',
# 6) Push through the pipe,
# 7) Group by `Label`,
# 8) Push through the pipe,
# 9) Summarize: for every numeric column (here `SM 39:1;O2`), calculate the mean per group. 
  
# Pipeline with functions containing long arguments: select, pivot_longer, summarise; 
# (PIPELINE 2)
data %>% 
  select(`Sample Name`,
         `Label`, 
         `LPC 16:0`,
         `LPC 18:0`,
         `LPC 18:1`,
         `LPC 18:2`,
         `SM 32:1;O2`,
         `SM 39:1;O2`,
         `SM 41:1;O2`) %>%
  pivot_longer(cols = `LPC 16:0`:`SM 41:1;O2`, 
               names_to = 'Lipids', 
               values_to = 'Concentration') %>%
  filter(Label == 'N' | Label == 'T') %>%
  group_by(Label, Lipids) %>%
  summarise(
    `Mean concentration` = mean(Concentration),
    `SD concentration` = sd(Concentration))

# Explanation:
# 1) Take 'data' tibble from the global environment,
# 2) Push through the pipe,
# 3) Select columns,
# 4) Change the wide data frame into a long data frame,
# 5) Filter entries using `Label` column: keep 'N' OR 'T' entries,
# 6) Push through the pipe,
# 7) Group entries by `Label` and then `Lipids` (long table!),
# 8) Summarize the long table, grouped by `Label` & `Lipids`: calculate the mean & SD.

The outputs of both pipelines look like this:

Finally, we will show you how you can store the outputs of pipelines. Please see the code block below:

# Three ways of storing the output of pipelines (assignment):
# 1) At the beginning of the code, in one line with the pipeline, 
# as 'mean.value' object, with <-:

mean.value <- data %>% 
  select(`Label`, `SM 39:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  group_by(Label) %>%
  summarize_if(is.numeric, mean)
  
# 2) At the beginning of the code, in two separate lines, 
# as 'mean.value' object, with <-:

mean.value <- 
  data %>% 
  select(`Label`, `SM 39:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  group_by(Label) %>%
  summarize_if(is.numeric, mean)
  
# 3) At the end of the code, as 'mean.value' object, with ->:

data %>% 
  select(`Label`, `SM 39:1;O2`) %>%
  filter(Label == 'N' | Label == 'T') %>%
  group_by(Label) %>%
  summarize_if(is.numeric, mean) ->
  mean.value
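All three assignment variants produce identical objects, which you can verify on a small example. The sketch below assumes the tidyverse is loaded; the tiny tibble is invented for illustration and stands in for the lipidomics data used throughout this subchapter:

```r
library(tidyverse)

# A tiny invented tibble standing in for the lipidomics data
toy <- tibble(Label = c('N', 'N', 'T', 'T'),
              `SM 39:1;O2` = c(1.0, 3.0, 2.0, 4.0))

# Assignment at the beginning (variants 1 and 2)
mean.value <- toy %>%
  group_by(Label) %>%
  summarize_if(is.numeric, mean)

# Assignment at the end (variant 3)
toy %>%
  group_by(Label) %>%
  summarize_if(is.numeric, mean) ->
  mean.value2

# Both objects are the same grouped summary
identical(mean.value, mean.value2)
```

Which variant you pick is a matter of style; assignment at the beginning with <- is the most common convention and the one we follow in this Gitbook.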

You can find the script containing all of the code blocks here:

Now that you know how pipes work, let's actively use them for OMICs analysis in the following chapters!

According to the tidyverse style guide, a space should be left before the pipe symbol. Every step of a pipeline should start on a new line to increase readability, except for very short pipelines. After the first step of the pipeline, each subsequent line should be indented by two spaces. This makes it easier to add further lines of code and less likely that a step of the process is overlooked.
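As a quick illustration of these rules, compare the two equivalent snippets below (a minimal sketch; 'data' is assumed to be the tibble used throughout this subchapter):

```r
# Discouraged: no space before the pipe, whole pipeline crammed into one line
# data%>%group_by(Label)%>%summarize_if(is.numeric,mean)

# Recommended: space before %>%, one step per line, two-space indent
data %>%
  group_by(Label) %>%
  summarize_if(is.numeric, mean)
```

Both versions return the same result; the second is simply far easier to read, extend, and debug.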

Links:
  • The tidyverse style guide on pipes: https://style.tidyverse.org/pipes.html
  • magrittr documentation: https://magrittr.tidyverse.org/
  • An Introduction to the Pipe in R (Towards Data Science)
  • Beginner's Guide to Piping Data in R (The CodeHub)
  • 11 Using Pipes (R for Epidemiology)
  • Pipe — %>% (magrittr reference page)
Attached script: Pipes in R via magrittr (dplyr - tidyverse).R (all blocks of code from this subchapter gathered in one R script).
[Figure: Output of the first pipeline.]
[Figure: Console after executing the code for PIPELINE 1 and 2: A/ Executed code and output of PIPELINE 1 (means of SM 39:1;O2 concentrations for N and T). B/ Executed code and output of PIPELINE 2 (means and standard deviations of selected lipid concentrations for N and T).]