Omics data visualization in R and Python

The 'for' loop in R (advanced)

Useful tricks and features in OMICs mining

Occasionally, in this GitBook, we will need to iterate over the components of a data frame, vector, matrix, or list because not all functions or libraries we present are compatible with the tidyverse solutions.

Note!

If you are a beginner, you can skip this chapter and return to it once we use loops in the GitBook (e.g., see Metabolites and Lipids Univariate Statistics in R).

Here, we will briefly introduce the simplest loop in R, i.e., the 'for' loop. The for loop is useful when you must repeat specific lines (or blocks) of code for each element of a vector or other object. With the solutions provided by the tidyverse, explicit loops are now less common in R code. However, they remain essential in many other programming languages.

Let's analyze how the for loop works. Later, we will use the loop to compute the mean, standard deviation, median, and interquartile range for all columns with lipid concentrations of our PDAC data frame.

The for-loop construction is the following:

# The 'for' loop construction in R:
# Note: this is a schematic template; 'vec' is a placeholder, so the code will not run as is.

for (i in vec) {         # 'vec' supplies the values that 'i' takes, from first to last.

  # Loop body: block of commands (statements)

}

# The 'i' represents values in a 'vec' vector: from the first to the last value of 'vec'. 
# The loop takes every 'i' from 'vec' and evaluates the block of commands for it. 
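
As a minimal runnable instance of this template, the sketch below loops over a short character vector. The lipid names and the helper vector `visited` are illustrative assumptions, not part of the PDAC data set.

```r
# Illustrative vector; the names are placeholders, not PDAC columns.
vec <- c("CE 18:2", "CE 18:1", "CE 16:0")

# Hypothetical helper that records the order in which elements are visited.
visited <- character(0)

for (i in vec) {
  # On each pass, 'i' holds the current element of 'vec'.
  message("Current element: ", i)
  visited <- c(visited, i)
}

# seq_along() iterates over positions instead of values,
# which is useful when the index itself is needed.
for (k in seq_along(vec)) {
  message(k, ": ", vec[k])
}
```

After the first loop finishes, `visited` holds the elements of `vec` in their original order, which confirms that the loop walks the vector from the first to the last value.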

In a simplified example, we want to use the for loop to recalculate the concentration of CE 18:2 from pmol/ml of plasma to nmol/ml of plasma:

# Recalculating plasma concentration of CE 18:2 from pmol/ml to nmol/ml
# First, we create a vector with CE 18:2 concentrations in our patients' plasma samples:
concentrations <- c(2100000, 1590231, 1891203, 1999142, 1567343)

# Next, we create the loop:
for (i in concentrations) {
  value <- i / 1000  # Convert every i to nmol/ml and store it as 'value'.
  print(paste(value, "nmol/ml")) # Print the outcomes in the R console with a new unit.
}

# Comments:
# For every entry in the 'concentrations' vector, marked as i, 
# recalculate the unit from pmol/ml to nmol/ml (divide by 1000),
# and store as 'value'.
# Print the stored 'value' with the new unit, "nmol/ml", in the R console.
## Here, we used the paste() function to join every 'value' with the new unit "nmol/ml".

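The output in the R console:

[1] "2100 nmol/ml"
[1] "1590.231 nmol/ml"
[1] "1891.203 nmol/ml"
[1] "1999.142 nmol/ml"
[1] "1567.343 nmol/ml"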
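
Printing is convenient for inspection, but the converted values are usually needed later in the analysis. A possible variation, sketched here with the hypothetical names `converted` and `vectorized`, stores each result in a pre-allocated vector and compares it with R's vectorized division:

```r
concentrations <- c(2100000, 1590231, 1891203, 1999142, 1567343)

# Pre-allocate the output vector (the name 'converted' is illustrative).
converted <- numeric(length(concentrations))

for (k in seq_along(concentrations)) {
  converted[k] <- concentrations[k] / 1000  # pmol/ml -> nmol/ml
}

# Arithmetic in R is vectorized, so the same result needs no loop:
vectorized <- concentrations / 1000
all.equal(converted, vectorized)
```

Iterating over positions with `seq_along()` rather than over values makes it possible to write each result back to the matching slot of the output vector.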

Let's try a more complicated for loop. Imagine that the tidyverse tools have not yet been developed, and you plan to use a loop to compute the mean concentration across all samples, as well as the standard deviation, median, and interquartile range, for every lipid in the PDAC data set. For the computations, we will rely on base R functions: mean(), sd(), median(), and IQR() (note the capital letters; R is case-sensitive).
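
Before these functions go into a loop, it may help to see them on a single vector. The values below are made up for illustration; `na.rm = TRUE` tells each function to drop missing values before computing.

```r
# Hypothetical concentrations with one missing value.
x <- c(1, 2, 3, 4, NA)

mean(x)                  # NA, because the missing value propagates
mean(x, na.rm = TRUE)    # 2.5
sd(x, na.rm = TRUE)      # about 1.29
median(x, na.rm = TRUE)  # 2.5
IQR(x, na.rm = TRUE)     # 1.5 (default quantile algorithm, type 7)
```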

Note!

We will use the PDAC data set (Lipidomics_dataset.xlsx) from Introduction: Example data sets.

Here is the code with explanations:

# We load the tidyverse package as we will need one of its functions in the loop:
library(tidyverse)

# First, we load the PDAC data set into R:
data <- readxl::read_xlsx(file.choose())

# Check data structure
str(data)

# Adjust column types, if necessary. Here, `Label` must be <fct>:
data$Label <- as.factor(data$Label)

# Next, we initialize an empty list for the results.
# It is common practice to define, before the loop starts, where its results will be stored.
# First, we will use a list (a widespread approach).
results <- list()

# We extract the names of the columns containing lipid concentrations (columns 3 to 129).
# The resulting character vector defines what the loop iterates over.
lipids <- colnames(data)[3:129]

# Now, it is time to prepare a for loop.
# Step 1: we define the range:
for (i in lipids) {} # 'i' will be every lipid name from the 'lipids' vector

# Step 2: we start building the loop body:
for (i in lipids) {
  lipid_concentrations <- data[[i]]
}
# This line extracts the content of column 'i', i.e., the concentrations of one lipid.

# Step 3: we indicate where the results will be stored (results list from above).
for (i in lipids) {
  lipid_concentrations <- data[[i]]
  
  results[[i]] <- tibble()
}    
    
# This creates a list with one (still empty) tibble per lipid from the `data` tibble.

# Step 4: now we can start computing statistics for every column.
## First, we name the columns of the tibble created at every iteration.
## Note: the '...' entries are placeholders filled in at the next step; this snippet will not run as is.
for (i in lipids) {
  lipid_concentrations <- data[[i]]
  
  results[[i]] <- tibble(
    Lipid = ...,
    Mean = ...,
    SD = ...,
    Median = ...,
    IQR = ...
  )
}

## Finally, we select base R functions for computations:
for (i in lipids) {
  lipid_concentrations <- data[[i]]
  
  results[[i]] <- tibble( # Tibble created at every iteration is placed on the list.
    Lipid = i, # Lipid for which the computations were done.
    Mean = mean(lipid_concentrations, na.rm = TRUE), # mean value + we skip missing values
    SD = sd(lipid_concentrations, na.rm = TRUE), # SD + we skip missing values
    Median = median(lipid_concentrations, na.rm = TRUE), # median + we skip missing values
    IQR = IQR(lipid_concentrations, na.rm = TRUE) # IQR + we skip missing values
  )
}

# Again, the loop will produce a list of tibbles for every lipid from `data` tibble.
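
The loop above relies on the hard-coded range `3:129`, which holds only as long as the column layout does not change. One possible alternative, shown on an invented toy tibble, is to treat every numeric column as a lipid column. This assumes that all non-lipid columns (sample names, labels) are non-numeric, which should be checked against the real data; the names `toy` and `toy_lipids` are illustrative.

```r
library(tibble)

# Toy stand-in for the PDAC tibble; names and values are invented.
toy <- tibble(
  Sample = c("s1", "s2"),
  Label = factor(c("N", "T")),
  `CE 16:0` = c(10, 12),
  `CE 18:2` = c(100, 110)
)

# Flag the columns whose contents are numeric (factors and text are not).
is_numeric_col <- vapply(toy, is.numeric, logical(1))

# Keep only the names of the numeric columns.
toy_lipids <- names(toy)[is_numeric_col]
toy_lipids
```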

Though our computations are finished, the results would not be easy to analyze. For this reason, we will rearrange the list of tibbles into one tibble. We can do it in one line of code:

# Obtaining final tibble with results (single tibble for all lipids):
summary.stats <- bind_rows(results)

# Printing the results in the R console:
print(summary.stats)

Outcome:
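
Because the real output depends on the PDAC file, the list-then-bind pattern can also be tried on a small invented tibble. In this sketch, all object names carry a `toy_` prefix and all values are made up; only the pattern mirrors the code above.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the PDAC tibble; names and values are invented.
toy_data <- tibble(
  Sample = c("s1", "s2", "s3", "s4"),
  Label = factor(c("N", "N", "T", "T")),
  `CE 16:0` = c(10, 12, NA, 14),
  `CE 18:2` = c(100, 110, 120, 130)
)

toy_lipids <- colnames(toy_data)[3:4]
toy_results <- list()

for (i in toy_lipids) {
  values <- toy_data[[i]]
  # One single-row tibble per lipid, stored under the lipid's name.
  toy_results[[i]] <- tibble(
    Lipid = i,
    Mean = mean(values, na.rm = TRUE),
    SD = sd(values, na.rm = TRUE),
    Median = median(values, na.rm = TRUE),
    IQR = IQR(values, na.rm = TRUE)
  )
}

# Stack the list elements: one row per lipid, five columns.
toy_stats <- bind_rows(toy_results)
print(toy_stats)
```

The resulting tibble has one row per lipid, which is the shape `summary.stats` takes for the full data set.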

Note!

The data can be assembled into one tibble directly within the for loop. Instead of creating a `results` list that stores every output as a list element, we can create a `results` tibble and add a new row to it after every iteration. Take a look at this slight modification:

# Computing statistics through for loop and creating ready tibble with results.
# This time in one go.
# Remember that you need to call tidyverse to use the tibble() function.

# Start - end range for the loop.
lipids <- colnames(data)[3:129]
 
# Now, we create a tibble for our results (tidyverse library):
results <- tibble(Lipid = character(0), 
                  Mean = numeric(0), 
                  SD = numeric(0), 
                  Median = numeric(0), 
                  IQR = numeric(0)
                  )

# Now our loop with a gentle modification:
for (i in lipids) {
  lipid_concentrations <- data[[i]]
  
  # NOTE: `results` is no longer a list, so the tibble from each iteration cannot be stored as a list element.
  results_col <- tibble(
    Lipid = i, 
    Mean = mean(lipid_concentrations, na.rm = TRUE), 
    SD = sd(lipid_concentrations, na.rm = TRUE), 
    Median = median(lipid_concentrations, na.rm = TRUE), 
    IQR = IQR(lipid_concentrations, na.rm = TRUE) 
  )
  # So, we bind the rows of `results` tibble from above with a tibble created at every iteration.
  results <- bind_rows(results, results_col) 
}

# We can finally print the final version of `results` after all iterations:
print(results)

# Note!
# The `results` tibble should have the same column names and types as `results_col`:
# bind_rows() fills columns missing on one side with NA and fails on incompatible types.

This is just one example of how tidyverse simplified scripting in R - as you see, the loops can get complicated (and we haven't even started computing the results for every experimental group separately!).

Look at the following block of code (here, we will need two for loops):

# Computing descriptive statistics for PDAC data set via two 'for' loops.

# We create an empty tibble for our results and define the column types up front.
# Remember that the tibble() function comes from the tidyverse collection!
results <- tibble(Lipid = character(0), 
                  Group = character(0), 
                  Mean = numeric(0), 
                  SD = numeric(0), 
                  Median = numeric(0), 
                  IQR = numeric(0)
                  )
                  
# Now, we can open the first loop to iterate through all lipid concentrations in `data`:
# Loop through each lipid (here we will use column number, i.e., from 3 to 129):
for (i in 3:129) {}

# Now, we extract lipid concentrations from every column:
for (i in 3:129) {
  # Extract the lipid concentrations from column "i"
  lipid_conc <- data[[i]]
}

# We open a second for loop to iterate through our groups: N, T, PAN:
for (i in 3:129) {
  lipid_conc <- data[[i]]
  
  # Iterate through the groups (N, T, PAN)
  for (group in c("N", "T", "PAN")) {
  }
}

# We extract concentrations for the current group:
for (i in 3:129) {
  lipid_conc <- data[[i]]
  
  for (group in c("N", "T", "PAN")) {
    # Keep only the concentrations of samples whose Label equals the current group
    group_conc <- lipid_conc[data$Label == group] 
    
  }
}

# Now, we can start computations (FINAL LOOP):
for (i in 3:129) {
  lipid_conc <- data[[i]]
  
  for (group in c("N", "T", "PAN")) {
    group_conc <- lipid_conc[data$Label == group]  
    
    # Calculate statistics for the current lipid in the current group and store results in lipid_result
    lipid_result <- tibble(
      Lipid = colnames(data)[i],  # Lipid name (column name from `data`)
      Group = group,              # Group name (N, T, PAN)
      Mean = mean(group_conc, na.rm = TRUE), # Compute mean (skip NAs)
      SD = sd(group_conc, na.rm = TRUE), # Compute SD (skip NAs)
      Median = median(group_conc, na.rm = TRUE), # Compute median (skip NAs)
      IQR = IQR(group_conc, na.rm = TRUE) # Compute IQR (skip NAs)
    )
    
    # Glue the results from every iteration with our results tibble created above.
    # Every iteration creates one new row to be added to the `results`.
    results <- bind_rows(results, lipid_result)
  }
}

# Finally, we can print the results:
print(results)

Output:
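
For comparison, the same grouped summary can be sketched without explicit loops using the tidyverse. The toy tibble below is invented for illustration (it is not the PDAC data), and the combination of `pivot_longer()`, `group_by()`, and `summarise()` is one possible approach rather than the authors' code.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy stand-in for the PDAC tibble; names and values are invented.
toy_data <- tibble(
  Sample = paste0("s", 1:6),
  Label = factor(c("N", "N", "T", "T", "PAN", "PAN")),
  `CE 16:0` = c(10, 12, 20, 22, 30, NA),
  `CE 18:2` = c(100, 110, 120, 130, 140, 150)
)

toy_grouped <- toy_data %>%
  # Long format: one row per sample and lipid.
  pivot_longer(cols = -c(Sample, Label),
               names_to = "Lipid",
               values_to = "Concentration") %>%
  # One summary row per lipid and experimental group.
  group_by(Lipid, Label) %>%
  summarise(
    Mean = mean(Concentration, na.rm = TRUE),
    SD = sd(Concentration, na.rm = TRUE),
    Median = median(Concentration, na.rm = TRUE),
    IQR = IQR(Concentration, na.rm = TRUE),
    .groups = "drop"
  )

print(toy_grouped)
```

Here the grouping replaces both loops: the result has one row per lipid and group combination. A group left with a single non-missing value gets `NA` for the SD, which is also what `sd()` returns inside the loop version.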

For more information about loops used in R, please refer to the following books:



Garrett Grolemund: "Hands-On Programming with R", Chapter 11: Loops.
Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau: "An Introduction to R", Section 7.5: Loops.
Data set file: Lipidomics_dataset.xlsx (348 KB)
Figure: The results obtained from the for loop used to recalculate the concentrations of CE 18:2 from pmol/ml to nmol/ml.
Figure: The results derived from the for loop used to calculate the mean, standard deviation (SD), median, and interquartile range (IQR) of lipid concentrations from the PDAC dataset.
Figure: The results derived from two for loops used to calculate the mean, standard deviation (SD), median, and interquartile range (IQR) of lipid concentrations from the PDAC dataset according to experimental groups (N, T, PAN).