Omics data visualization in R and Python

Dendrograms

Metabolites and lipids multivariate statistical analysis in R

Last updated 3 months ago

Dendrograms are a simple way to depict the hierarchical clustering (HC) of samples. Typically, lipids and metabolites for hierarchical clustering are pre-selected using univariate or multivariate statistical analyses. Dendrograms are frequently supplemented with heat maps that show trends across the selected lipids and metabolites, which helps in interpreting the clustering.
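
As a minimal sketch of the idea, the base R code below clusters a small made-up matrix and draws a plain dendrogram. The sample names, lipid names, and values are hypothetical and serve only as an illustration; the Euclidean distance and the Ward.D2 linkage are the same choices used later in this chapter.

```r
# Toy data: 6 hypothetical samples x 3 hypothetical lipids (made-up values).
set.seed(1)
toy <- rbind(
  matrix(rnorm(9, mean = 0), nrow = 3),   # three "control-like" samples
  matrix(rnorm(9, mean = 5), nrow = 3)    # three "case-like" samples
)
rownames(toy) <- c("C1", "C2", "C3", "P1", "P2", "P3")
colnames(toy) <- c("lipid_A", "lipid_B", "lipid_C")

# Pairwise Euclidean distances between samples (rows):
toy.dist <- dist(toy, method = "euclidean")

# Agglomerative hierarchical clustering with Ward's criterion:
toy.hc <- hclust(toy.dist, method = "ward.D2")

# Base R dendrogram; each leaf is one sample:
plot(toy.hc, main = "Toy dendrogram", xlab = "", sub = "")
```

Row names matter here: they become the leaf labels of the tree, which is why the samples are named before the distances are computed.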

Practical applications of dendrograms (examples)

Simple dendrograms (no heat map attached) are still used to present hierarchical clustering (grouping based on similarities and relationships), e.g.,

  • R. Tabassum et al. Genetic architecture of human plasma lipidome and its link to cardiovascular disease. DOI: https://doi.org/10.1038/s41467-019-11954-8 - Fig. 2d.

To enhance their visual appeal, they are often supplemented with additional graphics or displayed using different patterns, e.g., a circular pattern:

  • K. Barrett et al. Fungal secretome profile categorization of CAZymes by function and family corresponds to fungal phylogeny and taxonomy: Example Aspergillus and Penicillium. DOI: https://doi.org/10.1038/s41598-020-61907-1 - Fig. 2.

However, in many lipidomics and metabolomics-oriented manuscripts, dendrograms are enhanced with heat maps:

  • R. Jirásko et al. Altered Plasma, Urine, and Tissue Profiles of Sulfatides and Sphingomyelins in Patients with Renal Cell Carcinoma. DOI: https://doi.org/10.3390/cancers14194622 - Fig. 2A & 2B, Fig. 4A.

  • D. Wolrab et al. Plasma lipidomic profiles of kidney, breast and prostate cancer patients differ from healthy controls. DOI: https://doi.org/10.1038/s41598-021-99586-1 - Fig. 7E, F, G.

  • M. T. Odenkirk et al. Combining Micropunch Histology and Multidimensional Lipidomic Measurements for In-Depth Tissue Mapping. DOI: https://doi.org/10.1021/acsmeasuresciau.1c00035 - Fig. 2 (the circular dendrogram with a heat map is included in the manuscript's abstract, emphasizing the significance of this visualization method for conveying insights; additionally, the ggtree package mentioned below is utilized to create stunning visualizations).

  • M. T. Odenkirk et al. From Prevention to Disease Perturbations: A Multi-Omic Assessment of Exercise and Myocardial Infarctions. DOI: https://doi.org/10.3390/biom11010040 - Fig. 3 (ggtree package in use!)

Dendrograms in R via ggtree

R offers a great package for creating dendrograms called ggtree. We will use it to present the clustering of our PDAC data set based on the lipid species most significantly altered in patients with pancreatic cancer.
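
Before turning to the real data, the pre-selection step can be sketched in base R on made-up values. The snippet is only an illustration under stated assumptions: the group labels "N" and "T", the lipid names, and the concentrations are hypothetical, and base R's wilcox.test() stands in for the rstatix wrapper used in the tutorial code.

```r
# Made-up data: 20 samples in two groups, 4 hypothetical lipids.
set.seed(42)
group <- factor(rep(c("N", "T"), each = 10))
toy <- data.frame(
  lipid_A = c(rnorm(10, 10), rnorm(10, 20)),  # strongly shifted in "T"
  lipid_B = c(rnorm(10, 10), rnorm(10, 10)),  # no shift
  lipid_C = c(rnorm(10, 10), rnorm(10, 12)),  # moderately shifted in "T"
  lipid_D = c(rnorm(10, 10), rnorm(10, 10))   # no shift
)

# One Mann-Whitney U (Wilcoxon rank-sum) test per lipid column:
p.values <- sapply(toy, function(x) wilcox.test(x ~ group)$p.value)

# Names of the two lipids with the smallest (unadjusted) p-values:
top.lipids <- names(sort(p.values))[1:2]
top.lipids
```

The resulting character vector plays the same role as the Lipids vector in the code that follows: it names the columns that are kept for clustering.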

First, we need to prepare our PDAC data set:

# Building ggtree dendrograms in R.
# Calling libraries:
library(tidyverse)
library(ggtree)
library(rstatix)

# Adjusting column type:
data$Label <- as.factor(data$Label)

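# Filtering out patients with pancreatitis (and dropping the now-unused factor level):
data.no.PAN <-
  data %>% 
  filter(Label != "PAN") %>%
  droplevels()
  
# Creating a long tibble from the filtered data (controls vs. PDAC only),
# so that the test below is a two-sample comparison:
data.long <- 
  data.no.PAN %>%
  select(-`Sample Name`) %>%
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations")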
               
# Here, we perform clustering on the 12 most significant lipids from the M-W U test:
Mann.Whitney.test <- 
  data.long %>%
  group_by(Lipids) %>%
  wilcox_test(Concentrations ~ Label, 
              p.adjust.method = 'none')
              
# Separating most significant lipids:
Mann.Whitney.test.head <-
  Mann.Whitney.test %>%
  arrange(p) %>%
  slice_head(n = 12)

Lipids <- Mann.Whitney.test.head$Lipids  

# Creating tibble for hierarchical clustering:
data.selected <-
  data.no.PAN %>%
  select(`Sample Name`,
         Label, 
         all_of(Lipids))
         
# Data log10-transformation and Pareto-scaling:
data.log10 <-
  data.selected %>%
  mutate_if(is.numeric, log10)
  
Pareto.scaling <- function(x) {(x-mean(x))/sqrt(sd(x))}

data.Pareto.scaled <-
  data.log10 %>%
  mutate_if(is.numeric, ~Pareto.scaling(.))
  
# Before computing distances, we MUST name rows according to samples.
# This will be later needed to identify tree branches.
# We can use tidyverse functions: remove_rownames() & column_to_rownames():
data.Pareto.scaled <- 
  data.Pareto.scaled %>% 
  remove_rownames() %>% 
  column_to_rownames(var = "Sample Name")
  
# Now, we can compute the matrix of Euclidean distances between samples (base R functions):
distance <- 
  data.Pareto.scaled %>%
  select(-Label) %>%
  dist(diag = TRUE,
       method = 'euclidean')
       
# Hierarchical clustering using the Ward.D2 algorithm (base R functions):
clustering <- hclust(distance, method = 'ward.D2')
         
# Tibble with columns necessary to create/annotate ggtree branches: 
tip_data <- 
  data.selected %>% 
  select(`Sample Name`, Label)
  
# Selecting colors for the ggtree tips:  
colors <- c("N" = "blue", "T"="red2")

# First, we create ggtree dendrogram. 
# We select a circular shape to save space (206 samples).
# If your data set contains fewer observations, use a classic rectangular shape.
# In this case - change the layout to 'rectangular'.
ggtree(clustering, layout = "circular", size = 0.8) 

We obtain this output:

However, such a dendrogram is difficult to interpret. We will add tips with colors corresponding to our biological groups:

# Adding tips with colors corresponding to biological groups:
ggtree(clustering, layout = "circular", size = 0.5) %<+%    # %<+% ggtree operator used to pass annotations to ggtree. It is specific for ggtree library.
  tip_data +                                 # The annotations are stored in the tip_data.
  geom_tippoint(aes(color = Label)) +      # We add tips using geom_tippoint() and color according to group.
  scale_color_manual(values = colors)     # Scaling of colors.

The modified circular dendrogram:

Now, we clearly see a separation of healthy controls (blue dots) from patients with pancreatic cancer (red dots). Besides the two main branches, which split the samples into clusters corresponding to controls and PDAC patients, we see additional subclusters within both the healthy controls and the PDAC patients. However, it is difficult to explain them if we do not annotate each sample. We can do this through the geom_tiplab() layer:

# Annotating samples in the ggtree through the geom_tiplab():
ggtree(clustering, layout = "circular", size = 0.5) %<+%         
  tip_data +
  geom_tippoint(aes(color = Label)) +
  scale_color_manual(values = colors) +
  geom_tiplab(label=tip_data$`Sample Name`, 
              size = 3, 
              hjust = -0.5, 
              color='black')

The look of the annotated ggtree dendrogram:
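
The separation into two main branches described above can also be checked numerically by cutting the tree and cross-tabulating branch membership against the known groups. The sketch below uses made-up data (hypothetical sample names and values) so that it runs on its own; cutree() and table() are base R functions.

```r
# Made-up example: two well-separated groups of 5 samples each.
set.seed(7)
toy <- rbind(matrix(rnorm(20, mean = 0), nrow = 5),
             matrix(rnorm(20, mean = 6), nrow = 5))
rownames(toy) <- paste0("S", 1:10)
toy.labels <- factor(rep(c("N", "T"), each = 5))

# Same distance and linkage as in the tutorial code:
toy.hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D2")

# Cut the tree into two clusters (the two main branches).
# cutree() returns memberships in the original row order:
branch <- cutree(toy.hc, k = 2)

# Cross-tabulate branch membership against the known groups:
agreement <- table(Branch = branch, Group = toy.labels)
agreement
```

With the objects from this chapter, an analogous call would be table(cutree(clustering, k = 2), data.selected$Label), assuming the row order of data.selected has not been changed since the distances were computed.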

To make our dendrogram more attractive, we can add a heat map around it. Take a look at the code below:

# Adding heat map to the ggtree.
# Create a final tree with tips:
tree <- 
  ggtree(clustering, layout = "circular", size = 0.5) %<+%         
  tip_data +
  geom_tippoint(aes(color = Label)) +
  scale_color_manual(values = c('blue','red2'))  +
  geom_treescale(x = NULL, 
                 y = NULL, 
                 color = "white", 
                 linesize = 1E-100, 
                 fontsize = 1E-100) +
  theme(legend.title = element_text(size = 14),
        legend.text = element_text(size = 14))

  
# Obtain the z-score of the concentrations in 'data.selected' tibble:
z_score <- function(x){(x-mean(x))/(sd(x))}

heatmap <-
  data.selected %>%
  select(-Label) %>%
  mutate_if(is.numeric, log10) %>%
  mutate_if(is.numeric, ~z_score(.))
  
# Name rows using `Sample Name` column:
heatmap <- column_to_rownames(heatmap, "Sample Name")

# Select colors for scale_fill_gradientn():
colors <- c("#002060", "#0d78ca", "#00e8f0", "white", "#FF4D4D", "red", "#600000")

# Add the heatmap using the gheatmap() function:
tree <- gheatmap(tree,                    # Indicate the ggtree object.
                 heatmap[,1:5],           # Select columns for the heat map (here: the first 5 lipids).
                 offset = 0.2,            # Offset of the heat map from the tree.
                 width = 0.4,             # Total width of the heat map relative to the width of the tree.
                 colnames = TRUE,         # Should lipid shorthand notations be presented?
                 colnames_angle = -37,    # The angle for lipid shorthand notations.
                 hjust = -0.1,            # Horizontal justification of the column labels.
                 font.size = 4) +         # The font size of the column labels.
  scale_fill_gradientn(colours = colors, limits = c(-5,5)) +
  labs(fill = "Z-score") 
  
# We used classic scale_fill_gradientn() for heat map fill color.
# The same way as previously - nothing new here. 
# Specify vector with fill colors and limits for the continuous color scaling.
# We want the legend title "Z-score".
  
# To keep the lipid shorthand notations readable, open the tree and rotate it:
final.tree <- 
  open_tree(tree, 25) %>% 
  rotate_tree(50)
  
# Alternatively, you can use the beautiful palette gsea through scale_fill_gsea():
# Create a tree with tips:
tree <- 
  ggtree(clustering, layout = "circular", size = 0.5) %<+%         
  tip_data +
  geom_tippoint(aes(color = Label)) +
  scale_color_manual(values = c('blue','red2'))  +
  geom_treescale(x = NULL, 
                 y = NULL, 
                 color = "white", 
                 linesize = 1E-100, 
                 fontsize = 1E-100) +
  theme(legend.title = element_text(size = 14),
        legend.text = element_text(size = 14))

# Add heat map with gsea continuous color scaling:
tree <- gheatmap(tree, 
                 heatmap[,1:5], 
                 offset = 0.2, 
                 width = 0.4, 
                 colnames = TRUE, 
                 colnames_angle = -37, 
                 hjust=-0.1, 
                 font.size = 4) +
  ggsci::scale_fill_gsea(limits = c(-5,5)) +
  labs(fill = "Z-score")

# Open and rotate the tree:
final.tree <- 
  open_tree(tree, 25) %>% 
  rotate_tree(50)
  
# To preview the graphics before exporting them, we will need the ggimage package.
# A direct export from RStudio may result in low-quality graphics.
# If you did not install ggimage for exporting bar charts, it is installed now:
if (!requireNamespace("ggimage", quietly = TRUE)) install.packages("ggimage")

# Call library:
library(ggimage)

# Generate a preview and optimize the size of fonts and ggtree tips:
ggpreview(plot = final.tree,    # The object that you want to preview.
          width = 800,          # Width in px.
          height = 600,         # Height in px.
          units = "px",         # Unit of size - px.
          dpi = 300,            # Resolution (dots per inch).
          scale = 6)            # You may need to use a different scale.
                   
# Save the plot using ggsave (ggplot2 package - tidyverse):
ggsave(plot = final.tree,    # The R object to be saved.
  path = "C:/...",           # Here, introduce the path where the plot should be saved.
  device = "jpeg",           # Format.
  filename = "Dendrogram from ggtree - scale2.jpeg",  # File name in wd or in the selected path.
  width = 800,
  height = 600,
  units = "px",
  dpi = 300,
  scale = 6)
  
# This way, we obtain sharp, correctly scaled output.

The ggtree dendrograms with heat maps:
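
One detail of the heat map code above is worth checking on your own data: the fill scale uses limits = c(-5, 5), and by default ggplot2 continuous scales treat values outside the limits as missing, so such cells are drawn in the NA color. The sketch below uses made-up concentrations (hypothetical values with one extreme sample) to show how to count out-of-range z-scores and, as one possible option, clamp them to the limits with base R before plotting.

```r
# Same z-score helper as in the tutorial code:
z_score <- function(x) {(x - mean(x)) / sd(x)}

# Made-up concentrations with one extreme value (hypothetical data):
conc <- c(rep(c(9, 10, 11), times = 20), 1000)
z <- z_score(log10(conc))

# How many values fall outside the color scale limits c(-5, 5)?
limits <- c(-5, 5)
n.outside <- sum(z < limits[1] | z > limits[2])
n.outside

# One option: clamp values to the limits before plotting, so extreme cells
# take the end colors of the palette instead of the NA color:
z.clamped <- pmin(pmax(z, limits[1]), limits[2])
range(z.clamped)
```

An alternative that leaves the data untouched is to pass oob = scales::squish to the fill scale, which squeezes out-of-range values to the nearest limit at plotting time.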

For more designs, we encourage you to check the articles on ggtree as well as the extensive and informative vignette of the package:

  • Preface | Data Integration, Manipulation and Visualization of Phylogenetic Trees - the full vignette of ggtree.

The Bioconductor website of the package:

  • ggtree: tree visualization and annotation - the ggtree package on Bioconductor.

DOIs of the articles listed under "Practical applications of dendrograms (examples)", in order of appearance:

  • https://doi.org/10.1038/s41467-019-11954-8
  • https://doi.org/10.1038/s41598-020-61907-1
  • https://doi.org/10.3390/cancers14194622
  • https://doi.org/10.1038/s41598-021-99586-1
  • https://doi.org/10.1021/acsmeasuresciau.1c00035
  • https://doi.org/10.3390/biom11010040

Figure captions, in order of appearance:

  • A raw circular dendrogram created using the ggtree library.
  • The ggtree circular dendrogram presenting the Ward.D2 clustering of samples based on Euclidean distances computed from normalized concentrations of the 12 most significant lipids from the two-sample Mann-Whitney test.
  • The ggtree dendrogram with sample names added to branches.
  • The ggtree dendrograms present the Ward.D2 clustering of samples with the Euclidean distance measure computed from normalized concentrations of 12 selected lipids. The dendrogram is surrounded by a heat map presenting differences between controls and PDAC cases in the levels of the 5 most significant lipids according to the Mann-Whitney U test.