Dendrograms

Metabolites and lipids multivariate statistical analysis in R

Dendrograms are a simple way to depict the hierarchical clustering (HC) of samples. Typically, the lipids and metabolites used for hierarchical clustering are pre-selected by univariate or multivariate statistical analyses. Dendrograms are frequently combined with heat maps, which show trends across the selected lipids and metabolites and help interpret the clustering.
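
As a minimal sketch of what hierarchical clustering produces, the base R example below clusters six made-up samples and draws a plain dendrogram. The toy matrix, its sample names, and the choice of two clusters are illustrative assumptions and are not part of the PDAC workflow used later on this page.

```r
# Toy data (hypothetical values): 6 samples x 2 lipids, forming two obvious groups.
toy <- matrix(c(1.0, 1.1, 0.9, 5.0, 5.2, 4.9,
                2.0, 2.1, 1.9, 8.0, 8.1, 7.9),
              ncol = 2,
              dimnames = list(paste0("S", 1:6), c("Lipid A", "Lipid B")))

# Euclidean distances between samples, followed by Ward clustering (base R):
toy.clustering <- hclust(dist(toy, method = "euclidean"), method = "ward.D2")

# Plain base R dendrogram:
plot(toy.clustering, main = "Toy dendrogram", xlab = "", sub = "")

# Cut the tree into two clusters and inspect the membership of each sample:
toy.groups <- cutree(toy.clustering, k = 2)
print(toy.groups)
```

In this toy case, samples S1 to S3 end up in one cluster and S4 to S6 in the other, which is the same kind of grouping that a dendrogram displays graphically.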

Practical applications of dendrograms (examples)

Simple dendrograms (no heat map attached) are still used to present hierarchical clustering (grouping based on similarities and relationships), e.g.,

R. Tabassum et al. Genetic architecture of human plasma lipidome and its link to cardiovascular disease. DOI: https://doi.org/10.1038/s41467-019-11954-8 - Fig. 2d.

To enhance their visual appeal, they are often supplemented with additional graphics or displayed using different patterns, e.g., a circular pattern:

K. Barrett et al. Fungal secretome profile categorization of CAZymes by function and family corresponds to fungal phylogeny and taxonomy: Example Aspergillus and Penicillium. DOI: https://doi.org/10.1038/s41598-020-61907-1 - Fig. 2

However, in many lipidomics and metabolomics-oriented manuscripts, dendrograms are enhanced with heat maps:

R. Jirásko et al. Altered Plasma, Urine, and Tissue Profiles of Sulfatides and Sphingomyelins in Patients with Renal Cell Carcinoma. DOI: https://doi.org/10.3390/cancers14194622 - Fig. 2A & 2B, Fig. 4A.
D. Wolrab et al. Plasma lipidomic profiles of kidney, breast and prostate cancer patients differ from healthy controls. DOI: https://doi.org/10.1038/s41598-021-99586-1 - Fig. 7E, F, G.
M. T. Odenkirk et al. Combining Micropunch Histology and Multidimensional Lipidomic Measurements for In-Depth Tissue Mapping. DOI: https://doi.org/10.1021/acsmeasuresciau.1c00035 - Fig. 2 (the circular dendrogram with a heat map is included in the manuscript's abstract, emphasizing the significance of this visualization method for conveying insights; additionally, the ggtree package mentioned below is utilized to create stunning visualizations).
M. T. Odenkirk et al. From Prevention to Disease Perturbations: A Multi-Omic Assessment of Exercise and Myocardial Infarctions. DOI: https://doi.org/10.3390/biom11010040 - Fig. 3 (ggtree package in use!)

Dendrograms in R via ggtree

R offers a great package for creating dendrograms called ggtree. We will use it to present the clustering of our PDAC data set based on the most significantly altered lipid species in patients with pancreatic cancer.
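
ggtree is distributed through Bioconductor rather than CRAN, so a plain install.packages("ggtree") will typically not find it. A minimal installation sketch, assuming an internet connection and a reasonably recent R version:

```r
# Install BiocManager from CRAN if it is not available yet:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install ggtree from Bioconductor:
BiocManager::install("ggtree")
```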

First, we need to prepare our PDAC data set:

# Building ggtree dendrograms in R.
# Calling libraries:
library(tidyverse)
library(ggtree)
library(rstatix)

# Adjusting column type:
data$Label <- as.factor(data$Label)

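# Filtering out patients with pancreatitis (droplevels() removes the unused "PAN" level):
data.no.PAN <-
  data %>% 
  filter(Label != "PAN") %>%
  droplevels()
  
# Creating a long tibble from the filtered data, so that only N and T are compared:
data.long <- 
  data.no.PAN %>%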
  select(-`Sample Name`) %>%
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations")
               
# Mann-Whitney U test for every lipid; the 12 most significant lipids will be used for clustering:
Mann.Whitney.test <- 
  data.long %>%
  group_by(Lipids) %>%
  wilcox_test(Concentrations ~ Label, 
              p.adjust.method = 'none')
              
# Separating most significant lipids:
Mann.Whitney.test.head <-
  Mann.Whitney.test %>%
  arrange(p) %>%
  slice_head(n = 12)

Lipids <- Mann.Whitney.test.head$Lipids  

# Creating tibble for hierarchical clustering:
data.selected <-
  data.no.PAN %>%
  select(`Sample Name`,
         Label, 
         all_of(Lipids))
         
# Data log10-transformation and Pareto-scaling:
data.log10 <-
  data.selected %>%
  mutate_if(is.numeric, log10)
  
Pareto.scaling <- function(x) {(x-mean(x))/sqrt(sd(x))}

data.Pareto.scaled <-
  data.log10 %>%
  mutate_if(is.numeric, ~Pareto.scaling(.))
  
# Before computing distances, we MUST name rows according to samples.
# This will be later needed to identify tree branches.
# We can use tidyverse functions: remove_rownames() & column_to_rownames():
data.Pareto.scaled <- 
  data.Pareto.scaled %>% 
  remove_rownames() %>% 
  column_to_rownames(var = "Sample Name")
  
# Now, we can compute matrix of Euclidean distances between samples (base R functions):
distance <- 
  data.Pareto.scaled %>%
  select(-Label) %>%
  dist(diag = TRUE,
       method = 'euclidean')
       
# Hierarchical clustering using Ward.D2 algorithm (base R functions):
clustering <- hclust(distance, method = 'ward.D2')
         
# Tibble with columns necessary to create/annotate ggtree branches: 
tip_data <- 
  data.selected %>% 
  select(`Sample Name`, Label)
  
# Selecting colors for the ggtree tips:  
colors <- c("N" = "blue", "T"="red2")

# First, we create ggtree dendrogram. 
# We select a circular shape to save space (206 samples).
# If your data set contains fewer observations, use a classic rectangular shape.
# In this case - change the layout to 'rectangular'.
ggtree(clustering, layout = "circular", size = 0.8)

We obtain this output:

However, such a dendrogram is difficult to interpret. We will add tips with colors corresponding to our biological groups:

# Adding tips with colors corresponding to biological groups:
ggtree(clustering, layout = "circular", size = 0.5) %<+%    # %<+% ggtree operator used to pass annotations to ggtree. It is specific for ggtree library.
  tip_data +                                 # The annotations are stored in the tip_data.
  geom_tippoint(aes(color = Label)) +      # We add tips using geom_tippoint() and color according to group.
  scale_color_manual(values = colors)     # Scaling of colors.

The modified circular dendrogram:

Now, we clearly see a separation of healthy controls (blue dots) from patients with pancreatic cancer (red dots). Besides the two main branches, which split the samples into clusters corresponding to controls and PDAC patients, we see additional subclusters within both groups. However, these are difficult to explain unless each sample is annotated. We can do this through the geom_tiplab() layer:

# Annotating samples in the ggtree through the geom_tiplab():
ggtree(clustering, layout = "circular", size = 0.5) %<+%         
  tip_data +
  geom_tippoint(aes(color = Label)) +
  scale_color_manual(values = colors) +
  geom_tiplab(size = 3,        # Tip labels are taken from the row names, i.e., the sample names.
              hjust = -0.5, 
              color = 'black')

The look of the annotated ggtree dendrogram:
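
The separation seen in the dendrogram can also be checked numerically by cutting the tree into two clusters with cutree() and cross-tabulating cluster membership against the group labels. The sketch below uses simulated data rather than the PDAC data set; all object names starting with toy. and all values are hypothetical.

```r
set.seed(1)

# Simulated (hypothetical) log-scale data: 10 controls "N" and 10 patients "T", 3 lipids.
toy.label <- rep(c("N", "T"), each = 10)
toy.data <- rbind(matrix(rnorm(30, mean = 0, sd = 0.3), ncol = 3),
                  matrix(rnorm(30, mean = 3, sd = 0.3), ncol = 3))
rownames(toy.data) <- paste0("Sample_", 1:20)

# Same steps as in the workflow: Euclidean distances, then Ward.D2 clustering.
toy.clustering <- hclust(dist(toy.data, method = "euclidean"), method = "ward.D2")

# Cut the tree into two clusters:
toy.cluster <- cutree(toy.clustering, k = 2)

# Cross-tabulation of cluster membership and biological group:
toy.table <- table(Cluster = toy.cluster, Label = toy.label)
print(toy.table)
```

With the objects from this page, the analogous call would be table(cutree(clustering, k = 2), data.selected$Label), assuming the rows of data.selected are in the same order as the rows used to compute the distance matrix.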

To make our dendrogram more attractive, we can add a heat map around it. Take a look at the code below:

# Adding heat map to the ggtree.
# Create a final tree with tips:
tree <- 
  ggtree(clustering, layout = "circular", size = 0.5) %<+%         
  tip_data +
  geom_tippoint(aes(color = Label)) +
  scale_color_manual(values = c('blue','red2'))  +
  geom_treescale(x = NULL, 
                 y = NULL, 
                 color = "white", 
                 linesize = 1E-100, 
                 fontsize = 1E-100) +
  theme(legend.title = element_text(size = 14),
        legend.text = element_text(size = 14))

  
# Obtain the z-score of the concentrations in 'data.selected' tibble:
z_score <- function(x){(x-mean(x))/(sd(x))}

heatmap <-
  data.selected %>%
  select(-Label) %>%
  mutate_if(is.numeric, log10) %>%
  mutate_if(is.numeric, ~z_score(.))
  
# Name rows using `Sample Name` column:
heatmap <- column_to_rownames(heatmap, "Sample Name")

# Select colors for scale_fill_gradientn():
colors <- c("#002060", "#0d78ca", "#00e8f0", "white", "#FF4D4D", "red", "#600000")

# Add the heatmap using the gheatmap() function:
tree <- gheatmap(tree,                    # Indicate the ggtree object.
                 heatmap[,1:5],           # Select columns for the heat map.
                 offset = 0.2,            # Distance between the tree and the heat map.
                 width = 0.4,             # Total width of the heat map relative to the width of the tree.
                 colnames = TRUE,         # Should lipid shorthand notations be presented?
                 colnames_angle = -37,    # The angle for lipid shorthand notations.
                 hjust = -0.1,            # Horizontal justification of the labels.
                 font.size = 4) +         # The font size of labels.
  scale_fill_gradientn(colours = colors, limits = c(-5,5)) +
  labs(fill = "Z-score") 
  
# We used classic scale_fill_gradientn() for heat map fill color.
# The same way as previously - nothing new here. 
# Specify vector with fill colors and limits for the continuous color scaling.
# We want the legend title "Z-score".
  
# To keep the lipid shorthand notations readable, open the tree and rotate it:
final.tree <- 
  open_tree(tree, 25) %>% 
  rotate_tree(50)
  
# Alternatively, you can use the GSEA palette from the ggsci package through scale_fill_gsea():
# Create a tree with tips:
tree <- 
  ggtree(clustering, layout = "circular", size = 0.5) %<+%         
  tip_data +
  geom_tippoint(aes(color = Label)) +
  scale_color_manual(values = c('blue','red2'))  +
  geom_treescale(x = NULL, 
                 y = NULL, 
                 color = "white", 
                 linesize = 1E-100, 
                 fontsize = 1E-100) +
  theme(legend.title = element_text(size = 14),
        legend.text = element_text(size = 14))

# Add heat map with gsea continuous color scaling:
tree <- gheatmap(tree, 
                 heatmap[,1:5], 
                 offset = 0.2, 
                 width = 0.4, 
                 colnames = TRUE, 
                 colnames_angle = -37, 
                 hjust=-0.1, 
                 font.size = 4) +
  ggsci::scale_fill_gsea(limits = c(-5,5)) +
  labs(fill = "Z-score")

# Open and rotate the tree:
final.tree <- 
  open_tree(tree, 25) %>% 
  rotate_tree(50)
  
# To preview the plot before exporting it, we will use ggpreview() from the ggimage package.
# A direct export from RStudio may result in low-quality graphics.
# If you did not install the ggimage for exporting bar charts, you'll need it now:
install.packages("ggimage")

# Call library:
library(ggimage)

# Generate a preview to optimize the size of fonts and ggtree tips:
ggpreview(plot = final.tree,  # The object that you want to preview.
          width = 800,        # Width in px.
          height = 600,       # Height in px.
          units = "px",       # Units of width and height (px).
          dpi = 300,          # Resolution.
          scale = 6)          # You may need to use a different scale.
                   
# Save the plot using ggsave (ggplot2 package - tidyverse):
ggsave(plot = final.tree,    # The R object to be saved.
  path = "C:/...",           # Here, introduce the path where the plot should be saved.
  device = "jpeg",           # Format.
  filename = "Dendrogram from ggtree - scale2.jpeg",  # File name in wd or in the selected path.
  width = 800,
  height = 600,
  units = "px",
  dpi = 300,
  scale = 6)
  
# This way, we obtain sharp, correctly scaled output.
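
The workflow above scales the data in two ways: Pareto scaling before computing distances, and z-scores for the heat map. As a small illustration on made-up numbers (the vector x is hypothetical), Pareto scaling leaves each feature with a standard deviation of sqrt(sd(x)), so features with larger variance keep somewhat more weight, whereas the z-score forces every feature to unit standard deviation:

```r
# The two scaling functions as defined in this section:
Pareto.scaling <- function(x) {(x - mean(x)) / sqrt(sd(x))}
z_score <- function(x) {(x - mean(x)) / sd(x)}

# Hypothetical log10-transformed concentrations of one lipid:
x <- c(1.2, 1.5, 1.9, 2.4, 3.0)

pareto <- Pareto.scaling(x)
z <- z_score(x)

# Both results are mean-centered; only the z-score has unit standard deviation.
print(round(c(mean.pareto = mean(pareto), sd.pareto = sd(pareto)), 3))
print(round(c(mean.z = mean(z), sd.z = sd(z)), 3))

# The standard deviation after Pareto scaling equals sqrt(sd(x)):
print(all.equal(sd(pareto), sqrt(sd(x))))
```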

The ggtree dendrograms with heat maps:

For more designs, we encourage you to check the articles on ggtree as well as the extensive and informative vignette of the package:

The Bioconductor website of the package:
