Dendrograms

Metabolites and lipids multivariate statistical analysis in R

Dendrograms are a simple way to depict the hierarchical clustering (HC) of samples. Typically, the lipids and metabolites used for hierarchical clustering are pre-selected using univariate or multivariate statistical analyses. Dendrograms are frequently supplemented with heat maps that show trends across the selected lipids and metabolites, which aids the interpretation of the clustering.

Practical applications of dendrograms (examples)

Simple dendrograms (no heat map attached) are still used to present hierarchical clustering (grouping based on similarities and relationships).

To enhance their visual appeal, they are often supplemented with additional graphics or displayed in a different layout, e.g., a circular one.

However, in many lipidomics and metabolomics-oriented manuscripts, dendrograms are enhanced with heat maps:

  • R. Jirásko et al. Altered Plasma, Urine, and Tissue Profiles of Sulfatides and Sphingomyelins in Patients with Renal Cell Carcinoma. DOI: https://doi.org/10.3390/cancers14194622 - Fig. 2A & 2B, Fig. 4A.

  • D. Wolrab et al. Plasma lipidomic profiles of kidney, breast and prostate cancer patients differ from healthy controls. DOI: https://doi.org/10.1038/s41598-021-99586-1 - Fig. 7E, F, G.

  • M. T. Odenkirk et al. Combining Micropunch Histology and Multidimensional Lipidomic Measurements for In-Depth Tissue Mapping. DOI: https://doi.org/10.1021/acsmeasuresciau.1c00035 - Fig. 2 (the circular dendrogram with a heat map is included in the manuscript's abstract, emphasizing the significance of this visualization method for conveying insights; additionally, the ggtree package mentioned below is used to create stunning visualizations).

  • M. T. Odenkirk et al. From Prevention to Disease Perturbations: A Multi-Omic Assessment of Exercise and Myocardial Infarctions. DOI: https://doi.org/10.3390/biom11010040 - Fig. 3 (ggtree package in use!)

Dendrograms in R via ggtree

R offers a great package for creating dendrograms: ggtree. We will use it to present the clustering of our PDAC data set based on the lipid species most significantly altered in patients with pancreatic cancer.

First, we need to prepare our PDAC data set:

# Building ggtree dendrograms in R.
# Calling libraries:
library(tidyverse)
library(ggtree)
library(rstatix)

# Adjusting column type:
data$Label <- as.factor(data$Label)

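# Filtering out patients with pancreatitis (and dropping the unused factor level):
data.no.PAN <-
  data %>% 
  filter(Label != "PAN") %>%
  droplevels()
  
# Creating a long-format tibble (pancreatitis samples excluded,
# so that the test below compares two groups only):
data.long <- 
  data.no.PAN %>%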
  select(-`Sample Name`) %>%
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations")
               
# Here, we perform clustering on the 12 most significant lipids from the M-W U test:
Mann.Whitney.test <- 
  data.long %>%
  group_by(Lipids) %>%
  wilcox_test(Concentrations ~ Label, 
              p.adjust.method = 'none')
              
# Separating most significant lipids:
Mann.Whitney.test.head <-
  Mann.Whitney.test %>%
  arrange(p) %>%
  slice_head(n = 12)

Lipids <- Mann.Whitney.test.head$Lipids  

# Creating tibble for hierarchical clustering:
data.selected <-
  data.no.PAN %>%
  select(`Sample Name`,
         Label, 
         all_of(Lipids))
         
# Data log10-transformation and Pareto-scaling:
data.log10 <-
  data.selected %>%
  mutate_if(is.numeric, log10)
  
Pareto.scaling <- function(x) {(x-mean(x))/sqrt(sd(x))}

data.Pareto.scaled <-
  data.log10 %>%
  mutate_if(is.numeric, ~Pareto.scaling(.))
  
# Before computing distances, we MUST name rows according to samples.
# This will be later needed to identify tree branches.
# We can use tidyverse functions: remove_rownames() & column_to_rownames():
data.Pareto.scaled <- 
  data.Pareto.scaled %>% 
  remove_rownames() %>% 
  column_to_rownames(var = "Sample Name")
  
# Now, we can compute matrix of Euclidean distances between samples (base R functions):
distance <- 
  data.Pareto.scaled %>%
  select(- Label) %>%
  dist(diag = TRUE,
       method = 'euclidean')
       
# Hierarchical clustering using Ward.D2 algorithm (base R functions):
clustering <- hclust(distance, method = 'ward.D2')
         
# Tibble with columns necessary to create/annotate ggtree branches: 
tip_data <- 
  data.selected %>% 
  select(`Sample Name`, Label)
  
# Selecting colors for the ggtree tips:  
colors <- c("N" = "blue", "T"="red2")

# First, we create ggtree dendrogram. 
# We select a circular shape to save space (206 samples).
# If your data set contains fewer observations, use a classic rectangular shape.
# In this case - change the layout to 'rectangular'.
ggtree(clustering, layout = "circular", size = 0.8) 

We obtain this output:

A raw circular dendrogram created using the ggtree package.

However, such a dendrogram is difficult to interpret. We will add tips with colors corresponding to our biological groups:
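A minimal sketch of this step, assuming the ggtree `%<+%` operator is used to attach the group labels and `geom_tippoint()` to draw the colored tips; the original code for this figure may differ. So that the snippet runs on its own, it builds a small simulated data set (`toy`, `toy.clustering`, and `toy.tips` are hypothetical names). With the PDAC data, `clustering`, `tip_data`, and `colors` from the code above would take their place.

```r
library(ggplot2)
library(ggtree)

# Hypothetical toy data: 10 controls ("N") and 10 cases ("T"), 4 simulated lipids.
set.seed(1)
toy <- matrix(rnorm(20 * 4), nrow = 20,
              dimnames = list(paste0("S", 1:20), paste0("Lipid", 1:4)))
toy[11:20, ] <- toy[11:20, ] + 3   # shift the cases so that two groups emerge
toy.clustering <- hclust(dist(toy, method = "euclidean"), method = "ward.D2")

# Annotation table: the first column must match the tip labels,
# i.e., the row names of the matrix passed to dist().
toy.tips <- data.frame(Sample = rownames(toy),
                       Label = rep(c("N", "T"), each = 10))
colors <- c("N" = "blue", "T" = "red2")

# %<+% attaches the annotation to the tree; geom_tippoint() draws the tips.
p.tips <- ggtree(toy.clustering, layout = "circular", size = 0.8) %<+% toy.tips +
  geom_tippoint(aes(color = Label), size = 2) +
  scale_color_manual(values = colors, name = "Group")

p.tips
```

Because `%<+%` binds more tightly than `+`, the annotation is attached before any layer is added, so `Label` is available to every subsequent geom.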

The modified circular dendrogram:

The ggtree circular dendrogram presenting Ward.D2 clustering of samples based on Euclidean distances computed from normalized concentrations of the 12 most significant lipids from the two-sample Mann-Whitney test.

Now, we clearly see a separation of healthy controls (blue dots) from patients with pancreatic cancer (red dots). Besides the two main branches, which split the samples into clusters corresponding to controls and PDAC patients, we see additional subclusters among both healthy controls and PDAC patients. However, these are difficult to explain unless each sample is annotated. We can do this through the geom_tiplab() layer:
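A possible way to add the labels is sketched here on simulated data; the `size` and `offset` values are arbitrary assumptions that would need tuning for 206 samples, and the `toy.*` objects are hypothetical stand-ins for `clustering` and `tip_data`.

```r
library(ggplot2)
library(ggtree)

# Hypothetical toy data standing in for the PDAC objects.
set.seed(1)
toy <- matrix(rnorm(20 * 4), nrow = 20,
              dimnames = list(paste0("S", 1:20), paste0("Lipid", 1:4)))
toy[11:20, ] <- toy[11:20, ] + 3
toy.clustering <- hclust(dist(toy), method = "ward.D2")
toy.tips <- data.frame(Sample = rownames(toy),
                       Label = rep(c("N", "T"), each = 10))

# geom_tiplab() prints the tip label (here: the sample name) next to each tip;
# offset moves the text outwards so it does not overlap the colored points.
p.labelled <- ggtree(toy.clustering, layout = "circular", size = 0.8) %<+% toy.tips +
  geom_tippoint(aes(color = Label), size = 2) +
  geom_tiplab(size = 2.5, offset = 0.5) +
  scale_color_manual(values = c("N" = "blue", "T" = "red2"), name = "Group")

p.labelled
```

The labels shown are the row names assigned before computing the distance matrix, which is why that naming step matters.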

The look of the annotated ggtree dendrogram:

The ggtree dendrogram with sample names added to branches.
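The visual impression of two main clusters can also be cross-checked numerically. The sketch below, which uses only base R and simulated data, cuts the tree into two clusters with `cutree()` and tabulates them against the known groups; the `toy.*` names are hypothetical.

```r
# Hypothetical toy data: 10 controls ("N") and 10 cases ("T"), 4 simulated lipids.
set.seed(1)
toy <- matrix(rnorm(20 * 4), nrow = 20,
              dimnames = list(paste0("S", 1:20), paste0("Lipid", 1:4)))
toy[11:20, ] <- toy[11:20, ] + 3
toy.labels <- rep(c("N", "T"), each = 10)

toy.clustering <- hclust(dist(toy), method = "ward.D2")

# cutree() returns one cluster number per sample, in the original row order.
toy.clusters <- cutree(toy.clustering, k = 2)

# Cross-tabulation of cluster membership against the biological group.
agreement <- table(Cluster = toy.clusters, Label = toy.labels)
agreement
```

With the PDAC objects, the analogous call would be `table(cutree(clustering, k = 2), data.selected$Label)`, since the row order is unchanged between `data.selected` and the distance matrix.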

To make our dendrogram more attractive, we can add a heat map around it.
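One way to do this, assuming the `gheatmap()` function from ggtree is used, is sketched below on simulated data. The `offset`, `width`, and font settings are arbitrary assumptions, and the `toy.*` objects are hypothetical stand-ins for the PDAC objects.

```r
library(ggplot2)
library(ggtree)

# Hypothetical toy data standing in for the PDAC objects.
set.seed(1)
toy <- matrix(rnorm(20 * 4), nrow = 20,
              dimnames = list(paste0("S", 1:20), paste0("Lipid", 1:4)))
toy[11:20, ] <- toy[11:20, ] + 3
toy.clustering <- hclust(dist(toy), method = "ward.D2")
toy.tips <- data.frame(Sample = rownames(toy),
                       Label = rep(c("N", "T"), each = 10))

p.tips <- ggtree(toy.clustering, layout = "circular", size = 0.8) %<+% toy.tips +
  geom_tippoint(aes(color = Label), size = 2) +
  scale_color_manual(values = c("N" = "blue", "T" = "red2"), name = "Group")

# gheatmap() expects a data frame (or matrix) whose row names match the tip labels.
# Each column becomes one ring of the heat map around the circular tree.
p.heat <- gheatmap(p.tips, as.data.frame(toy),
                   offset = 0.5, width = 0.4,
                   low = "blue", high = "red2",
                   colnames_angle = 90, font.size = 2)

p.heat
```

With the PDAC data, the heat map input could be `data.Pareto.scaled %>% select(all_of(Lipids[1:5]))`, which keeps the sample names as row names and restricts the rings to the 5 most significant lipids.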

The ggtree dendrograms with heat maps:

The ggtree dendrograms present the Ward.D2 clustering of samples with the Euclidean distance measure computed from normalized concentrations of 12 selected lipids. The dendrogram is surrounded by a heat map that presents differences between controls and PDAC cases in the levels of the 5 most significant lipids according to the Mann-Whitney U test.

For more designs, we encourage you to check the articles on ggtree as well as the extensive and informative vignette of the package:

The full vignette of ggtree.

The Bioconductor website of the package:

The ggtree package on Bioconductor.
