Histograms

Metabolites and lipids descriptive statistical analysis in R

Histograms represent frequency distributions. The values are grouped into bins, and the height of every bar (rectangle) shows how often values from that bin occur in the data set. Histograms are useful for visually assessing, e.g., the skewness or symmetry of a distribution. They are therefore mostly used in the first steps of data preparation for statistical analysis (data inspection) and are less frequently presented in manuscripts.
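The idea of bins and counts can be inspected directly in base R. The following is only a minimal sketch on simulated, right-skewed values: the object names conc and h are hypothetical and not part of the data set used in this guide. Note that breaks = 10 is treated by hist() as a suggestion, so the final number of bins may differ slightly.

```r
# Simulated, right-skewed "concentrations" (hypothetical values, not real data)
set.seed(1)
conc <- rlnorm(100, meanlog = 2, sdlog = 0.5)

# Compute the histogram without drawing it
h <- hist(conc, breaks = 10, plot = FALSE)

h$breaks   # bin edges
h$counts   # number of observations falling into each bin

# Every observation is counted in exactly one bin
sum(h$counts) == length(conc)

# Draw the histogram itself
hist(conc, breaks = 10,
     main = "Simulated concentrations",
     xlab = "Concentration")
```

A long tail of sparsely filled bins on the right side of such a plot is the typical visual sign of a right-skewed distribution.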

Histograms can be used to present data, such as the distribution of particle sizes, e.g., drug-containing nanorods or exosomes released by cells:

M. R. Abedin et al. Antibody–drug nanoparticle induces synergistic treatment efficacies in HER2 positive breast cancer cells. DOI: https://doi.org/10.1038/s41598-021-86762-6 - Fig. 1 (c) & (d) (presenting drug-containing nanorods length and diameter as histograms, i.e., particle count is plotted vs. length or diameter bins).
G. K. Patel et al. Comparative analysis of exosome isolation methods using culture supernatant for optimum yield, purity and downstream applications. DOI: https://doi.org/10.1038/s41598-019-41800-2 - Fig. 2 (histograms were used to display the percentage of intensity (frequency) associated with exosome groups of specific radius value ranges).

Another example is the relative abundance of different C=C location isomers (similarly, histograms could be used for presenting lipid concentration distributions in two or more experimental groups):

Z. Li et al. Single-cell lipidomics with high structural specificity by mass spectrometry. DOI: https://doi.org/10.1038/s41467-021-23161-5 - Fig. 4b, c, d (study in Nature Communications showing histograms of relative abundance of C=C location isomers of PC 34:2 in four types of single cells).

Histograms can also support the technical validation of analytical methods, e.g.:

D. K. Barupal et al. Generation and quality control of lipidomics data for the alzheimer’s disease neuroimaging initiative cohort. DOI: https://doi.org/10.1038/sdata.2018.263 - Fig. 4 (the authors prepared a histogram that shows the distribution of RSD (%) for compounds in the QC samples (x-axis) vs. frequency of occurrence (y-axis), presenting the reproducibility of peak heights for the detected compounds in QC samples).

They can also help to identify discordant lipid/metabolite features in statistical analyses performed on two similar data sets:

A. Dakic et al. Imputation of plasma lipid species to facilitate integration of lipidomic datasets. DOI: https://doi.org/10.1038/s41467-024-45838-3 (Fig. 1a).

The histogram and kernel density curve were used to present the distribution of heritability estimates across all the lipid species in another study:

R. Tabassum et al. Genetic architecture of human plasma lipidome and its link to cardiovascular disease. DOI: https://doi.org/10.1038/s41467-019-11954-8 - Fig. 2a (upper part of panel).

Authors of the following manuscript published in Clinics and Research in Hepatology and Gastroenterology (Elsevier) used histograms for the comparison of probability distributions of circulating TG and glucose levels between healthy individuals and hepatic steatosis patients:

J. Zhou et al. Probabilistic Scatter Plots for visualizing carbohydrate and lipid metabolism states in Non-Alcoholic Fatty Liver Disease. DOI: https://doi.org/10.1016/j.clinre.2024.102365 - Fig. 1 & 2.
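One of the uses listed above, the histogram of RSD (%) values in QC samples, can be sketched in base R. This is only an illustration on simulated peak heights, not the procedure of the cited authors: the matrix qc, its dimensions, and the simulated variability are all assumptions. The RSD of a compound is computed as 100 * sd / mean across the QC injections.

```r
set.seed(2)

# Hypothetical QC data: 20 QC injections (rows) x 150 compounds (columns)
n_qc <- 20
n_compounds <- 150
true_height <- rlnorm(n_compounds, meanlog = 10, sdlog = 1)  # simulated mean peak heights
cv <- runif(n_compounds, min = 0.02, max = 0.30)             # simulated relative variability

qc <- sapply(seq_len(n_compounds), function(j) {
  rnorm(n_qc, mean = true_height[j], sd = cv[j] * true_height[j])
})

# RSD (%) of every compound across the QC injections
rsd <- apply(qc, 2, function(x) 100 * sd(x) / mean(x))

# Distribution of RSD values: x-axis = RSD (%), y-axis = number of compounds
hist(rsd, breaks = 20,
     main = "RSD of peak heights in QC samples (simulated)",
     xlab = "RSD (%)",
     ylab = "Number of compounds")
```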

Preparing histograms via dlookr (level: basic)

The dlookr library contains the function plot_normality(), which can be combined with tidyverse tools. Besides the histogram of the original data, the function plots two additional histograms of the transformed data (bottom left and bottom right), so we can immediately assess how the distribution changes when a transformation is applied. The transformation methods are selected through the left and right arguments, and the color of the histograms through the col argument. Together with the histograms, a Q-Q plot is produced, which also enables assessing the data distribution: if the points (dots) in the Q-Q plot follow a straight line, the data are consistent with a normal (Gaussian) distribution. The plot_normality() function from dlookr is certainly one of the simplest ways to obtain beautiful histograms:

# Calling libraries (tidyverse provides %>%, select(), and group_by()):
library(tidyverse)
library(dlookr)

# Checking the documentation:
?plot_normality()

# Plotting histograms:
data %>%
  select(`Label`,
         `SM 41:1;O2`) %>%
  group_by(Label) %>%
  plot_normality(left = 'log',
                 right = 'sqrt',
                 col = 'steelblue')

We obtain these three beautiful histograms arranged by labels (Label column):
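To see why a transformation can straighten the Q-Q plot, the same check can be sketched in base R without dlookr. The example below uses simulated log-normal values, and the helper qq_cor() is a hypothetical convenience function (not part of any package) that returns the correlation between theoretical and sample quantiles: the closer it is to 1, the closer the points are to a straight line.

```r
set.seed(3)

# Simulated right-skewed (log-normal) values, hypothetical data
x <- rlnorm(200, meanlog = 1, sdlog = 1)

# Hypothetical helper: correlation of the Q-Q plot coordinates
qq_cor <- function(v) {
  q <- qqnorm(v, plot.it = FALSE)
  cor(q$x, q$y)
}

qq_cor(x)        # raw values: points bend away from a straight line
qq_cor(log(x))   # log-transformed values: points lie close to a straight line

# Q-Q plots before and after the log transformation
op <- par(mfrow = c(1, 2))
qqnorm(x, main = "Original");             qqline(x)
qqnorm(log(x), main = "Log-transformed"); qqline(log(x))
par(op)
```

The correlation is used here only as a rough numeric summary of the plot; the visual inspection described above remains the main purpose of the Q-Q plot.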

Preparing histograms via tidyplots (level: basic)

The tidyplots R library offers a simple solution for producing publication-ready histograms. The code is straightforward: create a new tidyplot object and add a histogram layer. The number of bins can be adjusted through the bins argument (default is 30) and the width of the bins through the binwidth argument; both are arguments of add_histogram(). Take a look at the code block below:

# Activate the tidyplots library:
library(tidyplots)

# Build a histogram:
# create a tidyplot object and add a histogram layer (3 lines of code only!)
data %>%
  tidyplot(x = `SM 41:1;O2`, color = Label) %>%
  add_histogram() %>%
  adjust_colors(new_colors = c("royalblue", "orange", "red2")) # optional 4th line: custom fill colors

The publication-ready plot:

Preparing histograms via ggpubr (level: intermediate)

The ggpubr library contains the gghistogram() function. Its application is straightforward: as x, we indicate the column of interest, i.e., the one containing the concentrations of a lipid/metabolite. We can group the bars and fill them with different colors according to biological groups (fill and palette arguments), add a marginal rug (rug = TRUE), and further customize the plot (add a plot title and axis titles, modify their look, add additional statistics through the add argument, etc.). Take a look at the code block below and the outputs:

# Calling ggpubr library:
library(ggpubr)

# Reading the documentation about the function of interest:
?gghistogram()

# Plotting histograms for a single lipid:
gghistogram(data,
            x = "SM 41:1;O2",
            fill = "Label",
            palette = c('royalblue', 'orange', 'red2'),
            rug = TRUE,
            xlab = "Concentration of SM 41:1;O2, [nmol/mL]",
            ylab = "Count",
            title = "Histogram - SM 41:1;O2")

Using a long tibble and facet_grid(), more histograms can be plotted at once:

# Plotting histograms for multiple lipid species:
# 1. Selecting the data of interest and creating long tibble:
data.long <- 
  data %>%
  filter(`Label` != "PAN") %>%
  droplevels() %>%
  select(`Label`,
         `SM 39:1;O2`,
         `SM 40:1;O2`,
         `SM 41:1;O2`,
         `SM 42:1;O2`) %>%
  pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations")
               
# 2. Creating the plot:
gghistogram(data.long,
            x = "Concentrations",
            fill = "Label",
            palette = c('royalblue', 'red2'),
            rug = TRUE) +
  facet_grid(. ~ Lipids, scales = 'free_x')

Important: In facet_grid(), it is necessary to set scales = 'free_x' so that every panel has its own x scale.

Preparing histograms via ggplot2 (level: advanced)

Preparing a histogram in ggplot2 is achieved through geom_histogram(). For these examples, we remove the 'PAN' group from the data set (filter(Label != 'PAN')):

# Plotting a histogram for a single lipid:
data %>%
  filter(`Label` != 'PAN') %>%
  ggplot(aes(x = `SM 41:1;O2`, 
             color = `Label`, 
             fill = `Label`)) +
  geom_histogram(alpha = 0.4) +
  scale_color_manual(values = c('royalblue', 'red2')) +
  scale_fill_manual(values = c('royalblue', 'red2')) +
  theme_bw()

The plot was lightly customized: the outline and fill colors were set to 'royalblue' for 'N' and 'red2' for 'T', the opacity of the histogram fill was set to 0.4, and the default gray ggplot2 theme was changed to theme_bw(). We obtain the following plot:

We can add mean values to this plot using geom_vline():

# Computing mean values:
means <- 
  data %>%
  filter(`Label` != 'PAN') %>%
  select(`Label`, `SM 41:1;O2`) %>%
  group_by(Label) %>%
  summarise(mean = mean(`SM 41:1;O2`))
  
# Adding mean values to the plot:
data %>%
  filter(`Label` != 'PAN') %>%
  ggplot(aes(x = `SM 41:1;O2`, 
             color = `Label`, 
             fill = `Label`)) +
  geom_histogram(alpha = 0.4) +
  geom_vline(xintercept = means$mean, 
             colour = c('darkblue', 'red2'), 
             linetype = 'dashed',
             linewidth = 1) +
  scale_color_manual(values = c('royalblue', 'red2')) +
  scale_fill_manual(values = c('royalblue', 'red2')) +
  theme_bw()

Moreover, we can also add a density plot to our histogram. For the histogram and the density curve to share the same y-axis, the y aesthetic in geom_histogram() needs to be mapped to the density instead of the count:

... +
geom_histogram(aes(y = after_stat(density)), ...) +
... +

# Adding density plot to our histogram:
data %>%
  filter(`Label` != 'PAN') %>%
  ggplot(aes(x = `SM 41:1;O2`, 
             color = `Label`, 
             fill = `Label`)) +
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4) +
  geom_density(alpha = 0,
               linewidth = 1) +
  geom_vline(xintercept = means$mean, 
             colour = c('darkblue', 'red2'), 
             linetype = 'dashed',
             linewidth = 1) +
  scale_color_manual(values = c('royalblue', 'red2')) +
  scale_fill_manual(values = c('royalblue', 'red2')) +
  theme_bw()
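What the density scaling does can be verified on a small example. The sketch below uses simulated data (the data frame df and the column conc are hypothetical) and after_stat(density), the notation available in ggplot2 3.3.0 and later. On the density scale, the bar areas (height x width) of every group sum to 1, which is why the histogram and the kernel density curve can share one y-axis.

```r
library(ggplot2)

set.seed(4)

# Simulated concentrations for two groups (hypothetical data)
df <- data.frame(
  Label = rep(c("N", "T"), times = c(60, 40)),
  conc  = c(rnorm(60, mean = 10, sd = 2),
            rnorm(40, mean = 13, sd = 2))
)

p <- ggplot(df, aes(x = conc, fill = Label)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 20,
                 alpha = 0.4,
                 position = "identity")

# Extract the computed bars and sum their areas per group
bars <- layer_data(p, 1)
area <- tapply(bars$density * (bars$xmax - bars$xmin), bars$group, sum)
area   # one value per group, each equal to 1 up to rounding
```

Because each group is rescaled separately, groups of different sizes (here 60 and 40 observations) become directly comparable in shape, but the bar heights no longer reflect the number of observations.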

For multiple lipids, we need to create a long tibble and use, e.g., facet_grid():

# Preparing a long tibble for plotting:
data.long <- 
  data %>%
  filter(`Label` != "PAN") %>%
  droplevels() %>%
  select(`Label`,
         `SM 39:1;O2`,
         `SM 40:1;O2`,
         `SM 41:1;O2`,
         `SM 42:1;O2`) %>%
  pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") 

# Computing mean values for vlines (one per group and lipid),
# reusing the long tibble prepared above:
means <- 
  data.long %>%
  group_by(Label, Lipids) %>%
  summarise(mean = mean(Concentrations),
            .groups = "drop")

# Multipanel plot:
data.long %>%
  ggplot(aes(x = `Concentrations`, 
             color = `Label`, 
             fill = `Label`)) +
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4) +
  geom_density(alpha = 0,
               linewidth = 1) +
  geom_vline(data = means,
             aes(xintercept = mean),
             linetype = 'dashed',
             linewidth = 1,
             colour = rep(c('darkblue', 'red2'),
                          each = 4)) +
  scale_color_manual(values = c('royalblue', 'red2')) +
  scale_fill_manual(values = c('royalblue', 'red2')) +
  theme_bw() +
  facet_grid(. ~ Lipids, scales = "free_x")

We reuse most of the code from above. However, we need to modify geom_vline() appropriately. We pass means to geom_vline(data = means, ...) and adjust the aesthetics (aes(xintercept = mean)). We select the linetype and linewidth, and we need to add the colours of our vlines (a vector with 8 colours, as 8 lines will finally be plotted). To avoid creating a vector with 8 entries manually, we let the rep() function prepare it for us: take 'darkblue' and 'red2' and repeat each 4x to create a vector with 8 entries. This works because means is sorted by Label first, so its first four rows belong to 'N' and the last four to 'T'. Finally, we use facet_grid() to spread lipid names across columns (create four panels), and we indicate that each plot should have an individual x-axis (scales = "free_x").

We obtain this nice chart:

The histograms can be further customized to improve the charts' appearance.
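One possible customization, shown here only as a sketch on simulated data, is to map the colour of the mean lines to Label inside geom_vline() instead of passing a manually ordered colour vector. The data frame df and the lipid names 'Lipid A' and 'Lipid B' are hypothetical. Note that geom_vline() does not inherit the aesthetics from ggplot() by default, so the colour mapping has to be repeated in its own aes().

```r
library(ggplot2)

set.seed(5)

# Hypothetical long-format data: 2 groups x 2 lipids, 50 observations each
df <- data.frame(
  Label  = rep(c("N", "T"), each = 100),
  Lipids = rep(rep(c("Lipid A", "Lipid B"), each = 50), times = 2),
  Concentrations = c(rnorm(50, 10, 2), rnorm(50, 30, 5),
                     rnorm(50, 13, 2), rnorm(50, 36, 5))
)

# One mean per Label x Lipids combination
means <- aggregate(Concentrations ~ Label + Lipids, data = df, FUN = mean)
names(means)[names(means) == "Concentrations"] <- "mean"

p <- ggplot(df, aes(x = Concentrations, colour = Label, fill = Label)) +
  geom_histogram(alpha = 0.4, bins = 20, position = "identity") +
  # The colour is taken from the Label column of 'means',
  # so the row order of 'means' no longer matters
  geom_vline(data = means,
             aes(xintercept = mean, colour = Label),
             linetype = "dashed",
             linewidth = 1) +
  scale_colour_manual(values = c(N = "royalblue", T = "red2")) +
  scale_fill_manual(values = c(N = "royalblue", T = "red2")) +
  theme_bw() +
  facet_grid(. ~ Lipids, scales = "free_x")

p
```

The trade-off is that the mean lines now share the colour scale with the histogram outlines ('royalblue' instead of 'darkblue' for 'N'). If a separate colour for the lines is required, the rep() approach remains the simpler option.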

IMPORTANT: You probably noticed differences between the ggpubr and ggplot2 histograms. In ggplot2, the position of the bars in geom_histogram() is set to 'stack' by default. If you want the bars of the N and T observations to be overlaid as in ggpubr, set the position to 'identity':

# Changing arrangement of bars in ggplot2 histogram:
... +
geom_histogram(position = 'identity') +
...
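The difference between the two positions can be checked on a small simulated example (the data frame df and the column conc are hypothetical). The sketch builds the same histogram twice and inspects where the bars start on the y-axis.

```r
library(ggplot2)

set.seed(6)

# Simulated concentrations for two overlapping groups (hypothetical data)
df <- data.frame(
  Label = rep(c("N", "T"), each = 80),
  conc  = c(rnorm(80, mean = 10, sd = 2),
            rnorm(80, mean = 12, sd = 2))
)

base <- ggplot(df, aes(x = conc, fill = Label))

# Default behaviour: position = "stack"
p_stack <- base + geom_histogram(bins = 15, alpha = 0.4)

# Overlaid bars: position = "identity"
p_identity <- base + geom_histogram(bins = 15, alpha = 0.4,
                                    position = "identity")

stacked  <- layer_data(p_stack, 1)
overlaid <- layer_data(p_identity, 1)

# With "identity", every bar starts at zero
all(overlaid$ymin == 0)

# With "stack", the bars of one group sit on top of the other,
# so some of them start above zero
any(stacked$ymin > 0)

# The counts per bin are the same in both plots; only the placement differs
sum(stacked$count) == sum(overlaid$count)
```

With stacked bars, the total bar height shows the combined count of both groups, while overlaid bars make it easier to compare the shapes of the two distributions, provided that the fill is semi-transparent (alpha below 1).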

