Omics data visualization in R and Python

Histograms

Metabolites and lipids descriptive statistical analysis in R

Histograms represent frequency distributions. The range of values is divided into intervals (bins), and the height of each bar (rectangle) shows how often values from that bin occur in the data set. Histograms are useful for visually assessing, e.g., the skewness or symmetry of a distribution. They are therefore mostly used in the first steps of data preparation for statistical analysis (data inspection) and are presented in manuscripts less frequently.
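
To make the idea of binning concrete, here is a minimal sketch in base R on simulated values. The vector conc is a hypothetical stand-in for lipid concentrations, not the book's data set. It computes the bins without drawing anything and shows that the bar heights are simply the counts per bin:

```r
# Simulated, right-skewed "concentrations" (hypothetical data for illustration only)
set.seed(1)
conc <- rlnorm(200, meanlog = 2, sdlog = 0.5)

# hist(..., plot = FALSE) returns the bin edges and the counts per bin
h <- hist(conc, breaks = 10, plot = FALSE)

# One row per bin: lower edge, upper edge, and number of values in the bin
bins <- data.frame(lower = head(h$breaks, -1),
                   upper = tail(h$breaks, -1),
                   count = h$counts)
print(bins)

# Every value falls into exactly one bin, so the counts add up to the sample size
sum(bins$count) == length(conc)
```

Note that breaks = 10 is only a suggestion to hist(); the function may pick a slightly different number of "pretty" bin edges.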

Histograms can be used to present data, such as the distribution of particle sizes, e.g., drug-containing nanorods or exosomes released by cells:

  • M. R. Abedin et al. Antibody–drug nanoparticle induces synergistic treatment efficacies in HER2 positive breast cancer cells. DOI: https://doi.org/10.1038/s41598-021-86762-6 - Fig. 1 (c) & (d) (presenting drug-containing nanorod length and diameter as histograms, i.e., particle count is plotted vs. length or diameter bins).

  • G. K. Patel et al. Comparative analysis of exosome isolation methods using culture supernatant for optimum yield, purity and downstream applications. DOI: https://doi.org/10.1038/s41598-019-41800-2 - Fig. 2 (histograms were used to display the percentage of intensity (frequency) associated with exosome groups of specific radius value ranges).

Another example is the relative abundance of different C=C location isomers (similarly, histograms could be used to present lipid concentration distributions in two or more experimental groups):

  • Z. Li et al. Single-cell lipidomics with high structural specificity by mass spectrometry. DOI: https://doi.org/10.1038/s41467-021-23161-5 - Fig. 4b, c, d (study in Nature Communications showing histograms of relative abundance of C=C location isomers of PC 34:2 in four types of single cells).

Histograms can also present elements of the technical validation of analytical methods that relate to frequency, e.g.:

  • D. K. Barupal et al. Generation and quality control of lipidomics data for the Alzheimer’s disease neuroimaging initiative cohort. DOI: https://doi.org/10.1038/sdata.2018.263 - Fig. 4 (the authors prepared a histogram that shows the distribution of RSD (%) for compounds in the QC samples (x-axis) vs. frequency of occurrence (y-axis), presenting the reproducibility of peak heights for the detected compounds in QC samples).

They can also help identify discordant lipid/metabolite features in statistical analyses performed on two similar data sets:

  • A. Dakic et al. Imputation of plasma lipid species to facilitate integration of lipidomic datasets. DOI: https://doi.org/10.1038/s41467-024-45838-3 - Fig. 1a.

In another study, a histogram with a kernel density curve was used to present the distribution of heritability estimates across all lipid species:

  • R. Tabassum et al. Genetic architecture of human plasma lipidome and its link to cardiovascular disease. DOI: https://doi.org/10.1038/s41467-019-11954-8 - Fig. 2a (upper part of panel).

The authors of the following manuscript, published in Clinics and Research in Hepatology and Gastroenterology (Elsevier), used histograms to compare the probability distributions of circulating TG and glucose levels between healthy individuals and hepatic steatosis patients:

  • J. Zhou et al. Probabilistic Scatter Plots for visualizing carbohydrate and lipid metabolism states in Non-Alcoholic Fatty Liver Disease. DOI: https://doi.org/10.1016/j.clinre.2024.102365 - Fig. 1 & 2.

Preparing histograms via dlookr (level: basic)

The dlookr library contains the function plot_normality(), which can be combined with tidyverse tools. This function immediately shows how the distribution changes when transformation methods are applied: in addition to the histogram of the original data, histograms of the transformed data are plotted at the bottom left and bottom right. The transformation methods are selected through the arguments left and right, and the color of the histograms through the col argument. Together with the histograms, a Q-Q plot is produced, which also enables assessing the data distribution. If the points (dots) in the Q-Q plot follow a straight line, the data are approximately normally (Gaussian) distributed. The plot_normality() from dlookr is certainly one of the simplest ways to obtain beautiful histograms:

# Calling libraries (tidyverse provides %>%, select() and group_by()):
library(tidyverse)
library(dlookr)

# Checking the documentation:
?plot_normality

# Plotting histograms:
data %>%
  select(`Label`,
         `SM 41:1;O2`) %>%
  group_by(Label) %>%
  plot_normality(left = 'log',
                 right = 'sqrt',
                 col = 'steelblue')

We obtain these three beautiful histograms arranged by labels (Label column):
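
What the Q-Q plot and the transformed histograms convey can also be checked numerically. The sketch below is an illustration in base R on simulated data (the vector conc is hypothetical), not a reproduction of what plot_normality() computes internally. It draws Q-Q plots before and after a log transformation and adds a Shapiro-Wilk test as a numerical companion to the visual check:

```r
# Simulated right-skewed values standing in for a lipid concentration (hypothetical)
set.seed(42)
conc <- rlnorm(150, meanlog = 1, sdlog = 0.8)

# Q-Q plots of raw and log-transformed values, side by side
old_par <- par(mfrow = c(1, 2))
qqnorm(conc, main = "Raw values")
qqline(conc, col = "steelblue")
qqnorm(log(conc), main = "Log-transformed")
qqline(log(conc), col = "steelblue")
par(old_par)

# Shapiro-Wilk test: a small p-value indicates departure from normality
p_raw <- shapiro.test(conc)$p.value
p_log <- shapiro.test(log(conc))$p.value
c(raw = p_raw, log = p_log)
```

For log-normally simulated data such as these, the points of the log-transformed Q-Q plot should lie much closer to the reference line than those of the raw values.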

Preparing histograms via tidyplots (level: basic)

The tidyplots R library offers a simple way to produce publication-ready histograms. The code is straightforward: one needs to create a new tidyplot object and add a histogram layer. The number of bins can be set through the bins argument (default is 30) and the width of the bins through the binwidth argument; both are arguments of add_histogram(). Take a look at the code block below:

# Activate the tidyplots library:
library(tidyplots)

# Build a histogram 
# Create a tidyplots object + add a histogram layer - 3 lines of code only!
data %>%
  tidyplot(x= `SM 41:1;O2`, color = Label) %>%
  add_histogram() %>%
  adjust_colors(new_colors = c("royalblue", "orange", "red2")) # ...4 if you want to optimize the fillings.

The publication-ready plot:

Preparing histograms via ggpubr (level: intermediate)

The ggpubr library contains the gghistogram() function. Its application is straightforward: as x, we indicate the column of interest, i.e., the one containing the concentrations of a lipid/metabolite. We can group the bars and fill them with different colors according to biological groups (fill and palette arguments), add a marginal rug (rug = TRUE), and further customize the plot (add a plot title and axis titles, modify their look, add additional statistics through the add argument, etc.). Take a look at the code block below and the outputs:

# Calling ggpubr library:
library(ggpubr)

# Reading the documentation about the function of interest:
?gghistogram()

# Plotting histograms for a single lipid:
gghistogram(data,
            x = "SM 41:1;O2",
            fill = "Label",
            palette = c('royalblue', 'orange', 'red2'),
            rug = TRUE,
            xlab = "Concentration of SM 41:1;O2, [nmol/mL]",
            ylab = "Count",
            title = "Histogram - SM 41:1;O2")

Using a long tibble and facet_grid(), more histograms can be plotted at once:

# Plotting histograms for multiple lipid species:
# 1. Selecting the data of interest and creating long tibble:
data.long <- 
  data %>%
  filter(`Label` != "PAN") %>%
  droplevels() %>%
  select(`Label`,
         `SM 39:1;O2`,
         `SM 40:1;O2`,
         `SM 41:1;O2`,
         `SM 42:1;O2`) %>%
  pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations")
               
# 2. Creating the plot:
gghistogram(data.long,
            x = "Concentrations",
            fill = "Label",
            palette = c('royalblue', 'red2'),
            rug = TRUE) +
  facet_grid(. ~ Lipids, scales = 'free_x')

Important: Set scales = 'free_x' in facet_grid() so that every panel gets its own x-axis scale.

Preparing histograms via ggplot2 (level: advanced)

Preparing a histogram in ggplot2 is achieved through geom_histogram(). For these examples, we remove the 'PAN' group from the data set (filter(Label != 'PAN')):

# Plotting a histogram for a single lipid:
data %>%
  filter(`Label` != 'PAN') %>%
  ggplot(aes(x = `SM 41:1;O2`, 
             color = `Label`, 
             fill = `Label`)) +
  geom_histogram(alpha = 0.4) +
  scale_color_manual(values = c('royalblue', 'red2')) +
  scale_fill_manual(values = c('royalblue', 'red2')) +
  theme_bw()

The plot was lightly customized: the outline and fill colors were set to 'royalblue' for 'N' and 'red2' for 'T', the opacity of the fill was set to 0.4, and the default gray ggplot2 theme was changed to theme_bw(). We obtain the following plot:

We can add mean values to this plot using geom_vline():

# Computing mean values:
means <- 
  data %>%
  filter(`Label` != 'PAN') %>%
  select(`Label`, `SM 41:1;O2`) %>%
  group_by(Label) %>%
  summarise(mean = mean(`SM 41:1;O2`))
  
# Adding mean values to the plot:
data %>%
  filter(`Label` != 'PAN') %>%
  ggplot(aes(x = `SM 41:1;O2`, 
             color = `Label`, 
             fill = `Label`)) +
  geom_histogram(alpha = 0.4) +
  geom_vline(xintercept = means$mean, 
             colour = c('darkblue', 'red2'), 
             linetype = 'dashed',
             linewidth = 1) +
  scale_color_manual(values = c('royalblue', 'red2')) +
  scale_fill_manual(values = c('royalblue', 'red2')) +
  theme_bw()

Moreover, we can also add a density plot to our histogram. For both layers to share one y-axis, the histogram has to be drawn on the density scale instead of counts, which we indicate in the aesthetics of geom_histogram():

... +
geom_histogram(aes(y = after_stat(density)), ...) +
... +

# Adding density plot to our histogram:
# (after_stat(density) replaces the deprecated ..density.. notation)
data %>%
  filter(`Label` != 'PAN') %>%
  ggplot(aes(x = `SM 41:1;O2`, 
             color = `Label`, 
             fill = `Label`)) +
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4) +
  geom_density(alpha = 0,
               linewidth = 1) +
  geom_vline(xintercept = means$mean, 
             colour = c('darkblue', 'red2'), 
             linetype = 'dashed',
             linewidth = 1) +
  scale_color_manual(values = c('royalblue', 'red2')) +
  scale_fill_manual(values = c('royalblue', 'red2')) +
  theme_bw()
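
Why the density scale is needed can be checked with a small calculation. The sketch below uses base R's hist() on simulated data (hypothetical values) rather than ggplot2. On the density scale, bar height times bin width gives the fraction of observations in the bin, so the bar areas sum to 1. This is the same scale on which a kernel density curve is drawn:

```r
# Simulated values standing in for a lipid concentration (hypothetical)
set.seed(7)
conc <- rlnorm(200, meanlog = 2, sdlog = 0.5)

h <- hist(conc, breaks = 15, plot = FALSE)
bin_width <- diff(h$breaks)

# Density of a bin = count / (n * bin width)
manual_density <- h$counts / (length(conc) * bin_width)
all.equal(manual_density, h$density)

# The total bar area on the density scale equals 1
total_area <- sum(h$density * bin_width)
total_area
```

On the count scale, by contrast, the bar heights grow with the sample size, so a density curve drawn on the same axis would appear almost flat.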

For multiple lipids, we need to create a long tibble and use, e.g., facet_grid():

# Preparing a long tibble for plotting:
data.long <- 
  data %>%
  filter(`Label` != "PAN") %>%
  droplevels() %>%
  select(`Label`,
         `SM 39:1;O2`,
         `SM 40:1;O2`,
         `SM 41:1;O2`,
         `SM 42:1;O2`) %>%
  pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") 

# Computing mean values for vlines:
means <- 
  data %>%
  filter(`Label` != "PAN") %>%
  droplevels() %>%
  select(`Label`,
         `SM 39:1;O2`,
         `SM 40:1;O2`,
         `SM 41:1;O2`,
         `SM 42:1;O2`) %>%
  pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
               names_to = "Lipids",
               values_to = "Concentrations") %>%
  group_by(Label, Lipids) %>%
  summarise(mean = mean(Concentrations))

# Multipanel plot:
data.long %>%
  ggplot(aes(x = `Concentrations`, 
             color = `Label`, 
             fill = `Label`)) +
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4) +
  geom_density(alpha = 0,
               linewidth = 1) +
  geom_vline(data = means,
             aes(xintercept = mean),
             linetype = 'dashed',
             linewidth = 1,
             colour = rep(c('darkblue', 'red2'),
                          each = 4)) +
  scale_color_manual(values = c('royalblue', 'red2')) +
  scale_fill_manual(values = c('royalblue', 'red2')) +
  theme_bw() +
  facet_grid(. ~ Lipids, scales = "free_x")

We reuse most of the code from above. However, we need to modify geom_vline() appropriately. We pass means to the layer (geom_vline(data = means, ...)) and adjust the aesthetics (aes(xintercept = mean)). We select linetype and linewidth, and we need to supply the colours of our vlines (a vector with 8 colours, as 8 lines will be plotted in total). To avoid typing a vector with 8 entries manually, we let the rep() function prepare it for us: take 'darkblue' and 'red2' and repeat each 4 times to create a vector with 8 entries. Finally, we use facet_grid() to spread lipid names across columns (creating four panels), and we indicate that each panel should have an individual x-axis (scales = "free_x").
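
The behaviour of rep() can be inspected in isolation. In the base R sketch below, the data frame groups is a hypothetical stand-in for the rows of means, assuming they are ordered by Label first and by Lipids second, as a summary grouped by Label and Lipids would be:

```r
# each = 4 repeats every element four times in a row;
# times = 4 would instead repeat the whole vector four times
vline_cols  <- rep(c("darkblue", "red2"), each = 4)
alternating <- rep(c("darkblue", "red2"), times = 4)

# Hypothetical stand-in for `means`: 2 labels x 4 lipids = 8 rows,
# with Lipids varying fastest and Label slowest
groups <- expand.grid(
  Lipids = c("SM 39:1;O2", "SM 40:1;O2", "SM 41:1;O2", "SM 42:1;O2"),
  Label = c("N", "T"),
  stringsAsFactors = FALSE
)
groups$colour <- vline_cols
print(groups)
```

With this row order, all four N lines receive 'darkblue' and all four T lines receive 'red2'. If the rows of means were ordered differently, the colour vector would have to be rearranged accordingly.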

We obtain this nice chart:

The histograms can be further customized to improve the charts' appearance.

IMPORTANT: You have probably noticed differences between the ggpubr and ggplot2 histograms. In ggplot2, the position of the bars in geom_histogram() is set to 'stack' by default. If you want the bars of the N and T observations to be overlaid as in ggpubr, set the position to 'identity':

# Changing arrangement of bars in ggplot2 histogram:
... +
geom_histogram(position = 'identity') +
...
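
The difference between the two positions can also be seen in the computed bar coordinates. The following is a minimal sketch on simulated two-group data. The data frame sim and its columns are hypothetical and only mimic the N and T labels used above:

```r
library(ggplot2)

# Simulated two-group data (hypothetical values)
set.seed(3)
sim <- data.frame(
  Label = rep(c("N", "T"), each = 100),
  value = c(rnorm(100, mean = 10, sd = 2), rnorm(100, mean = 12, sd = 2))
)

# Default behaviour: position = "stack"
p_stack <- ggplot(sim, aes(x = value, fill = Label)) +
  geom_histogram(bins = 20, alpha = 0.4)

# Overlaid bars: position = "identity"
p_identity <- ggplot(sim, aes(x = value, fill = Label)) +
  geom_histogram(bins = 20, alpha = 0.4, position = "identity")

# layer_data() returns the coordinates of the bars that ggplot2 will draw
stack_bars    <- layer_data(p_stack)
identity_bars <- layer_data(p_identity)

any(stack_bars$ymin > 0)       # stacked: some bars sit on top of the other group's bar
all(identity_bars$ymin == 0)   # identity: every bar starts at zero
```

With position = 'identity', the bars of both groups start at zero and overlap, which is why a semi-transparent fill (alpha) is helpful.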

DOIs of the studies cited above, in order of citation:

  • M. R. Abedin et al.: https://doi.org/10.1038/s41598-021-86762-6
  • G. K. Patel et al.: https://doi.org/10.1038/s41598-019-41800-2
  • Z. Li et al.: https://doi.org/10.1038/s41467-021-23161-5
  • D. K. Barupal et al.: https://doi.org/10.1038/sdata.2018.263
  • A. Dakic et al.: https://doi.org/10.1038/s41467-024-45838-3
  • R. Tabassum et al.: https://doi.org/10.1038/s41467-019-11954-8
  • J. Zhou et al.: https://doi.org/10.1016/j.clinre.2024.102365

Figure captions, in order of appearance:

  • Histograms obtained through plot_normality() from the dlookr package.
  • The histogram obtained using the tidyplots R library. Four lines of code and a publication-ready plot is done!
  • A histogram obtained through gghistogram() (ggpubr library).
  • Histograms of multiple lipids plotted through gghistogram() and facet_grid().
  • A ggplot2 histogram. The layer was added through geom_histogram().
  • Histograms with mean values (geom_histogram() with geom_vline()).
  • Histograms with mean values and density plots (geom_histogram(), geom_vline(), geom_density()).
  • Histograms prepared using ggplot2.