Scatter plots

Metabolites and lipids descriptive statistical analysis in R

Scatter plots consist of dots, one per observation, plotted in the x-y plane. The horizontal (x) and vertical (y) positions correspond to the concentrations of two selected metabolites or lipids that are compared for each observation. Alternatively, you can plot the same metabolite or lipid on both axes, with outcomes from one group on the x-axis and from the other group on the y-axis. Scatter plots let us investigate basic relationships between two selected variables, or between-group differences in variable levels. In other words, a classic scatter plot is the simplest way of presenting a statistical relationship (correlation) between two variables.
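The relationship that a scatter plot shows visually can also be expressed as a number. The sketch below uses simulated values (the vectors cer and sm are hypothetical and are not the data set used in this chapter) to compute Pearson and Spearman correlation coefficients with base R:

```r
# Simulated (hypothetical) concentrations of two lipids in 50 samples.
set.seed(1)
cer <- rnorm(50, mean = 0.5, sd = 0.1)
sm  <- 20 * cer + rnorm(50, mean = 0, sd = 1)

# Pearson correlation: strength of the linear relationship.
pearson <- cor.test(cer, sm, method = "pearson")

# Spearman correlation: rank-based, so less sensitive to outliers.
spearman <- cor.test(cer, sm, method = "spearman")

# Correlation coefficients and the p-value of the Pearson test:
pearson$estimate
spearman$estimate
pearson$p.value
```

Both coefficients range from -1 to 1; values close to 1 correspond to a cloud of points rising from the lower left to the upper right corner of the plot.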

Further applications of scatter plots include calibration curves and score plots from PCA, t-SNE, or UMAP.
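As a rough illustration of the calibration-curve use case, the sketch below fits a straight line to made-up calibration points and reads an unknown concentration back from the line. All numbers and object names (conc, signal, unknown.signal) are assumptions chosen for the example only:

```r
# Made-up calibration points: known concentrations and measured signals.
conc   <- c(0.1, 0.5, 1, 2, 5, 10)
signal <- c(0.21, 1.02, 1.95, 4.10, 9.90, 20.3)

# Linear calibration model: signal = intercept + slope * concentration.
cal <- lm(signal ~ conc)
intercept <- coef(cal)[["(Intercept)"]]
slope     <- coef(cal)[["conc"]]

# Inverse prediction: estimate the concentration for a measured signal.
unknown.signal <- 6.0
estimated.conc <- (unknown.signal - intercept) / slope
estimated.conc
```

Plotting signal against conc as a scatter plot and adding the fitted line gives the familiar calibration-curve figure.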

Below, please find real-life examples of scatter plots used in lipidomics and metabolomics manuscripts:

D. Wolrab et al. Validation of lipidomic analysis of human plasma and serum by supercritical fluid chromatography–mass spectrometry and hydrophilic interaction liquid chromatography–mass spectrometry. DOI: https://doi.org/10.1007/s00216-020-02473-3 - Fig. 5 (authors present correlations of lipid concentrations obtained using the IS Mix or the SPLASH Lipidomix for the quantitation of serum or plasma; lipid concentrations were expressed as decimal logarithm).
D. Wolrab et al. Validation of lipidomic analysis of human plasma and serum by supercritical fluid chromatography–mass spectrometry and hydrophilic interaction liquid chromatography–mass spectrometry. DOI: https://doi.org/10.1007/s00216-020-02473-3 - Fig. 4 (the same manuscript, authors present scatter plots - application as calibration curves).
D. Wolrab et al. Validation of lipidomic analysis of human plasma and serum by supercritical fluid chromatography–mass spectrometry and hydrophilic interaction liquid chromatography–mass spectrometry. DOI: https://doi.org/10.1007/s00216-020-02473-3 - Fig. 2 (the same manuscript, authors present correlation of retention factors (decimal logarithm) obtained by SFC/MS and HILIC-UHPLC/MS for lipid classes).
S. G. Snowden et al. Combining lipidomics and machine learning to measure clinical lipids in dried blood spots. DOI: https://doi.org/10.1007/s11306-020-01703-0 - Fig. 1, Fig. 2, Fig. 3, Fig. 4 (the authors of the manuscript published in Metabolomics (Springer-Nature) extensively utilized scatter plots in their classic form to compare predicted vs. measured concentrations of triglycerides, HDL, LDL, and total cholesterol in both discovery and validation analyses).
M. Höring et al. Sex-specific response of the human plasma lipidome to short-term cold exposure. DOI: https://doi.org/10.1016/j.bbalip.2024.159567 - Fig. 4 (the authors used multiple scatter plots for presenting the correlation between PC, TG, and CE log2(fold change) and BMI).
S. Matysik et al. Unique sterol metabolite shifts in inflammatory bowel disease and primary sclerosing cholangitis. DOI: https://doi.org/10.1016/j.jsbmb.2024.106621 - Fig. 2A & B, Fig. 4A - D, Fig. 5A - B (presenting a variety of correlations, e.g., correlation of serum cholesterol with C-reactive protein; Cholesterol, C 27-hydroxycholesterol, and C 7-dehydrocholesterol in relation to fecal calprotectin, etc.).
F. Torta et al. Concordant inter-laboratory derived concentrations of ceramides in human plasma reference materials via authentic standards. DOI: https://doi.org/10.1038/s41467-024-52087-x - Fig. 2 (the authors utilized scatter plots to compare single-point and multi-point calibration outcomes in the quantitative analysis of plasma ceramides (based on correlation)).

Preparing scatter plots via ggstatsplot (level: basic)

The ggstatsplot library contains a function called ggscatterstats(), which produces a scatter plot with marginal histograms added above and to the right of it. The function can also perform hypothesis testing if necessary (we will skip this option by setting results.subtitle = FALSE). Its application is straightforward: it expects a data frame and two variables to be plotted on x and y:

# Calling libraries:
library(ggstatsplot)
library(tidyverse)

# Reading the function's documentation:
?ggscatterstats

# Basic data filtering:
data.no.PAN <-
  data %>%
  filter(Label != "PAN")

# Plotting a basic ggscatter plot:
ggscatterstats(data.no.PAN,
               x = "Cer 41:1;O2", 
               y = "SM 41:1;O2",
               title = 'Scatter plot Cer 41:1;O2 to SM 41:1;O2',
               results.subtitle = FALSE,
               label.var = "Label",
               label.expression = `SM 41:1;O2` > 12.5 | `SM 41:1;O2` < 5 & `Cer 41:1;O2` < 0.45 | `Cer 41:1;O2` > 0.6)

Additionally, we want to see whether the differences in plasma levels of SM 41:1;O2 and Cer 41:1;O2 are strong enough to separate N controls from T patients. To check this, we add labels to observations in the upper and lower ranges of SM 41:1;O2 and Cer 41:1;O2 concentrations:

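# Labeling observations (in R, & is evaluated before |, hence the parentheses):
ggscatterstats(data.no.PAN, x = "Cer 41:1;O2", y = "SM 41:1;O2",
               results.subtitle = FALSE,
               label.var = "Label",
               label.expression = `SM 41:1;O2` > 12.5 | (`SM 41:1;O2` < 5 & `Cer 41:1;O2` < 0.45) | `Cer 41:1;O2` > 0.6)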

The output:

As you can see, the lower left corner mostly contains patients with pancreatic cancer (T), while the upper right corner mostly contains healthy volunteers (N). In this way, we used the scatter plot to investigate whether the N and T groups differ in plasma levels of two lipids, Cer 41:1;O2 and SM 41:1;O2. In addition, we see that Cer 41:1;O2 and SM 41:1;O2 levels are positively correlated: given high levels of one lipid, one can expect high levels of the other (analysis of relationships between variables). Put another way, given low levels of Cer 41:1;O2, we can also expect low levels of SM 41:1;O2 in a plasma sample.
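The labeling condition combines | and & without parentheses, so it is worth checking which observations it actually selects. The sketch below evaluates the same thresholds on a small made-up data frame (the columns SM and Cer are hypothetical stand-ins for the two lipids) and shows that R evaluates & before |:

```r
# Made-up observations; SM and Cer stand in for the two lipid columns.
toy <- data.frame(
  Label = c("N", "T", "T", "N", "T"),
  SM    = c(13.0, 4.0, 4.0, 8.0, 8.0),
  Cer   = c(0.50, 0.40, 0.50, 0.70, 0.50)
)

# The condition as written in the plotting call: & binds tighter than |.
as.written <- with(toy, SM > 12.5 | SM < 5 & Cer < 0.45 | Cer > 0.6)

# The same condition with the implicit grouping made explicit.
explicit <- with(toy, SM > 12.5 | (SM < 5 & Cer < 0.45) | Cer > 0.6)

# Both versions select the same rows.
identical(as.written, explicit)

# Rows that would receive a label on the plot:
toy[as.written, ]
```

An observation is therefore labeled if SM is above 12.5, or Cer is above 0.6, or SM is below 5 and Cer is below 0.45 at the same time. If a different grouping is intended, add parentheses accordingly.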

Similarly to the ggbetweenstats() function, ggscatterstats() enables further customization through its arguments (changing colors, fonts, theme, etc.).

Preparing scatter plots via ggpubr (level: intermediate)

The ggpubr package offers the ggscatter() function, which is as simple to apply as ggscatterstats(). See the examples below:

# Calling libraries:
library(ggpubr)
library(tidyverse)

# Reading function's documentation:
?ggscatter

# Plotting a simple scatter plot:
ggscatter(data, 
          x = "Cer 41:1;O2", 
          y = "SM 41:1;O2")

We obtain:

We can filter our data (remove the 'PAN' group), change the points' shape to 21 (filled circle), increase their size to 4, fill the points with colors according to the biological group, add a linear regression line, compute the R2 goodness of fit and the correlation coefficient (let's select the Spearman correlation) with its p-value, and set the axis labels and the plot title:

# Customizing ggscatter and adding additional statistics:
ggscatter(data.no.PAN, 
          x = "Cer 41:1;O2", 
          y = "SM 41:1;O2",
          fill = "Label",
          shape = 21,
          size = 4,
          palette = c("royalblue", "red2"),
          add = 'reg.line',
          add.params = list(color = "darkblue", fill = "lightgray"),
          conf.int = TRUE,
          cor.coef = TRUE, 
          cor.coeff.args = 
            list(method = "spearman", 
                 label.x = 0.8,
                 label.y = 25,
                 label.sep = "\n"),
          xlab = "Concentration of Cer 41:1;O2 [nmol/ml]",
          ylab = "Concentration of SM 41:1;O2 [nmol/ml]",
          title = "Scatter plot of Cer 41:1;O2 and SM 41:1;O2") +
  stat_regline_equation(label.y = 22, label.x = 0.8, aes(label = after_stat(eq.label))) +
  stat_regline_equation(label.y = 21, label.x = 0.8, aes(label = after_stat(rr.label)))

The output:

Changing the fill color, the points' shape, and their size works the same as in the previous examples of ggpubr plots, i.e., select the appropriate color, fill, palette, shape, size, etc. To add a linear regression line, we use the add argument. In add.params, we can specify a list with parameters controlling the appearance of the regression line (color, fill, etc.). To plot the confidence interval around the regression line, we set conf.int to TRUE. The linear regression equation and R2 are obtained through stat_regline_equation() (see the last two lines): the first call produces the equation, the second one R2. The label.x and label.y arguments specify the position of the output in the plot. To obtain the correlation coefficient, we first need to set cor.coef to TRUE. Using cor.coeff.args, we pass a list of parameters specifying the type of correlation ('pearson', 'kendall', or 'spearman') and the location of the correlation coefficient and p-value on the plot (label.x, label.y). The x- and y-axis titles are set through xlab and ylab, and the plot title via the title argument. The ggpubr scatter plot allows for many other customization options; for more, please check the function documentation (?ggscatter).
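The numbers that ggscatter() and stat_regline_equation() print on the plot can also be computed directly, which is handy for reporting them in a table. The following is a minimal sketch on simulated data (the data frame toy and its columns cer and sm are assumptions, not the chapter's data set), using lm() for the regression and cor.test() for the Spearman correlation:

```r
# Simulated (hypothetical) concentrations of two lipids.
set.seed(2)
cer <- runif(40, min = 0.3, max = 0.9)
sm  <- 5 + 15 * cer + rnorm(40, sd = 1.5)
toy <- data.frame(cer = cer, sm = sm)

# Linear regression of sm on cer: the basis of the equation label.
fit <- lm(sm ~ cer, data = toy)
coef(fit)  # intercept and slope

# R2 goodness of fit: the basis of the R2 label.
r2 <- summary(fit)$r.squared
r2

# Spearman correlation coefficient with its p-value.
rho <- cor.test(toy$cer, toy$sm, method = "spearman")
rho$estimate
rho$p.value
```

For a regression with a single predictor, R2 equals the squared Pearson correlation coefficient, so it generally differs from the square of the Spearman coefficient shown on the same plot.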

Now, let's prepare one more plot. Assume we want to compare the correlations between SM 41:1;O2 and Cer 41:1;O2 separately for all N (controls) and all T (patients with PDAC). We also want the size of the points to change according to the SM 41:1;O2 concentration. We prepare two plots (Plot.1 and Plot.2):

# Filtering all N and T:
data.N <-
  data %>%
  filter(Label == "N")

data.T <-
  data %>%
  filter(Label == "T")
  
# Preparing Plot.1:
Plot.1 <- 
  ggscatter(data.N, 
          x = "Cer 41:1;O2", 
          y = "SM 41:1;O2",
          fill = 'royalblue',
          shape = 21,
          size = "SM 41:1;O2",
          add = 'reg.line',
          add.params = list(color = "darkblue", fill = "lightgray"),
          conf.int = TRUE,
          cor.coef = TRUE, 
          cor.coeff.args = 
            list(method = "pearson", 
                 label.x = 0.8,
                 label.y = 25,
                 label.sep = "\n"),
          xlab = "Concentration of Cer 41:1;O2 [nmol/ml]",
          ylab = "Concentration of SM 41:1;O2 [nmol/ml]",
          title = "Scatter plot of Cer 41:1;O2 and SM 41:1;O2") +
  stat_regline_equation(label.y = 22, label.x = 0.8, aes(label = after_stat(eq.label))) +
  stat_regline_equation(label.y = 21, label.x = 0.8, aes(label = after_stat(rr.label)))

# Plot.2:
Plot.2 <- 
  ggscatter(data.T, 
            x = "Cer 41:1;O2", 
            y = "SM 41:1;O2",
            fill = 'red2',
            shape = 21,
            size = "SM 41:1;O2",
            add = 'reg.line',
            add.params = list(color = "darkblue", fill = "lightgray"),
            conf.int = TRUE,
            cor.coef = TRUE, 
            cor.coeff.args = 
              list(method = "pearson", 
                   label.x = 0.6,
                   label.y = 25,
                   label.sep = "\n"),
            xlab = "Concentration of Cer 41:1;O2 [nmol/ml]",
            ylab = "Concentration of SM 41:1;O2 [nmol/ml]",
            title = "Scatter plot of Cer 41:1;O2 and SM 41:1;O2") +
  stat_regline_equation(label.y = 22, label.x = 0.6, aes(label = after_stat(eq.label))) +
  stat_regline_equation(label.y = 21, label.x = 0.6, aes(label = after_stat(rr.label)))

We obtained two plots. It would be good to put them side by side to compare the Pearson correlations. We can achieve this using the ggarrange() function from the ggpubr package. The ggarrange() function is a very useful tool that can be applied to ggstatsplot, ggpubr, and ggplot2 plots to merge them into one image. When several plots are merged, their positions can be aligned (or adjusted), and a common legend can be plotted. For now, we only want you to know that such a tool exists; we will show how to use it in detail in the part on customization:

# Merging Plot.1 and Plot.2 into one image:
ggarrange(Plot.1, 
          Plot.2, 
          common.legend = TRUE, 
          legend = 'right')

We obtain:

As you can see, the Pearson correlation between SM 41:1;O2 and Cer 41:1;O2 was slightly higher for patients with pancreatic cancer (T). The linear regression model also fits the T data better. The linear regression could, of course, be used to predict the concentration of SM 41:1;O2 based on the concentration of Cer 41:1;O2 and vice versa.
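The group-wise comparison and the prediction mentioned above can be sketched with base R. The example uses simulated data (the data frame toy with columns Label, cer, and sm is hypothetical) to compute the Pearson correlation within each group and to predict sm for new cer values from the model fitted to the T group:

```r
# Simulated (hypothetical) data: 30 controls (N) and 30 patients (T).
set.seed(3)
toy <- data.frame(
  Label = rep(c("N", "T"), each = 30),
  cer   = c(runif(30, 0.5, 0.9), runif(30, 0.3, 0.7))
)
toy$sm <- 4 + 14 * toy$cer + rnorm(60, sd = 1.5)

# Pearson correlation computed separately within each group.
by.group <- sapply(split(toy, toy$Label), function(d) cor(d$cer, d$sm))
by.group

# Linear model fitted to the T group only.
fit.T <- lm(sm ~ cer, data = subset(toy, Label == "T"))

# Predicted sm (with 95% prediction intervals) for two new cer values.
new.samples <- data.frame(cer = c(0.4, 0.6))
pred <- predict(fit.T, newdata = new.samples, interval = "prediction")
pred
```

Note that predicting in the opposite direction (cer from sm) requires fitting a separate model, lm(cer ~ sm), because the regression line of y on x is in general not the inverse of the regression line of x on y.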

Preparing scatter plots via ggplot2 (level: advanced)

The scatter plot can be prepared by adding a layer of points to the x-y aesthetics through geom_point() (here we immediately customized the fill color):

# Calling library
library(tidyverse)

# Filtering the data set:
data.no.PAN <-
  data %>%
  filter(Label != "PAN")

# Simple scatter plot via ggplot2:
ggplot(data.no.PAN, aes(x = `SM 41:1;O2`, y = `Cer 41:1;O2`, fill = `Label`))+
  geom_point(shape = 21, size = 4) +
  scale_fill_manual(values = c('royalblue', 'red')) + 
  theme_bw()

The output:

To simplify adding further elements to the scatter plot and computing statistics, we can use the excellent R package ggpmisc. You can find more about this package in its documentation.

Using stat_poly_line(), we can add the fitted line of a linear model (with a confidence band), and via stat_correlation() and stat_poly_eq() we can add statistics such as the Pearson correlation, the model equation, R2, and the F-value with its p-value (from the test comparing the fitted model with a model containing no predictor variables, i.e., whether the estimated coefficients improve the fit). Here is the modified code:

# Install package ggpmisc (method 1):
install.packages("ggpmisc")

# Or alternatively, install the development version from GitHub:
# install.packages("devtools") # Use if you did not install devtools earlier.
# Then run:
# devtools::install_github("aphalo/ggpmisc")

# Call ggpmisc:
library(ggpmisc)

# Adding regression lines and statistics to our scatter plot with ggpmisc:
ggplot(data.no.PAN, 
       aes(x = `SM 41:1;O2`, 
           y = `Cer 41:1;O2`, 
           fill = `Label`,
           color = `Label`))+
  geom_point(shape = 21, size = 4, color = 'black') +
  scale_fill_manual(values = c('royalblue', 'red')) + 
  stat_poly_line() +
  scale_color_manual(values = c('royalblue', 'red')) + 
  stat_poly_eq(label.y = c(0.05, 0.0),
               label.x = c(9,10),
               use_label(c("eq", "adj.R2", "f", "p", "n"))) +
  stat_correlation(method = 'pearson') +
  theme_bw()

In stat_poly_eq(), we specified label.x and label.y, i.e., the positions of the labels (one value per group). The use_label() function accepts a vector naming the statistics to be included in the label. In our case, these are:

"eq" - equation,
"adj.R2" - adjusted R2,
"f" - F-value,
"p" - p-value,
"n" - number of observations.

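To see where these statistics come from, they can be extracted from an ordinary lm() fit. The sketch below uses simulated values (the vectors sm and cer are hypothetical) and base R only; it illustrates the quantities themselves and does not claim to reproduce the internal implementation of ggpmisc:

```r
# Simulated (hypothetical) concentrations of two lipids in 40 samples.
set.seed(4)
sm  <- runif(40, min = 4, max = 16)
cer <- 0.2 + 0.03 * sm + rnorm(40, sd = 0.05)

# Linear model with cer as the response, matching the axes of the plot.
fit <- lm(cer ~ sm)
s   <- summary(fit)

coef(fit)         # "eq": intercept and slope of the equation
s$adj.r.squared   # "adj.R2": adjusted R2
s$fstatistic      # "f": F-value with its degrees of freedom

# "p": p-value of the F-test against the intercept-only model.
f <- s$fstatistic
p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
unname(p)

nobs(fit)         # "n": number of observations
```

With a single predictor, the p-value of the F-test coincides with the p-value of the t-test for the slope reported by summary(fit).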
The correlation coefficients are added through the stat_correlation() function (as a method we again select 'pearson'). The final output:
