Box plots
Metabolites and lipids descriptive statistical analysis in R
Last updated
Metabolites and lipids descriptive statistical analysis in R
Last updated
Box plots provide deep insight into descriptive statistics, considering that one can visualize the median, first quartile, third quartile, outliers, and minimum and maximum values using box plots. Box plots are frequently co-plotted with dot plots or density plots - to represent the data set unambiguously. For this reason, box plots frequently appear in lipidomics and metabolomics manuscripts for comparing outcomes among experimental groups.
Check out the real-life examples provided below:
O. Vvedenskaya et al. Nonalcoholic fatty liver disease stratification by liver lipidomics. DOI: - Fig. 1 (authors used box plots co-plotted with dot plots for comparing sums of lipid species concentrations within lipid classes among four different experimental groups).
J. Idkowiak et al. Robust and high-throughput lipidomic quantitation of human blood samples using flow injection analysis with tandem mass spectrometry for clinical use. DOI: - Fig. 6 (authors used box plots with dot plots to present downregulation of SM 41:1;O2, LPC 18:2, and Cer 42:1;O2 in plasma of kidney cancer patients (T—red) compared to healthy volunteers (N— blue) as measured by three different analytical methods, including two types of FIA-MS/MS and UHPSFC/MS).
D. Wolrab et al. Plasma lipidomic profiles of kidney, breast and prostate cancer patients differ from healthy controls. DOI: - Fig. 4C & 4E, Fig. 5M and 5N (the authors employ box plots enhanced with dot plots to illustrate lipid concentration differences for various lipids identified by OPLS-DA as distinguishing features between patients with kidney, breast, and prostate cancers and healthy volunteers).
S. Salihovic et al. Identification and validation of a blood- based diagnostic lipidomic signature of pediatric inflammatory bowel disease. DOI: - Fig. 4a (box plots enhanced with dot plots were used to present the log-transformed unit variance scaled distribution of LacCer(d18:1/16:0) and PC(18:1/22:6) in individuals with IBD compared to symptomatic controls in the validation cohort).
R. Jirásko et al. Altered Plasma, Urine, and Tissue Profiles of Sulfatides and Sphingomyelins in Patients with Renal Cell Carcinoma. DOI: - Fig. 2C - D, Fig. 3A - C, Fig 4B - D (box plots were used throughout the manuscript for presenting alterations in sphingolipid levels in plasma, urine, and tissues caused by kidney cancer).
D. Wolrab et al. Lipidomic profiling of human serum enables detection of pancreatic cancer. DOI: - Fig. 3g - j, Fig. 4 a - c, Fig. 7b - f, Fig. 8 b - e (the authors often use box plots to visualize alterations in serum levels of lipids caused by pancreatic cancer (PDAC), and pancreatitis).
O. Peterka & A. Maccelli et al. HILIC/MS quantitation of low-abundant phospholipids and sphingolipids in human plasma and serum: Dysregulation in pancreatic cancer. DOI: - Fig. 3E - (authors used box plots with dot plots to present alterations in plasma and serum levels of low-abundant glycosphingolipid caused by pancreatic cancer).
E. Ivanovová et al. Plasma Short-Chain Fatty Acids and Their Derivatives in Women with Gestational Diabetes Mellitus. DOI: - Fig. 1 & 2 (the authors use classic box plots to show distributions of clinical parameters across the studied experimental groups).
H. Zhang et al. Single-cell lipidomics enabled by dual-polarity ionization and ion mobility-mass spectrometry imaging. DOI: - Fig. 7 f-k (the application in mass spectrometry imaging; the authors employed box plots combined with dot plots to present the relative intensity ratios of selected ions to the base peak (m/z 313.1428, derived from the DAN matrix) for comparing metabolite levels between two experimental groups).
N. De Silva Mohotti et al. Lipidomic Analysis Reveals Differences in the Extent of Remyelination in the Brain and Spinal Cord. DOI: - Fig. 2, Fig. 3 & Fig. 7 (the authors used box plots for comparing total lipid class concentrations across different experimental groups in a study dedicated to remyelination in the brain and spinal cord).
D. Hornburg et al. Dynamic lipidome alterations associated with human health, disease and ageing. DOI: - Fig. 1d & 1e (the authors used box plots, e.g., for presenting a concentration range of determined lipid classes and subclasses across all participants).
The ggstatsplot is a great package dedicated to novices with R. One function can deliver beautiful box plots, box plots co-plotted with dot plots, or violing box plots with complete descriptive and univariate statistics. More information about the ggstatsplot package can be found here:
Here, we will show you how to use the ggbetweenstats() function. We keep using the lipidomics data set saved in the global environment as 'data' tibble:
This single line of code produces the eye-catching plot:
This publication-ready plot is a violin box plot. It contains maximum details - the library selects appropriate statistical tests and computes descriptive statistics. The ggbetweenstats() function enables further customization of the plot, e.g., by using functions from the ggplot2 package. For instance, we can change the color of the dots' filling to royal blue, orange, and red2 using scale_color_manual, and adjust the plot theme using theme():
The output:
However, it is not possible to fill box plots or violin box plots with different colors, e.g., according to the biological group. Using fill argument in violin.args or boxplot.args, only one color of filling can be selected for all box plots or violin plots.
Let's try to drop geom_violin() by setting the width to 0 (box plots overlaid with dot plot), or geom_point() by setting transparency to 0 (violin box plots only):
The outputs:
Statistical details around the plot can be easily dropped by setting results.subtitle to FALSE:
We obtain:
The user can easily change x and y-axis titles (xlab, ylab arguments), and add a plot title (title argument) and subtitle (subtitle argument), or even plot caption (caption argument).
The output:
The ggbetweenstats also enables visualizing centrality tendency (centrality.plotting = TRUE). A user can select between mean (parametric statistics), median (non-parametric statistics), trimmed mean (robust statistics), and MAP estimators (Bayesian statistics). The graphical representation (a symbol and label) can be further adjusted using centrality.point.args and centrality.label.args. See the example below:
The output:
The type of statistical tests used by ggbetweenstats() depends on:
the number of biological groups (determined automatically),
the selected statistical approach, which can be changed through the type argument to: "parametric", "non-parametric", "robust", and "bayes",
equality of variances, which is specified through var.equal.
For two-sample between-subjects comparisons, the Student's t-test (equal variances) or Welch's t-test (unequal variances) is selected via type = 'parametric' and the Mann-Whitney U test through type = 'nonparametric'.
If more than two groups are present in the data set, type = 'parametric' leads to Fisher's (equal variances) or Welch's one-way ANOVA (unequal variances), and Games-Howell test (unequal variances) or Student's t-test (unequal variances) for pairwise comparisons. Setting type to 'nonparametric' results in the application of the Kruskal-Wallis one-way ANOVA and the Dunn test for pairwise comparisons.
The obtained p-values are adjusted for multiple comparisons, and the correction method can be changed by setting p.adjust.method. See the examples below:
The outputs with explanations:
Important: Remember to adjust centrality.type to 'parametric' when selecting t-test or ANOVA and to 'nonparametric' if you select Mann Whitney U test or Kruskal-Wallis test.
If more box plots are created, you can combine all of them into one grid of plots using combine_plots(). Plots are delivered as a list. Using plotgrid.args, the arrangement of plots can be adjusted:
The output:
The ggstatsplot library offers the ggwithinstats() function to handle studies with paired samples. Paired samples are, e.g., biological materials collected from the same subjects - healthy and tumor tissues, collections repeated over time, or repeated measurements of the same material. Using ggwithinstats(), box plots are obtained and dots corresponding to paired samples are connected by dashed lines. Samples should be arranged in the same way in every biological group to obtain the correct plot.
We downloaded the data from the quantitative measurements of tissues to produce this plot, the article is available here:
Go to supplementary materials to the article, select Excel file: 'Supplementary_Data_S1_Cancers.xlsx', and download the data: 'Clinical data together with intensities of individual lipids normalized to internal standard in tissue samples', sheet: Data S1I. Read them into R as 'data.paired'. Adjust column types.
See the example below:
The output:
Similarly to ggbetweenstats(), the centrality and test type can be changed via type and centrality.type arguments, e.g. you can select between 'parametric' and 'nonparametric'. Plots obtained using ggwithinstats() can also be gently customized using ggplot2 functions, e.g. theme_bw() or scale_color_manual() - as in the example above.
Important: All these plots can be simply exported from RStudio using the 'Export' option (look for it right above your plot). Select size, format, directory, and save the plot.
The R tidyplots can also be used to generate publication-ready box plots. It relies on the same simple concept as in the case of bar charts, i.e., one needs to create a tidyplots object and add layers of visualization that can be adjusted (customized) in the next step. Tidyplots is also fully compatible with tidyverse solutions.
Take a look at the code blocks below to prepare box plots for a single variable and for more lipids co-plotted in one x-y plane.
Final output:
If one would use a long tibble with several selected lipid (metabolite) concentrations, tidyplots can be used to plot them all in one x-y plane:
The obtained plot:
For more information, we encourage you to visit the tidyplots-dedicated website and pre-print published by authors:
Classic and simple box plots in ggpubr are generated using ggboxplot() function. The ggboxplot() function is very similar in handling to the ggbarplot() function, as you will see in the example below:
In this way, we obtain:
We can gently customize the box plots, by changing the color of the filling to royal blue (N), orange (PAN), and red2 (T) using the palette argument:
Changing the color of the filling:
According to the documentation, except for custom palettes, palette accepts also brewer palettes ("RdBu", "Blues"), or even palettes from other packages like ggsci R (scientific journal palettes).
The ggboxplot function contains the argument 'notch', which by default is set to FALSE. If we switch it to TRUE, the boxes will be narrowed around the median, illustrating comparison intervals for medians. If notches of two boxes do not overlap, then the differences between the medians around which these notches were created are significant. See the example below:
The obtained plot:
The confidence interval illustrated by notches may be wider than the interquartile range. This happens, for example, if a group contains a small number of observations (for small numbers of observations confidence intervals are usually wide). In this case, we will observe the following effect:
Usually, it is better to switch the argument notch in such cases to FALSE. It will be also suggested to you by the library (see figure above panel B).
Box plots are frequently co-plotted with dots representing observations to depict the collected sample of a population unambiguously. The ggpubr function enables adding 'dotplot', 'point', or 'jitter':
The output:
The ggpubr library also offers a solution for plotting paired samples as box plots - the ggpaired() function. To demonstrate it, we will use the same data set from the article of Jirásko et al. (please check the data set used for ggwithinstats() function from the ggstatsplot package). First, we check the documentation of the new function:
As you see, the function expects one tibble (data frame) with data. However, concentrations corresponding to condition 1 (cond1 argument, our group 'N') and condition 2 (cond2 argument, our group 'T') must be split into two individual columns:
ggpaired(<tibble with data>,
cond1 = <column_1_with_all_conc_for_N_for_lipid_X>,
cond1 = <column_2_with_all_conc_for_T_for_lipid_X>,
...)
We will need tidyverse tools to remodel our 'data.paired' tibble first:
Before explaining the code block from above, remember that the 'data.paired' contains concentrations of lipids measured in adjacent and tumor tissues, but both N and T parts are collected from one subject. Hence, these samples are paired. Here is a glimpse at the 'data.paired' tibble:
As you see, we take the entire 'data.paired' tibble, filter out all rows containing 'N', and select columns Sample code
and SM 42:1;O2
. You know all of these operations already. However, at the end of the pipeline, we added one more new function from dplyr, i.e. rename_if(). This is because if we need to deliver concentrations of lipid, e.g. SM 42:1;O2, to ggpaired() in two separate arguments (cond1, cond2), so as two separate columns. These columns, both in the same data frame, cannot have the same name - SM 42:1;O2
. Using rename_if() - we rename all numeric columns, and to the name of each numeric column ".N'.
Next, we perform similar operations for all rows containing "T" label, until the mutate() function. If you look at the Sample code
column, you will realize that the authors added to the name of every tumor sample "T" at the beginning. We will need to remove this "T". We will explain why shortly. To remove "T" at the beginning of every entry in Sample code
, we use mutate() and str_sub() functions. The str_sub() function is from stringr package (tidyverse collection). The stringr contains functions that allow handling strings of characters. The str_sub() will extract from every entry in the Sample code
second character and all the rest of the characters (omitting the first one - "T"), and mutate() will be used to remodel the Sample code
column. Once this is performed, we pipe data to rename_if(), to add to every numeric column name ".T" at the end (the same way as we did for all "N" samples).
After these operations, we have two tibbles: data.N and data.T. Using left_join() from dplyr, we can merge them into one 'data.paired.final' tibble. We will use the Sample code
column to merge these two tibbles (this is why we needed to have identical entries in both tibbles in the Sample code
) column:
Creating tibble for plotting box plots with ggpaired() is also shown in the figure below:
Now, we can finally generate our plot, using the following code:
We obtained this gently customized plot:
The customization of ggpubr box plots will be shown in the next subchapter.
The ggplot2 library offers the highest flexibility in modifying the box plot charts, resulting in beautiful plots. However, the complete customization potentially leads to complicated code, which may be difficult to comprehend for a beginner. Hence, we decided that the ggplot2 is the most complicated solution of all three proposed here. Importantly, the ggstatsplot and ggpubr packages rely on ggplot2. Box plots in ggplot2 are obtained through geom_boxplot() added to aesthetics mappings:
The output:
Obtaining a plot via long tibble is also presented in the figure below as a reminder:
Through the notch argument, we can easily add notches to the box plots. We can also change the filling color through scale_fill_manual():
The output:
If the notches go outside hinges, the ggplot2 library will also suggest switching the argument notch back the the default FALSE.
Now, we will focus on adding observations to box plots, e.g., in the form of points. We first modify the code for box plots to depict a single lipid. A layer containing points can be added through geom_point(), as in the example below:
And we obtain:
As you see in the code above, we also set the outlier.shape in the geom_boxplot() to NA, to remove outliers from this plot. This is to avoid plotting the same points twice - first through geom_boxplot(), and second - via geom_point().
The position of points can be adjusted through jitter, so the points are not so strictly attached to the box plot's axis of symmetry (argument position = position_jitter(width = x) in the geom_point():
The output:
The way the observations are presented can be adjusted using geom_point() arguments. For instance, the argument shape allows for changing points' shape. All points' shapes are presented here:
Or here:
Important: The geom_point contains, e.g., a circle (no. 1) and a filled circle (no. 21). If you want to customize the color of the circles' filling, do not use empty shapes - it will not work! Always use the filled shapes to customize the color of the filling, e.g., shape no. 21 (for circle).
Below, you will find some examples of customizing points' shapes:
The output:
The points can be further modified in terms of size, transparency, etc. We will show the customization in the next subchapter.
Now, let's add points to box plots depicting multiple lipid species. This is a more demanding task.
The output:
So far, no complications. However, what if we would like to jittered points? Let's try what will happen:
We obtain this plot:
As you see, our points were filled according to biological groups, but unfortunately, all remained collected in one group around the central lipid label, instead of split across biological groups. We can try to fix this situation using the base R function - interaction(). The interaction() computes a factor that represents an interaction of selected factors. Using interaction(), we will create on the x-axis a label composed not only of Lipids
but albo Label
entries:
The output:
We fixed the issue of overlapping jitter points. However, we would need to reorganize the x-axis labels as they overlap, and it is not possible to read them. As the labels are long, we could present them using an angle, e.g.:
The output:
This chart certainly requires further customization, e.g. we could remove the gray background, adjust x-y axes, change x-axis labels, add a plot title, etc. We will show you how to customize this plot in the next subchapter.
Here, we will show you one more way to solve the issue of overlapping jitter points through facet_grid() and theme(). The facet_grid() creates a grid of plots for all combinations of our two categorical variables: Lipids
and Label
. Using the facet_grid() function is also explained by the ggplot2 authors here:
Through facet_grid(), we can introduce our second categorical variable into the plot code. There are a couple of ways we can use facet_grid(), e.g.:
facet_grid(. ~ Lipids) - lipid names from the Lipids
column will be spread across columns. Under every lipid name, a plot will be placed.
facet_grid(. ~ Label) - here, biological groups from the Label
column will be spread across columns. Under every biological group, a plot will be placed.
facet_grid(Lipids ~ .) - here, lipid names from the Lipids
column will be spread across rows, and a plot will be placed in every row.
etc.
We would be interested in the first option. We implement the function into the code:
The output:
The lipid names in the gray strips can be flipped, so that they are on the bottom. We use the switch argument from the facet_grid():
facet_grid(..., switch = "x")
And in the code:
We obtain:
Now, using theme(), we can change the strip position with x-axis labels and remove the gray strip background. We set strip.placement to "outside" and strip.backgroud to "element_blank()":
We obtain:
Finally, we would be interested in all box plots being presented in one x-y plane. As they are plotted along the same y-axis, we just need to remove x-axis spacing through panel.spacing.x. This argument expects an <unit> object, we can create it through unit(0, 'cm'). The final code:
And our final plot:
The box plot type used by us in all examples is a classic Tukey box plot, assuming the normal distribution of all samples of populations. However, it is common that in lipidomics and metabolomics data occur outliers. Because of their presence, skewed distributions are frequently observed for multiple features in the data set. In such a case, an adjusted box plot for skewed distributions may be a better solution than the classic Tukey box plot. We can produce the adjusted box plot by using stat_adj_boxplot() function from the litteR package:
The stat_adj_boxplot() function computes all parameters of the adjusted box plot to be used by ggplot2. The implementation of the function is very simple - we just substitute geom_boxplot() with stat_adj_boxplot():
The output:
A comparison of Tukey box plots and the adjusted box plots for skewed distributions is presented below (take a look at the width of the whiskers):
A very popular form is also co-plotting box plots with density plots, known as violin plots. Violin plots are created by geom_violin(). As we usually want our violin plot to be in the background of the box plot, not covering it, we add the layer with the violin plot first (geom_violin()), then the layer with box plots (geom_boxplot()), and finally we add the layer with points (geom_points). We can also change the width of the box plot to 0.4, so it is not significantly wider than the violin plot:
The output:
Finally, in ggplot2, you can also create box plots for comparing paired samples. It is only necessary to add another layer through geom_line() (lines connecting points) and specify grouping variable (e.g., sample names, which should be the same for all N and T samples):
.... + geom_line(aes(group = <name_of_grouping_variable>)).
Lines can connect dodged or jittered points.
To demonstrate it, we will again use the data set from the article of Jirásko et al. (please check the data set used for ggwithinstats() function from the ggstatsplot package). The data set is stored in the global environment as 'data.paired'. Here is the code:
We obtain the following plot:
If we want to plot more than one lipid in the x-y plane, we prepare a long tibble. Remember, that we will also need the Sample code
column:
We add another layer through geom_line(), and group using the same codes (Sample code
column) for corresponding N and T samples. The rest of the code is identical to the code from above (for plotting multiple lipids' box plots in one x-y plane):
We obtain this, ready for further customization, plot:
We can also keep the lipid species in separate panels:
The output:
Important: Remember that the plots can be easily exported through the "Export" option above your chart in RStudio. Select a directory, file type, and size, and save the plot.
Lipids
spread across columns - under every lipid name a plot was placed (facet_grid() function).Sample code
column - removing T from the sample code for all cancerous parts.