Adding annotations for statistical significance above bar charts, box plots, and dot plots has become standard practice in lipidomics and metabolomics articles. Check out the following manuscripts:
R. Jirásko et al. Altered Plasma, Urine, and Tissue Profiles of Sulfatides and Sphingomyelins in Patients with Renal Cell Carcinoma. DOI: - Fig. 2C & D, Fig. 3, Fig. 4B - D.
E. Rysman et al. De novo Lipogenesis Protects Cancer Cells from Free Radicals and Chemotherapeutics by Promoting Membrane Lipid Saturation. DOI: - Fig. 1A & B, Fig. 3A, Fig. 4A & B, Fig. 6A, C - F.
V. de Laat et al. Intrinsic temperature increase drives lipid metabolism towards ferroptosis evasion and chemotherapy resistance in pancreatic cancer. DOI: - Fig. 1a, b, d, e, h, i, j, and more.
F. Torta et al. Concordant inter-laboratory derived concentrations of ceramides in human plasma reference materials via authentic standards. DOI: - Fig. 3.
J. Wu et al. Lipidomic signatures align with inflammatory patterns and outcomes in critical illness. DOI: - Fig. 1e, Fig. 2f, Fig. 5b & e.
O. Peterka & A. Maccelli et al. HILIC/MS quantitation of low-abundant phospholipids and sphingolipids in human plasma and serum: Dysregulation in pancreatic cancer. DOI: - Fig. 3E.
H. Tsugawa et al. A lipidome landscape of aging in mice. DOI: - e.g., Fig. 8.
Annotations corresponding to statistical significance can be added to ggpubr and ggplot2 charts through the stat_pvalue_manual() function from the ggpubr library. The application of this function is straightforward if hypothesis testing is performed using functions from rstatix library. These functions create tibbles with defined column names, which are expected by the stat_pvalue_manual() - look at the 'data' argument in the documentation of this function. Additionally, we need to define the type of label - whether it should be a p-value, asterisks corresponding to p-value ranges, adjusted p-values, or others. Take a look at the code blocks below to understand the stat_pvalue_manual() function.
First, we will begin with t-test annotations. We create a long tibble and compute t-tests for all features.
# Adding annotations from hypothesis testing to ggpubr / ggplot2 charts.
# First, create a long tibble:
data.long <-
data %>%
select(-`Sample Name`) %>%
filter(Label != "PAN") %>%
pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
droplevels() # We ensure the "PAN" label is removed from grouping variable
# Next, compute a t-test between N and T, adjust p-values, and add asterisks for p.adj:
t.test <-
data.long %>%
group_by(Lipids) %>%
t_test(Concentrations ~ Label) %>%
adjust_pvalue(method = "BH") %>%
add_significance(p.col = "p.adj") # The asterisks will correspond to the adjusted p-value ranges!
print(t.test)
# This is an example of tibble that stat_pvalue_manual() needs - take a look at it.
We obtain the following tibble:
As our tibble with results is ready, we need to perform one more operation. To add annotations, we need to define their x and y positions in the plot. The ggpubr provides a function add_xy_position() that can do this for you.
Let's begin with a simple plot for one selected lipid, e.g., SM 41:1;O2. We filter the 't.test' tibble by rows to keep t-test results only for SM 41:1;O2. Next, we add x and y positions. In the add_xy_position(), we specify scales = "free" to make sure positions are exclusive for every annotation.
# Adding x-y position in the chart for annotations. Single lipid case.
t.test <-
t.test %>%
filter(Lipids == "SM 41:1;O2") %>%
add_xy_position(scales = 'free')
print(t.test)
Now, let's create our first chart with annotations - let's begin with a raw p-value. The updated 't.test' tibble contains now only information for one lipid - SM 41:1;O2. In the stat_pvalue_manual(), we make sure the ggplot2 aesthetics are not inherited (inherit.aes set to FALSE), and that the label will be raw p-value - label is taken from the 'p' column of 't.test' tibble (label = "p").
Let's create box plots for this lipid with annotations:
Now, let's move to multi-panel plots created, e.g., via facet_grid(). Here, it is crucial that you specify scales = "free" in the add_xy_position(). Otherwise, the add_xy_position() function will add non-matching y-positions. Take a look at the example below:
# Computing t-test for all variables:
t.test <-
data.long %>%
group_by(Lipids) %>%
t_test(Concentrations ~ Label) %>%
adjust_pvalue(method = "BH") %>%
add_significance(p.col = "p.adj")
# Selecting results for lipids of interest -- for plotting (row-wise through filter()):
t.test.selected <-
t.test %>%
filter(Lipids == "SM 39:1;O2" |
Lipids == "SM 40:1;O2" |
Lipids == "SM 41:1;O2" |
Lipids == "SM 42:1;O2")
# OR check the number of rows and select them directly, by the row number:
t.test.selected <-
t.test[c(92,93,95,97),]
# Adding x-y position to the 't.test.selected' tibble:
t.test.selected <- t.test.selected %>% add_xy_position(scales = 'free')
# Creating box plots with raw p-values:
data %>%
filter(Label != "PAN") %>%
select(`Label`,
`SM 39:1;O2`,
`SM 40:1;O2`,
`SM 41:1;O2`,
`SM 42:1;O2`) %>%
pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
ggplot(aes(x = `Label`,
y = `Concentrations`,
fill = `Label`)) +
geom_boxplot(outlier.shape = NA) +
geom_point(position = position_jitter(width = 0.1),
shape = 23) +
theme_bw() +
scale_fill_manual(values = c('royalblue', 'red2')) +
facet_grid(~ Lipids, scales = 'free_y') +
stat_pvalue_manual(t.test.selected,
label = "p",
inherit.aes = F)
We obtain:
We would proceed similarly if we would like to place annotations above box plots plotted in one x-y plane (using facet_grid()). See the code below:
Suppose you want to change the annotation labels' height above box plots. The most efficient way is to check the current y.position in your tibble with statistical test results and modify it. Take a look at this example:
# Checking the current y.position of labels in the tibble with statistical results:
t.test.selected$y.position
Comparing the current position of the labels with these values, we see that the labels could be slightly lifted. We can use the mutate() function from the dplyr package to change these values:
# Changing the y.position entries using mutate():
t.test.selected <-
t.test.selected %>%
mutate(y.position = y.position + 3) # We lift all labels by 3 units up.
# Check the new y.position:
t.test.selected$y.position
Now, let's use the omnibus test results to create annotations - e.g., from the ANOVA. In the first step, we compute the test for all variables, adjust p-values for multiple comparisons, and add significance symbols:
According to the documentation, anova_test() returns an object of class anova_test - a data frame containing the ANOVA table for the basic type of ANOVA test. We change it into a tibble to simplify further steps:
# Changing the output from anova_test() into a tibble:
ANOVA <- as_tibble(ANOVA)
We round all p-values in the new 'ANOVA' tibble using p_round():
Now, if you check the output tibble from the last line of code - it still does not contain the following columns: group1, group2, xmin, xmax, and y.position. These columns are necessary to create an annotation by stat_pvalue_manual() and are created by add_xy_position(). Using mutate(), we can easily add all of them. If you want to use add_xy_position(), first, you must add group1 and group2 via mutate().
The group1 should contain N and group2 - T. The xmin and xmax define the annotation position on the x-axis. The y.position is related to the label height above box plots, bar charts, dot plots, etc. The y.position can be estimated based on the maximum concentration measured for each lipid, e.g., the label can be placed at a 20% higher value than the highest value of concentration measured. It is all achieved through the following code lines:
# Computing maximum concentration measured for each lipid:
max <-
data %>%
select(-`Sample Name`) %>%
pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
group_by(Lipids) %>%
summarise(max = max(Concentrations))
# Adding group1, group2, xmin and xmax, and y.position to the ANOVA tibble:
ANOVA <-
ANOVA %>%
mutate(group1 = "N",
group2 = "T",
xmin = 1, # xmin is label no. 1 - N - you can also put "N".
xmax = 3, # xmax is label no. 3 - T - you can also put "T".
y.position = 1.2*max$max) # Let hang the annotations at 1.2 * max concentration
Now, having all the necessary columns in the ANOVA tibble, we can create our box plots with annotations:
# We filter the ANOVA tibble by rows to select lipids for plotting:
ANOVA.selected <-
ANOVA %>%
filter(Lipids == "SM 39:1;O2" |
Lipids == "SM 40:1;O2" |
Lipids == "SM 41:1;O2" |
Lipids == "SM 42:1;O2")
# Creating box plots with annotations for selected lipids:
data %>%
select(`Label`,
`SM 39:1;O2`,
`SM 40:1;O2`,
`SM 41:1;O2`,
`SM 42:1;O2`) %>%
pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
group_by(Lipids) %>%
ggboxplot(x = "Label",
y = "Concentrations",
fill = "Label",
add = "jitter",
add.params = list(shape = 23),
palette = c("royalblue", "orange", "red2")) +
stat_pvalue_manual(ANOVA.selected,
label = "p.adj.signif",
size = 7,
tip.length = 0.0) + # We set tip.length to 0 as ANOVA compares all three groups.
facet_grid(~ Lipids, scales = "free") +
theme(strip.placement = "outside",
strip.background = element_blank(),
panel.spacing.x = unit(0, 'cm'))
Finally, we can create a chart with asterisks corresponding to ranges of adjusted p-values from the ANOVA:
Or with adjusted p-values from the ANOVA test:
# Box plots with adjusted p-values from ANOVA:
data %>%
select(`Label`,
`SM 39:1;O2`,
`SM 40:1;O2`,
`SM 41:1;O2`,
`SM 42:1;O2`) %>%
pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
group_by(Lipids) %>%
ggboxplot(x = "Label",
y = "Concentrations",
fill = "Label",
add = "jitter",
add.params = list(shape = 23),
palette = c("royalblue", "orange", "red2")) +
stat_pvalue_manual(ANOVA.selected,
label = "p.adj", # Here, we need rounded p-values!
size = 5,
tip.length = 0.0) + # We set tip.length to 0 as ANOVA compares all three groups.
facet_grid(~ Lipids, scales = "free") +
theme(strip.placement = "outside",
strip.background = element_blank(),
panel.spacing.x = unit(0, 'cm'))
The plot:
In the case of the Kruskal-Wallis test, after computing the p-values, you can immediately add missing columns through mutate():
# Computing Kruskal-Wallis test:
KW.test <-
data.long %>%
group_by(Lipids) %>%
kruskal_test(Concentrations ~ Label) %>%
adjust_pvalue(method = "BH") %>%
add_significance(p.col = "p.adj") %>%
p_round(digits = 3)
# Finding a maximum concentration measured for each lipid:
max <-
data %>%
select(-`Sample Name`) %>%
pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
group_by(Lipids) %>%
summarise(max = max(Concentrations))
# Adding group1, group2, xmin and xmax, and y.position to the ANOVA tibble:
KW.test <-
KW.test %>%
mutate(group1 = "N",
group2 = "T",
xmin = 1, # xmin is label no. 1 - N - you can also put "N".
xmax = 3, # xmax is label no. 3 - T - you can also put "T".
y.position = 1.2*max$max) # Let hang the annotations at 1.2 * max concentration
# We filter the Kruskal-Wallis tibble by rows to select lipids for plotting:
KW.selected <-
KW.test %>%
filter(Lipids == "SM 39:1;O2" |
Lipids == "SM 40:1;O2" |
Lipids == "SM 41:1;O2" |
Lipids == "SM 42:1;O2")
# Creating box plots for selected lipids with annotations:
data %>%
select(`Label`,
`SM 39:1;O2`,
`SM 40:1;O2`,
`SM 41:1;O2`,
`SM 42:1;O2`) %>%
pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
group_by(Lipids) %>%
ggboxplot(x = "Label",
y = "Concentrations",
fill = "Label",
add = "jitter",
add.params = list(shape = 23),
palette = c("royalblue", "orange", "red2")) +
stat_pvalue_manual(KW.selected,
label = "p.adj",
size = 5,
tip.length = 0.0) + # We set tip.length to 0 as Kruskal-Wallis compares all three groups.
facet_grid(~ Lipids, scales = "free") +
theme(strip.placement = "outside",
strip.background = element_blank(),
panel.spacing.x = unit(0, 'cm'))
And the final plot:
We can also use as annotations the results of the post hoc test. Let's compute the Tukey HSD post hoc test as an example (code would be similar in the case of the Dunn post hoc):
# Computing Tukey HSD post hoc test:
tukey.hsd <-
data %>%
pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
group_by(Lipids) %>%
tukey_hsd(Concentrations ~ Label)
In the next step, from the tibble containing post hoc results for all lipids, we select rows concerning the lipids we would like to plot:
Using add_xy_position(), we can add to the tibble with post hoc results columns that will be used by stat_pvalue_manual() for plotting labels (labels' positions):
tukey.hsd.selected <-
tukey.hsd.selected %>%
add_y_position(scales = 'free') # Again, remember to set scales to "free".
Next, let's create the plot with labels:
# Creating box plots with post-hoc test results:
data %>%
select(`Label`,
`SM 39:1;O2`,
`SM 40:1;O2`,
`SM 41:1;O2`,
`SM 42:1;O2`) %>%
pivot_longer(cols = `SM 39:1;O2`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
group_by(Lipids) %>%
ggboxplot(x = "Label",
y = "Concentrations",
fill = "Label",
add = "jitter",
add.params = list(shape = 23),
palette = c("royalblue", 'orange', "red2")) +
stat_pvalue_manual(tukey.hsd.selected,
label = "p.adj.signif", # We select statistical significance symbols.
size = 6,
hide.ns = T) + # We do not want to plot labels if the difference is not significant.
facet_grid(~ Lipids,
scales = "free_y") +
theme(strip.placement = "outside",
strip.background = element_blank(),
panel.spacing.x = unit(0, 'cm'))
Our plot:
However, the labels overlay. We can fix it but adjusting the y-positions in the tibble 'tukey.hsd.selected'. Let's take a look at the tibble first:
The y-positions of labels annotating significant outcomes from the Tukey HSD test are too close (red frames). The labels for every lipid are arranged in the same order - look at group1 and group2 columns (highlighted in green, orange, and violet frames). This allows us to use the seq() function to select particular rows from the tibble in the y.position column:
# Selecting cells of tukey.hsd.selected:
tukey.hsd.selected[seq(from = 3, to = 12, by = 3), "y.position"]
Using this short line of code, we select every third row starting from row no 3. and continuing until row no. 12 in the column y.position. In fact, we select single cells corresponding to labels connecting PAN and T. See the output from the R console after running this line of code:
We can modify these entries, e.g., shift them 6-7 positions higher:
# Lifting PAN - T labels in tukey.hsd.selected tibble:
tukey.hsd.selected[seq(from = 3, to = 12, by = 3), "y.position"] <-
7 + tukey.hsd.selected[seq(from = 3, to = 12, by = 3), "y.position"]
We can also gently shift the N vs T labels, e.g., by two units up. We will now start modifications to every third entry from row no. 2:
# Lifting N - T labels in the tukey.hsd.selected tibble:
tukey.hsd.selected[seq(from = 2, to = 12, by = 3), "y.position"] <-
2 + tukey.hsd.selected[seq(from = 2, to = 12, by = 3), "y.position"]
NOTE: If you would like to apply changes to all entries, e.g. in the 'tukey.hsd' tibble, in the seq() set the argument to = nrows(<tibble_name>).
If you would like to select y.positions of your preference - you can also substitute the y.position values from add_xy_position() using:
# Setting y positions of your own choice:
tukey.hsd.selected$y.position <- c(...)
# Replace three dots with numbers corresponding to y-positions - separated by commas.
# How many numeric entries should contain this vector?
# The answer: number of entries = number of rows of tibble containing y.position.
If we now run the plot code, we obtain these beautiful box plots:
Publication-ready box plots with annotations from the ggstatsplot library
Elegant and publication-ready box plots with detailed annotations from hypothesis testing ( can be produced using ggstatsplot library's functions. Check the subchapter Basic plotting in R - Box plots to recall how to use this library.
Adding to ggplot2 charts annotations based on PMCMRplus package pairwise-comparisons (advanced)
If to the tibble with posthoc results obtained from looped PMCMRplus package functions, we add columns: xmin, xmax, y.position, and significance symbols, then using stat_pvalue_manual(), we can add statistical annotations in the ggplot2 charts computed through PMCMRplus package. Take a look at the code blocks below. As an example, we will use 'posthoc.Conover' tibble computed in the subchapter Multi-sample comparisons in R.
# Let's add Conover posthoc results to SM 39:1;O2, 40:1;O2, 41:1;O2, 42:1;O2 box plots.
# Quick check of Kruskal-Wallis test results for all lipids above:
kruskalTest(`SM 39:1;O2` ~ Label, data = data)
kruskalTest(`SM 40:1;O2` ~ Label, data = data)
kruskalTest(`SM 41:1;O2` ~ Label, data = data)
kruskalTest(`SM 42:1;O2` ~ Label, data = data)
# We reject the null hypothesis for all lipid species as p-value << 0.05.
# As least one group significantly differs from the rest.
# We proceed to the posthoc test - Conover test.
# The Conover posthoc was computed in Multi-sample comparisons in R - take a look.
##################################################################################
# Now, we have our posthoc.Conover tibble.
# First, we compute the maximum concentration for every of these lipids.
# Based on these concentrations, we assume the height of our label.
max <-
data %>%
select(-`Sample Name`) %>%
pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`,
names_to = "Lipids",
values_to = "Concentrations") %>%
group_by(Lipids) %>%
summarise(max = max(Concentrations))
# The tibble 'max' contains single information for every lipid.
# Posthoc is performed 3x for one lipid.
# We need to repeat every entry 3x before adding it to posthoc.Conover tibble:
max <-
max %>%
slice(rep(1:n(), each = 3))
# We will arrange posthoc.Conover alphabetically using `Lipids` column:
posthoc.Conover <-
posthoc.Conover %>%
arrange(Lipids)
# Now, we can add new columns.
# Adding new columns to posthoc.Conover tibble (computed previously - look at Multi-sample comparisons in R):
posthoc.Conover <-
posthoc.Conover %>%
mutate(xmin = ifelse(group1 == "N",1, ifelse(group1 == "PAN",2,3)),
xmax = ifelse(group2 == "N",1, ifelse(group2 == "PAN",2,3)),
y.position = 1.2*max$max) # Let's initially hang the annotations at 1.2 * max concentration
# Final operation - we need to add significance symbols. We use mutate() with case_when():
posthoc.Conover <-
posthoc.Conover %>%
mutate(., p.signif = with(.,case_when(p.value < 0.0001 ~ "****",
0.0001 < p.value & p.value < 0.001 ~ "***",
0.001 < p.value & p.value < 0.01 ~ "**",
0.01 < p.value & p.value < 0.05 ~ "*",
0.05 < p.value & p.value < 1 ~ "ns")))
# Glimpse at the final tibble with posthoc results:
glimpse(posthoc.Conover)
The glimpse:
Now, let's filter out the results of the posthoc testing for lipids that we would like to plot:
# Selecting results of statistical tests to be plotted as annotations:
posthoc.Conover.filtered <-
posthoc.Conover %>%
filter(Lipids == "SM 39:1;O2" |
Lipids == "SM 40:1;O2" |
Lipids == "SM 41:1;O2" |
Lipids == "SM 42:1;O2")
The tibble with t-test results and column names expected by the stat_pvalue_manual().
Box plots with a raw p-value annotated above through stat_pvalue_manual() from the ggpubr.
Box plots for SM 41:1;O2 with t-test results added as annotations: in the left panel - Benjamini-Hochberg adjusted p-value, in the right panel with asterisks corresponding to the adjusted p-value ranges.
The ggpubr box plots with annotations from hypothesis testing added through stat_pvalue_manual().
The multi-panel ggplot2 box plots with annotations from the t-test added using stat_pvalue_manual() with free scales.
The ggplot2 box plots plotted in one x-y plane through facet_grid(). Raw p-values were added through stat_pvalue_manual().
The ggpubr box plots plotted in one x-y plane through facet_grid(). Raw p-values were added through stat_pvalue_manual().
Adjusting the y.position through mutate - lifting the annotations up by three units.
Box plots presenting concentrations of selected lipids with asterisk annotations corresponding to adjusted p-value ranges from the ANOVA test across all three groups - N, PAN, and T.
Box plots presenting concentrations of selected lipids with adjusted p-values from the ANOVA test across all three groups - N, PAN, and T.
Box plots presenting concentrations of selected lipids with adjusted p-values from the Kruskal-Wallis test across all three groups - N, PAN, and T.
Adding post-hoc test results as annotations above box plots.
A glimpse at the 'tukey.hsd.selected' tibble.
The ggpubr box plots with post hoc test results added through stat_pvalue_manual().
The ggplot2 bar plots with post hoc test results added through stat_pvalue_manual().
The 'posthoc.Conover' tibble ready for plotting annotations in the ggplot2 charts.
The ggplot2 box plots with Conover posthoc test results (PMCMRplus package) before correcting the height of label.
The ggplot2 Tukey box plots with Conover posthoc test results (PMCMRplus package).
The ggplot2 adjusted box plots for skewed distributions with Conover posthoc test results (PMCMRplus package).