Using gtsummary to create publication-ready tables

Metabolites and lipids descriptive statistical analysis in R

Basic gtsummary table

Besides visualizations, tables can be used to present the most interesting alterations in feature levels or to summarize your data set (e.g., clinical parameters). Check out the following example:

D. Zhu et al. Lipidomics Profiling and Risk of Coronary Artery Disease in the BioHEART-CT Discovery Cohort. DOI: https://doi.org/10.3390/biom13060917 - Table 1.

The gtsummary package via the `tbl_summary()` function can create beautiful, publication-ready tables filled with descriptive statistics for the most important features. Here, we will show you how to use gtsummary to prepare these tables. More information about the gtsummary can be found here:

The gtsummary package works with all functions from the tidyverse collection. Let's assume that, based on the exploratory analysis, we already know that our lipidomics data set shows interesting alterations in long-chain sphingomyelin (SM) profiles in PDAC. We want to present these trends in the manuscript as a table. We decided to show the data as a median with an interquartile range (IQR) and apply a non-parametric test to compare all distributions (in this case the Kruskal-Wallis test). A publication-ready table can be prepared using nearly a single line of code(!):

# Installation of gtsummary:
install.packages("gtsummary")

# Calling library:
library(gtsummary)

# Investigate the tbl_summary():
?tbl_summary()

# Creating publication-ready table via tbl_summary:
data %>%
  select(`Label`, `SM 39:1;O2`, starts_with("SM 4")) %>%
  tbl_summary(by = `Label`) %>%
  add_p()
  
# Explanations:
# 1. From the 'data' object in the global environment, select the columns `Label` and `SM 39:1;O2`,
# 2. together with all columns whose names start with 'SM 4'.
# 3. Pipe them to tbl_summary(), which groups by the 'Label' column and computes the median with IQR.
# 4. Pipe the results to add_p(), which compares all groups using the Kruskal-Wallis (KW) test.

The result is this elegant table:

The tbl_summary() is very flexible in terms of customization. Suppose we would like to present results as mean concentration with standard deviation rounded to the first decimal place, and ANOVA test for the comparison of these three means. Also, we do not particularly like the column name: 'Characteristic', and we want to change it to 'SM species' in bold. We can use the following code:

# Preparing table with mean (sd) and ANOVA for every lipid:
data %>%
  select(`Label`, `SM 39:1;O2`, starts_with("SM 4")) %>%
  tbl_summary(by = `Label`,
              statistic = all_continuous() ~ c("{mean} ({sd})"),
              digits = all_continuous() ~ 1) %>%
  add_p(all_continuous() ~ 'aov') %>%
  modify_header(label = "**SM species**")

The output:

The tbl_summary() function is constructed in the following way:

As you may have noticed, a list() is not needed when only one operation is specified. In the example above, only the type of SM 39:1;O2 is changed to 'continuous2', so no list is required. When computing statistics, however, we first define the summary for all_continuous() variables and then the statistics for SM 39:1;O2, which is now a 'continuous2' variable; in this case, both operations must be combined into one list. Based on the figure above, you can also see that the left-hand side defines which variables are affected (red frame) and the right-hand side defines how they are affected. To select the variables, we can use a column name, e.g. SM 39:1;O2, or address all variables of the same type at once via all_continuous(). A tilde symbol, "~", separates which variables are affected from how they are affected:

all_continuous() ~ c("{mean}, {sd}")

Now, look at the right side. The statistics to compute are referenced by name inside curly braces, e.g. {mean} or {sd}, within a character string; several strings can be combined with c(), which matters for continuous2 variables shown in more than one row. The text around the braces defines how the statistics are presented in the table, e.g. the form:

c("{function_1} ({function_2})")

will result in:

output_1 (output_2)

for example:

c("{mean} ({sd})")

is:

mean value (standard deviation) -> 4.7 (1.7)

While:

c("{function_1}, {function_2}")

is:

output_1, output_2

for example:

c("{mean}, {sd}")

is:

mean value, standard deviation -> 4.7, 1.7
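The two patterns above can be tried out in a minimal sketch. It assumes the `trial` example data set that ships with gtsummary (not the lipidomics data used in this tutorial) and that dplyr is installed; the object names are only illustrative.

```r
library(gtsummary)
library(dplyr)

# `trial` is an example data set bundled with gtsummary (an assumption here;
# it stands in for the lipidomics data used in this tutorial).
tbl_brackets <- trial %>%
  select(trt, age, marker) %>%
  tbl_summary(by = trt,
              statistic = all_continuous() ~ "{mean} ({sd})",
              digits = all_continuous() ~ 1,
              missing = "no")

tbl_commas <- trial %>%
  select(trt, age, marker) %>%
  tbl_summary(by = trt,
              statistic = all_continuous() ~ "{mean}, {sd}",
              digits = all_continuous() ~ 1,
              missing = "no")

# The formatted cells are stored as character strings in table_body:
tbl_brackets$table_body$stat_1
tbl_commas$table_body$stat_1
```

Only the text around the curly braces differs between the two calls, and this text is carried over into the table cells unchanged.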

What functions can be used to compute summary statistics? According to the function documentation (?tbl_summary()), there is a long list of summary statistics the tbl_summary() can compute for you for continuous variables, e.g.:

{median} median
{mean} mean
{sd} standard deviation
{var} variance
{min} minimum
{max} maximum
{sum} sum
{p##} any integer percentile, where ## is an integer from 0 to 100
{foo} any function of the form foo(x) is accepted where x is a numeric vector
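The {foo} entry can be illustrated with a user-defined function. This is a sketch only: `cv()` is a hypothetical helper (it is not part of gtsummary), and the bundled `trial` data set is used as a stand-in for the tutorial data.

```r
library(gtsummary)
library(dplyr)

# Hypothetical helper (not part of gtsummary): coefficient of variation in %.
# na.rm = TRUE is set explicitly so the result does not depend on how
# missing values are handled before the function is called.
cv <- function(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

tbl_cv <- trial %>%
  select(trt, age) %>%
  tbl_summary(by = trt,
              statistic = all_continuous() ~ "{mean} (CV {cv}%)",
              missing = "no")

# Each cell now contains the mean followed by the custom statistic:
tbl_cv$table_body$stat_1
```

The function is referenced by its name inside the curly braces, in the same way as the built-in statistics.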

For the categorical variables you can obtain:

{n} frequency
{N} denominator, or cohort size
{p} formatted percentage

Moreover, for categorical and continuous variables, statistics regarding the number of missing and non-missing entries and their proportions can also be presented:

{N_obs} total number of observations
{N_miss} number of missing observations
{N_nonmiss} number of non-missing observations
{p_miss} percentage of observations missing
{p_nonmiss} percentage of observations not missing

One additional important matter: as shown in the example above, the type of a continuous variable, here the concentrations of SM 39:1;O2 stored in the column SM 39:1;O2, can be changed to continuous2. For continuous2 variables, summaries can be shown in two or more table rows, e.g. mean with standard deviation in one row and range (min, max) in another.

Additional summary statistics or statistical tests can be also computed via the add_...() functions, e.g. add_ci() can be used to compute confidence intervals, or add_p() to compute p-value from a statistical test.
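As a hedged sketch of the add_...() idea, the code below appends confidence intervals with add_ci(). It assumes the bundled `trial` data set and the default settings of add_ci(), and it reports means so that the summary statistic and the interval describe the same quantity.

```r
library(gtsummary)
library(dplyr)

# `trial` ships with gtsummary and stands in for the tutorial data (assumption).
tbl_ci <- trial %>%
  select(trt, age, marker) %>%
  tbl_summary(by = trt,
              statistic = all_continuous() ~ "{mean} ({sd})",
              missing = "no") %>%
  add_ci()   # adds one confidence interval column per group

# The new columns appear next to the stat_* columns in table_body:
names(tbl_ci$table_body)
```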

The add_p() function was also used in the examples above. Statistical tests can be selected via the test argument of add_p():

# Changing test type in the add_p() function:
add_p(..., test = 
       list(all_categorical() ~ "chisq.test",
            all_continuous() ~ "wilcox.test"))

Let's assume that we want to compare in the table only medians of long-chain SM concentrations measured for volunteers and patients with pancreatic cancer using the Wilcoxon rank sum test. The concentrations in the table should be shown as median with interquartile range (IQR). Here is the code:

# Adjusting the add_p() function, example:
data %>%
  filter(Label == 'N' | Label == 'T') %>%
  droplevels() %>%                             # WE NEED TO DROP EMPTY FACTOR "PAN"!!!
  select(`Label`, `SM 39:1;O2`, starts_with("SM 4")) %>%
  tbl_summary(by = `Label`) %>%
  add_p(all_continuous() ~ 'wilcox.test')
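Why droplevels() is needed can be shown in base R alone. The vector below is a made-up stand-in for the Label column: filtering removes the observations, but the factor still remembers the unused level.

```r
# Hypothetical Label column with three groups
label <- factor(c("N", "T", "PAN", "N", "T"))

# Keep only 'N' and 'T', as in the filter() step above
kept <- label[label %in% c("N", "T")]

levels(kept)               # the empty level "PAN" is still listed
levels(droplevels(kept))   # only the levels that actually occur remain
```

If the empty level is kept, the grouping variable still has three levels, and a two-sample test such as wilcox.test cannot be applied as intended. With the level dropped, the gtsummary pipeline above compares exactly two groups.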

The output:

Now, one more example - let's assume we want to perform the Wilcoxon rank sum test for all continuous variables except for SM 39:1;O2 - here we want to perform the classic t-test. We need to apply the following modifications to our code:

data %>%
  filter(Label == 'N' | Label == 'T') %>%
  droplevels() %>%
  select(`Label`, `SM 39:1;O2`, starts_with("SM 4")) %>%
  tbl_summary(by = `Label`) %>%
  add_p(test = 
          list(all_continuous() ~ 'wilcox.test',
               `SM 39:1;O2` ~ 't.test'),
        test.args = `SM 39:1;O2` ~ list(var.equal = TRUE))

If test.args did not specify that the variances are assumed equal (var.equal = TRUE), an argument passed on to the function computing the t-test, a Welch test would be performed instead. The output:

More examples of the application of the add_p() function are also presented below.
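One way to check that test.args really reaches the t-test is to compare the p-value stored in the table with a direct call to t.test(). The sketch assumes the bundled `trial` data set, with `age` playing the role of SM 39:1;O2.

```r
library(gtsummary)
library(dplyr)

# `trial` ships with gtsummary; it stands in for the tutorial data (assumption).
tbl <- trial %>%
  select(trt, age, marker) %>%
  tbl_summary(by = trt, missing = "no") %>%
  add_p(test = list(all_continuous() ~ "wilcox.test",
                    age ~ "t.test"),
        test.args = age ~ list(var.equal = TRUE))

# Unformatted p-value stored in the table for `age`
p_table <- tbl$table_body %>%
  filter(variable == "age", row_type == "label") %>%
  pull(p.value)

# The same classic two-sample t-test computed directly
p_direct <- t.test(age ~ trt, data = trial, var.equal = TRUE)$p.value

all.equal(as.numeric(p_table), p_direct)
```

Entries later in the test list override earlier ones, which is why `age` receives the t-test although it is also matched by all_continuous().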

Preparing gtsummary table for the complete patient information (advanced)

Now, let's use a different data set, where more clinical information is available. You can download it here:

Read the data set into R as tibble 'data.ccRCC', and correct variable types (if necessary):

Label should be a factor,
Gender should be a factor,
Age should be a numeric variable,
BMI should be a numeric variable,
Type of tumor should be a factor,
Tumor grade should be a factor,
All lipid concentrations should be numeric.
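The conversions listed above can be sketched on a small made-up data frame. The tibble `toy` and its values are hypothetical and only mimic a spreadsheet in which every column was imported as text.

```r
library(dplyr)

# Hypothetical stand-in for an imported spreadsheet: all columns are text.
toy <- tibble(
  Label  = c("N", "T", "N"),
  Gender = c("F", "M", "F"),
  Age    = c("54", "61", "47"),
  BMI    = c("24.1", "29.3", "26.0")
)

# Convert grouping columns to factors and measurements to numeric
toy_typed <- toy %>%
  mutate(across(c(Label, Gender), as.factor),
         across(c(Age, BMI), as.numeric))

# Inspect the resulting column classes
sapply(toy_typed, class)
```

Checking the classes before calling tbl_summary() is worthwhile because the summary type of each variable is derived from them.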

Now, we want to summarize patients' information in a gtsummary table:

# Reading the data set into R:
data.ccRCC <- readxl::read_xlsx(file.choose())

# Adjusting column types:
data.ccRCC$Label <- as.factor(data.ccRCC$Label)
data.ccRCC$Gender <- as.factor(data.ccRCC$Gender)
data.ccRCC$`Type of tumor`<- as.factor(data.ccRCC$`Type of tumor`)
data.ccRCC$`Tumor grade` <- as.factor(data.ccRCC$`Tumor grade`)

# Creating summary table for patients' information:
data.ccRCC %>%
  select(`Label`,
         `Gender`,
         `Age`,
         `BMI`,
         `Type of tumor`,
         `Tumor grade`,
         `Collected samples`) %>%
  tbl_summary(by = `Label`,
              statistic = all_continuous() ~ c("{mean} ({sd})"),
              digits = all_continuous() ~ 1)

The simple tbl_summary() function produces this output:

This table looks good, but it could benefit from further adjustments, e.g.:

age and BMI provided only as mean and standard deviation could be characterized better, e.g., we could add at least the range of values for both parameters,
the 'Characteristic' name of the first column is back,
for continuous2 variables - no statistical test was applied to compare samples of populations - the t-test should be applied here,
also, we would be interested if Pearson’s Chi-squared test will detect differences in the numbers of male and female participants (carefully! - it is a dichotomous variable),
no tests should be performed for Type of tumor, Tumor grade, and Collected samples,
if a statistical test is considered - we would also like to see the p-value corrected for multiple comparisons (Bonferroni correction),
all labels should be in bold to clearly differentiate them from the rest of the entries,
let's assume that we are not interested in the number of missing entries,
for Collected samples - without a description of P, U, and T, the reader can only guess which sample types these annotations refer to - we need to modify the footnote,
for healthy volunteers - as we do not expect any entries in the Type of tumor or Tumor grade - we could remove the zero values.

Now, let's turn all these remarks into code step by step. Complementary to tbl_summary(), the add_...() and modify_...() families of functions allow extending the table and modifying its content. We will use some of them here.

We will begin by adding the ranges to means and standard deviations for Age and BMI. We need to change the type of Age and BMI from continuous to continuous2, so additional rows in the table can be added for these variables. Then, in the c() we can add additional functions we would like to use for these variables and a form in which the values should appear in the table. The code after modifications:

# Adding range to `Age` and `BMI`:
data.ccRCC %>%
  select(`Label`,
         `Gender`,
         `Age`,
         `BMI`,
         `Type of tumor`,
         `Tumor grade`,
         `Collected samples`) %>%
  tbl_summary(by = `Label`,
              type = list(`Age` ~ 'continuous2',
                          `BMI` ~ 'continuous2'),
              statistic = list(`Age` ~ c("{mean} ({sd})", "{min}, {max}"),
                               `BMI` ~ c("{mean} ({sd})", "{min}, {max}")),
              digits = all_continuous() ~ 1)

And the output:

In the next step, the 'Characteristic' header should be replaced with 'Information'. Here, the modify_... family of functions is useful, modify_header() to be precise. We need to access the label column header, change it, and bold it:

data.ccRCC %>%
  select(`Label`,
         `Gender`,
         `Age`,
         `BMI`,
         `Type of tumor`,
         `Tumor grade`,
         `Collected samples`) %>%
  tbl_summary(by = `Label`,
              type = list(`Age` ~ 'continuous2',
                          `BMI` ~ 'continuous2'),
              statistic = list(`Age` ~ c("{mean} ({sd})", "{min}, {max}"),
                               `BMI` ~ c("{mean} ({sd})", "{min}, {max}")),
              digits = all_continuous() ~ 1) %>% 
  modify_header(label = '**Information**')

The application of ** before and after the new header title will bold it. The output:

Next, the statistical test results should be added. For all continuous2 variables, the p-value from a t-test should be shown. For Gender, Pearson’s Chi-squared test should be performed. For the remaining variables, no statistical test should be performed. If we do not exclude these variables from testing, the add_p() function will select and perform tests automatically. The obtained p-values should also be presented with the Bonferroni correction.

Adding additional columns with statistics can be achieved through add_... functions, here: add_p() and add_q(). In the add_p() function, we need to specify the t-test for all continuous2 variables and Pearson’s Chi-squared test for all dichotomous variables, and what variables should be excluded from the testing; test types are specified in the following way:

all_continuous2() ~ 't.test'

all_dichotomous() ~ 'chisq.test'

If we want to specify more than one test type in add_p(), we need to additionally use a test argument in the add_p() function and add all tests as a list():

add_p(..., test = list(<here specify all tests>))

for example:

add_p(..., test = list(all_continuous2() ~ 't.test', all_dichotomous() ~ 'chisq.test'))

We need to characterize the type of t-test. In this case, we will use a classic two-sample t-test, so we need to add the argument test.args in the add_p() function:

add_p(..., test.args = all_continuous2() ~ list(var.equal = TRUE))

Finally, to exclude variables from testing we need to use the include argument, here:

add_p(..., include = -c(`Type of tumor`, `Collected samples`, `Tumor grade`))

To add the corrected p-value, we pipe the output to add_q() function. In the add_q() function, we can select the correction method through the method argument. We select 'bonferroni':

add_q(method = 'bonferroni')
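What the Bonferroni correction does to the p-values can be seen with base R's p.adjust(). The three p-values below are invented for illustration.

```r
# Hypothetical p-values from three tests
p_values <- c(0.004, 0.03, 0.2)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
q_values <- p.adjust(p_values, method = "bonferroni")
q_values
```

With three tests, each p-value is multiplied by three, so 0.004 becomes 0.012, 0.03 becomes 0.09, and 0.2 becomes 0.6.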

In summary, the code will look like this:

data.ccRCC %>%
  select(`Label`,
         `Gender`,
         `Age`,
         `BMI`,
         `Type of tumor`,
         `Tumor grade`,
         `Collected samples`) %>%
  tbl_summary(by = `Label`,
              type = list(`Age` ~ 'continuous2',
                          `BMI` ~ 'continuous2'),
              statistic = list(`Age` ~ c("{mean} ({sd})", "{min}, {max}"),
                               `BMI` ~ c("{mean} ({sd})", "{min}, {max}")),
              digits = all_continuous() ~ 1) %>% 
  modify_header(label = '**Information**') %>%
  add_p(include = -c(`Type of tumor`, `Tumor grade`, `Collected samples`),
        test = list(all_continuous2() ~ 't.test',
                    all_dichotomous() ~ 'chisq.test'),
        test.args = all_continuous2() ~ list(var.equal = TRUE)) %>%
  add_q(method = 'bonferroni')

And the output:

In one step, we will bold all labels (simply pipe the output to bold_labels()) and hide the rows reporting the number of missing observations (missing argument in tbl_summary() set to 'no'):

data.ccRCC %>%
  select(`Label`,
         `Gender`,
         `Age`,
         `BMI`,
         `Type of tumor`,
         `Tumor grade`,
         `Collected samples`) %>%
  tbl_summary(by = `Label`,
              type = list(`Age` ~ 'continuous2',
                          `BMI` ~ 'continuous2'),
              statistic = list(`Age` ~ c("{mean} ({sd})", "{min}, {max}"),
                               `BMI` ~ c("{mean} ({sd})", "{min}, {max}")),
              digits = all_continuous() ~ 1,
              missing = 'no') %>% 
  modify_header(label = '**Information**') %>%
  add_p(include = -c(`Type of tumor`, `Tumor grade`, `Collected samples`),
        test = list(all_continuous2() ~ 't.test',
                    all_dichotomous() ~ 'chisq.test'),
        test.args = all_continuous2() ~ list(var.equal = TRUE)) %>%
  add_q(method = 'bonferroni') %>%
  bold_labels()

The output:

Next, we want to modify the footnote to add the description of "P", "T", and "U". This is the first more complex task. We need to reference a specific cell of the final table so that we can attach the additional description as a footnote. The cell of interest is the one containing the label 'Collected samples'. The function that helps here is modify_table_styling(). According to its documentation, the arguments of this function give access to the tibble 'table_body', which is later printed as the final, publication-ready table. Let's save our final output from tbl_summary() as 'table'. A new object appears in the global environment, and it is a list of 8. Now we can type:

table$table_body

The output:

To reference the cell containing the label 'Collected samples' in the table_body, we tell the modify_table_styling() function that we are interested in the column label and that, in this column, we need the row containing the label 'Collected samples'. The rows argument takes a predicate expression that evaluates to TRUE or FALSE for every row, i.e. label == 'Collected samples' returns TRUE for the row of interest and FALSE otherwise. Here is the code:

... %>%
modify_table_styling(columns = label,
                     rows = label == "Collected samples")

If TRUE is returned (i.e., we now reference our cell of interest), the footnote will be updated with the string supplied via the footnote argument:

... %>%
modify_table_styling(columns = label,
                     rows = label == "Collected samples",
                     footnote = "P - plasma, T - tissue, U - urine")
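How the rows predicate picks a cell can be checked by evaluating the same expression on table_body directly. This is a sketch that assumes the bundled `trial` data set, where the label 'Grade' plays the role of 'Collected samples'.

```r
library(gtsummary)
library(dplyr)

# `trial` ships with gtsummary and stands in for data.ccRCC (assumption).
tbl <- trial %>%
  select(trt, age, grade) %>%
  tbl_summary(by = trt, missing = "no")

# Evaluate the predicate row by row, as the rows argument would
row_check <- tbl$table_body %>%
  transmute(variable, row_type, label, selected = label == "Grade")

row_check
```

Only the row holding the variable label returns TRUE, so a footnote attached with this predicate lands on that single cell and not on the level rows below it.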

The updated code to modify the footnote of our final table:

data.ccRCC %>%
  select(`Label`,
         `Gender`,
         `Age`,
         `BMI`,
         `Type of tumor`,
         `Tumor grade`,
         `Collected samples`) %>%
  tbl_summary(by = `Label`,
              type = list(`Age` ~ 'continuous2',
                          `BMI` ~ 'continuous2'),
              statistic = list(`Age` ~ c("{mean} ({sd})", "{min}, {max}"),
                               `BMI` ~ c("{mean} ({sd})", "{min}, {max}")),
              digits = all_continuous() ~ 1,
              missing = 'no') %>% 
  modify_header(label = '**Information**') %>%
  add_p(include = -c(`Type of tumor`, `Tumor grade`, `Collected samples`),
        test = list(all_continuous2() ~ 't.test',
                    all_dichotomous() ~ 'chisq.test'),
        test.args = all_continuous2() ~ list(var.equal = TRUE)) %>%
  add_q(method = 'bonferroni') %>%
  bold_labels() %>%
  modify_table_styling(
    columns = label,
    rows = label == "Collected samples",
    footnote = "P - plasma, T - tissue, U - urine")

The output of this code:

Finally, we would like to remove the zero entries in Type of tumor and Tumor grade. We will substitute these entries with a minus symbol "-", meaning - no observations. Now, we need to introduce changes in the table body. We can do it via modify_table_body() function. Based on the documentation, this function can be used together with dplyr functions (like arrange(), mutate(), etc.) to introduce changes in the table body in the following way:

modify_table_body(
  ~ .x %>%
    <dplyr_function>
)

# For example:
modify_table_body(
  ~ .x %>%
    arrange(variable)
)
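A runnable variant of this template is sketched below. It assumes the bundled `trial` data set, and the choice to prefix the level rows of `grade` is purely illustrative.

```r
library(gtsummary)
library(dplyr)

# `trial` ships with gtsummary and stands in for data.ccRCC (assumption).
tbl_mod <- trial %>%
  select(trt, age, grade) %>%
  tbl_summary(by = trt, missing = "no") %>%
  modify_table_body(
    # .x is the table_body tibble; any dplyr verb can be applied to it
    ~ .x %>%
      mutate(label = ifelse(variable == "grade" & row_type == "level",
                            paste("Grade", label),
                            label))
  )

# The level rows of `grade` now carry the prefix:
tbl_mod$table_body$label
```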

Here, we will need to use it together with mutate() across all columns containing statistical summaries (all_stat_cols() from gtsummary). Now, we need a tool that can recognize every string in these columns that starts with 0 and replace the whole cell with "-". Such a function is gsub(). The gsub() function application is relatively simple:

gsub(pattern, replacement, x)   # x: the character vector to search

The gsub() function understands regular expressions, also known as regex. Regular expressions are sequences of characters describing patterns in text. In the regex ^0.* the caret (^) anchors the match to the start of the string, 0 matches the digit zero, and .* matches any number of any characters up to the end of the string. The pattern therefore matches the entire content of every cell that begins with 0, such as "0 (0%)". Note that it would also match a cell such as "0.5 (0.1)", so it should only be used when no genuine statistic in these columns starts with 0.

Therefore, our gsub function can be modified to:

gsub("^0.*", "-", x)
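The behavior of this pattern can be tested on a few made-up cell values before it is applied to the table. The second, stricter pattern is our own suggestion and not part of the original code; it assumes that empty categories are printed exactly as "0 (0%)".

```r
# Hypothetical cell contents from the statistics columns
cells <- c("0 (0%)", "12 (40%)", "0.5 (0.1)", "10 (33%)")

# Pattern from the text: replaces every cell that starts with 0
loose <- gsub("^0.*", "-", cells)
loose

# Stricter alternative (an assumption about the cell format):
# replaces only cells that are exactly "0 (0%)"
strict <- gsub("^0 \\(0%\\)$", "-", cells)
strict
```

The first call also turns "0.5 (0.1)" into "-", while the second leaves it untouched, which may be the safer choice when continuous statistics close to zero are present.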

If we implement the gsub() function into mutate() and modify_table_body() we obtain:

modify_table_body(
    ~.x %>% 
      mutate(across(all_stat_cols(), ~gsub("^0.*", "-", .x)))
    )

And merging it with our code into a final form:

data.ccRCC %>%
  select(`Label`,
         `Gender`,
         `Age`,
         `BMI`,
         `Type of tumor`,
         `Tumor grade`,
         `Collected samples`) %>%
  tbl_summary(by = `Label`,
              type = list(`Age` ~ 'continuous2',
                          `BMI` ~ 'continuous2'),
              statistic = list(`Age` ~ c("{mean} ({sd})", "{min}, {max}"),
                               `BMI` ~ c("{mean} ({sd})", "{min}, {max}")),
              digits = all_continuous() ~ 1,
              missing = 'no') %>% 
  modify_header(label = '**Information**') %>%
  add_p(include = -c(`Type of tumor`, `Tumor grade`, `Collected samples`),
        test = list(all_continuous2() ~ 't.test',
                    all_dichotomous() ~ 'chisq.test'),
        test.args = all_continuous2() ~ list(var.equal = TRUE)) %>%
  add_q(method = 'bonferroni') %>%
  bold_labels() %>%
  modify_table_styling(
    columns = label,
    rows = label == "Collected samples",
    footnote = "P - plasma, T - tissue, U - urine") %>%
  modify_table_body(
    ~.x %>% 
      mutate(across(all_stat_cols(), ~gsub("^0.*", "-", .x)))
    )

Our final table looks like this:

Exporting the final table with a gt package

To export the final table from RStudio, we will need to install the package called 'gt'.

# Installing gt package
install.packages('gt')

# Calling the library
library(gt)

We can create the following formats in our working directory:

.html
.png
.pdf
.tex, .rnw
.rtf
.docx

To save your table, add these lines of code:

... %>%
  as_gt() %>%
  gt::gtsave(filename = 'table.png') # Under filename, specify the name and format
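A self-contained sketch of the export step is shown below. It assumes the bundled `trial` data set and writes an .html file to a temporary folder, so nothing is added to the working directory; to our knowledge, .png and .pdf output additionally requires the webshot2 package.

```r
library(gtsummary)
library(dplyr)

# Temporary output path (illustrative; any file name with a supported
# extension can be used instead)
out_file <- file.path(tempdir(), "table_example.html")

trial %>%
  select(trt, age, grade) %>%
  tbl_summary(by = trt, missing = "no") %>%
  as_gt() %>%                        # convert the gtsummary table to a gt table
  gt::gtsave(filename = out_file)    # the file extension selects the format

file.exists(out_file)
```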

For our table from the previous example, we would add:

data.ccRCC %>%
  select(`Label`,
         `Gender`,
         `Age`,
         `BMI`,
         `Type of tumor`,
         `Tumor grade`,
         `Collected samples`) %>%
  tbl_summary(by = `Label`,
              type = list(`Age` ~ 'continuous2',
                          `BMI` ~ 'continuous2'),
              statistic = list(`Age` ~ c("{mean} ({sd})", "{min}, {max}"),
                               `BMI` ~ c("{mean} ({sd})", "{min}, {max}")),
              digits = all_continuous() ~ 1,
              missing = 'no') %>% 
  modify_header(label = '**Information**') %>%
  add_p(include = -c(`Type of tumor`, `Tumor grade`, `Collected samples`),
        test = list(all_continuous2() ~ 't.test',
                    all_dichotomous() ~ 'chisq.test'),
        test.args = all_continuous2() ~ list(var.equal = TRUE)) %>%
  add_q(method = 'bonferroni') %>%
  bold_labels() %>%
  modify_table_styling(
    columns = label,
    rows = label == "Collected samples",
    footnote = "P - plasma, T - tissue, U - urine") %>%
  modify_table_body(
    ~.x %>% 
      mutate(across(all_stat_cols(), ~gsub("^0.*", "-", .x)))
    ) %>%
  as_gt() %>%
  gt::gtsave(filename = 'table_example.png')

The file will be saved in your working directory (wd). You can always check your current working directory in this way:

# Checking your current working directory (wd):
getwd()

Additional references

We highly recommend watching this lecture by Daniel D. Sjoberg:

We also recommend reading and citing the paper published in The R Journal:
