Using gtsummary to create publication-ready tables
Metabolites and lipids descriptive statistical analysis in R
Besides visualizations, tables can be used to present the most interesting alterations in feature levels or to summarize your data set (e.g., clinical parameters). Check out the following example:
D. Zhu et al. Lipidomics Profiling and Risk of Coronary Artery Disease in the BioHEART-CT Discovery Cohort. DOI: - Table 1.
The gtsummary package, through its `tbl_summary()` function, creates publication-ready tables filled with descriptive statistics for the most important features. Here, we show how to use gtsummary to prepare these tables. More information about gtsummary can be found here:
gtsummary works well together with the tidyverse packages. Let's assume that, based on the exploratory analysis, we already know that our lipidomics data set shows interesting alterations in long-chain sphingomyelin (SM) profiles in PDAC. We want to present these trends in the manuscript as a table. We decided to show the data as a median with an interquartile range (IQR) and to apply a non-parametric test to compare all distributions (in this case, the Kruskal-Wallis test). A publication-ready table can be prepared with nearly a single line of code(!):
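A minimal sketch of such a call is given here. The data frame `toy`, its group codes, and its simulated concentrations are assumptions made for illustration and are not the tutorial's data set.

```r
library(gtsummary)

# Hypothetical toy data: group codes and concentrations are simulated
set.seed(1)
toy <- data.frame(
  Label = factor(rep(c("N", "PAN", "T"), each = 20)),
  `SM 39:1;O2` = rlnorm(60, meanlog = 1.5, sdlog = 0.3),
  `SM 41:1;O2` = rlnorm(60, meanlog = 2.0, sdlog = 0.3),
  check.names = FALSE
)

# Median (Q1, Q3) per group plus a Kruskal-Wallis p-value for each SM species
tbl_sm <- toy |>
  tbl_summary(
    by = Label,
    statistic = all_continuous() ~ "{median} ({p25}, {p75})"
  ) |>
  add_p(test = all_continuous() ~ "kruskal.test")

# Typing tbl_sm at the console renders the table
```

The statistic and the test are spelled out here for clarity, even though they match what gtsummary would choose for continuous variables compared across three groups.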
The result is this elegant table:
The tbl_summary() function is highly customizable. Suppose we would like to present the results as mean concentration with standard deviation, rounded to the first decimal place, and use an ANOVA test to compare the three means. We also do not particularly like the column name 'Characteristic' and want to change it to 'SM species' in bold. We can use the following code:
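One way to express these customizations is sketched below on a simulated data frame (`toy`, with invented group codes), so the exact arguments used in the original listing may differ.

```r
library(gtsummary)

# Hypothetical toy data: group codes and concentrations are simulated
set.seed(1)
toy <- data.frame(
  Label = factor(rep(c("N", "PAN", "T"), each = 20)),
  `SM 39:1;O2` = rlnorm(60, meanlog = 1.5, sdlog = 0.3),
  `SM 41:1;O2` = rlnorm(60, meanlog = 2.0, sdlog = 0.3),
  check.names = FALSE
)

tbl_mean <- toy |>
  tbl_summary(
    by = Label,
    statistic = all_continuous() ~ "{mean} ({sd})",  # mean (standard deviation)
    digits = all_continuous() ~ 1                    # one decimal place
  ) |>
  add_p(test = all_continuous() ~ "aov") |>          # one-way ANOVA
  modify_header(label ~ "**SM species**")            # ** ... ** makes the header bold
```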
The output:
The tbl_summary() function is constructed in the following way:
As you may have noticed, a list() is not necessary when only one operation is performed. In the example above, only the type of SM 39:1;O2 is changed to 'continuous2', so no list is needed. In turn, when computing statistics, we first define the summary for all_continuous() variables and then the statistics for SM 39:1;O2, which is now a 'continuous2' variable; in this case, both operations must be merged into one list. Based on the figure above, the left side defines which variables are affected (red frame), and the right side defines how they are affected. To select the variables, we can use a column name, e.g. SM 39:1;O2, or address all variables of the same type at once via all_continuous(). A tilde symbol, "~", separates which variables are affected from how they are affected:
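The construction described above can be sketched as follows. The data frame `toy` is simulated and the chosen statistics are only an illustration of the "selector ~ value" pattern.

```r
library(gtsummary)

# Hypothetical toy data: group codes and concentrations are simulated
set.seed(1)
toy <- data.frame(
  Label = factor(rep(c("N", "PAN", "T"), each = 20)),
  `SM 39:1;O2` = rlnorm(60, meanlog = 1.5, sdlog = 0.3),
  `SM 41:1;O2` = rlnorm(60, meanlog = 2.0, sdlog = 0.3),
  check.names = FALSE
)

tbl_struct <- toy |>
  tbl_summary(
    by = Label,
    # one operation only: no list() required
    type = `SM 39:1;O2` ~ "continuous2",
    # two operations: merged into one list(); the later entry overrides
    # the earlier one for SM 39:1;O2
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      `SM 39:1;O2` ~ c("{mean} ({sd})", "{min}, {max}")
    )
  )
```

Left of each tilde is the variable selector, right of it the value applied to the selected variables.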
Now, look at the right side. To compute more than one type of statistic, we define all of them in c(). Statistics in tbl_summary() are given in curly brackets, e.g. {mean} or {sd}, and the text around the brackets defines how the values are presented in the table. A pattern of the form {stat_1} ({stat_2}) results in:
output_1 (output_2)
for example, {mean} ({sd}) gives:
mean value (standard deviation) -> 4.7 (1.7)
whereas a pattern of the form {stat_1}, {stat_2} results in:
output_1, output_2
for example, {mean}, {sd} gives:
mean value, standard deviation -> 4.7, 1.7
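To see both patterns side by side, the sketch below summarizes one invented concentration vector twice; the values are hypothetical and serve only to show how the text around the curly brackets is carried into the table cell.

```r
library(gtsummary)

# Hypothetical concentrations for a single SM species
sm <- data.frame(
  `SM 39:1;O2` = c(3.1, 4.4, 6.9, 2.8, 5.0, 6.2, 4.1, 5.3, 3.9, 5.6, 4.8, 4.6),
  check.names = FALSE
)

# Helper (illustrative only): summarize with a given pattern, return the cell text
summarize_as <- function(pattern) {
  tbl <- tbl_summary(
    sm,
    type = `SM 39:1;O2` ~ "continuous",
    statistic = all_continuous() ~ pattern,
    digits = all_continuous() ~ 1
  )
  tbl$table_body$stat_0
}

cell_paren <- summarize_as("{mean} ({sd})")  # "mean (sd)" layout
cell_comma <- summarize_as("{mean}, {sd}")   # "mean, sd" layout
```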
What functions can be used to compute summary statistics? According to the function documentation (?tbl_summary), tbl_summary() can compute a long list of summary statistics for continuous variables, e.g.:
{median}: median
{mean}: mean
{sd}: standard deviation
{var}: variance
{min}: minimum
{max}: maximum
{sum}: sum
{p##}: any integer percentile, where ## is an integer from 0 to 100
{foo}: any function of the form foo(x) is accepted, where x is a numeric vector
For the categorical variables you can obtain:
{n}: frequency
{N}: denominator, or cohort size
{p}: formatted percentage
Moreover, for categorical and continuous variables, statistics regarding the number of missing and non-missing entries and their proportions can also be presented:
{N_obs}: total number of observations
{N_miss}: number of missing observations
{N_nonmiss}: number of non-missing observations
{p_miss}: percentage of observations missing
{p_nonmiss}: percentage of observations not missing
One more important point: as shown in the example above, a continuous variable (here, the concentrations of SM 39:1;O2 stored in the column SM 39:1;O2) can be changed to a continuous2 variable. For continuous2 variables, summaries can be shown in two or more table rows, e.g. mean with standard deviation in one row and range (min, max) in another.
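As a sketch of a multi-row summary, the example below shows three rows for one variable, including a row built from the missing-data statistics listed earlier. The data are simulated, and two values are set to NA on purpose.

```r
library(gtsummary)

# Hypothetical toy data with two deliberately missing concentrations
set.seed(1)
toy <- data.frame(
  Label = factor(rep(c("N", "PAN", "T"), each = 20)),
  `SM 39:1;O2` = rlnorm(60, meanlog = 1.5, sdlog = 0.3),
  check.names = FALSE
)
toy[c(3, 25), "SM 39:1;O2"] <- NA

tbl_rows <- toy |>
  tbl_summary(
    by = Label,
    type = `SM 39:1;O2` ~ "continuous2",
    statistic = `SM 39:1;O2` ~ c(
      "{mean} ({sd})",          # row 1: mean (standard deviation)
      "{min}, {max}",           # row 2: range
      "{N_miss} ({p_miss}%)"    # row 3: missing entries
    ),
    missing = "no"              # suppress the default "Unknown" row
  )
```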
Additional summary statistics or statistical tests can also be computed via the add_...() functions, e.g. add_ci() computes confidence intervals and add_p() computes the p-value from a statistical test.
The add_p() function was also used in the examples above. Statistical tests can be selected via the test argument of add_p():
Let's assume that we want the table to compare only the medians of long-chain SM concentrations measured for volunteers and patients with pancreatic cancer, using the Wilcoxon rank sum test. The concentrations in the table should be shown as median with interquartile range (IQR). Here is the code:
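A possible version of this comparison is sketched below. The group codes "N" (volunteers) and "T" (patients) are assumptions, as is the simulated data frame `toy`.

```r
library(gtsummary)

# Hypothetical toy data: group codes and concentrations are simulated
set.seed(1)
toy <- data.frame(
  Label = factor(rep(c("N", "PAN", "T"), each = 20)),
  `SM 39:1;O2` = rlnorm(60, meanlog = 1.5, sdlog = 0.3),
  `SM 41:1;O2` = rlnorm(60, meanlog = 2.0, sdlog = 0.3),
  check.names = FALSE
)

# Keep two groups only and drop the now-unused factor level
two_groups <- droplevels(toy[toy$Label %in% c("N", "T"), ])

tbl_wilcox <- two_groups |>
  tbl_summary(
    by = Label,
    statistic = all_continuous() ~ "{median} ({p25}, {p75})"
  ) |>
  add_p(test = all_continuous() ~ "wilcox.test")  # Wilcoxon rank sum test
```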
The output:
Now, one more example - let's assume we want to perform the Wilcoxon rank sum test for all continuous variables except for SM 39:1;O2 - here we want to perform the classic t-test. We need to apply the following modifications to our code:
If test.args does not specify that the variances are assumed equal (var.equal = TRUE), an argument that is passed on to the function computing the t-test, a Welch test is performed instead. The output:
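The mixed-test setup can be sketched as follows, again on simulated data with assumed group codes. The `all_tests()` selector targets every variable that is assigned the named test.

```r
library(gtsummary)

# Hypothetical toy data restricted to two groups
set.seed(1)
two_groups <- data.frame(
  Label = factor(rep(c("N", "T"), each = 20)),
  `SM 39:1;O2` = rlnorm(40, meanlog = 1.5, sdlog = 0.3),
  `SM 41:1;O2` = rlnorm(40, meanlog = 2.0, sdlog = 0.3),
  check.names = FALSE
)

tbl_mixed <- two_groups |>
  tbl_summary(by = Label) |>
  add_p(
    test = list(
      all_continuous() ~ "wilcox.test",  # applied to every continuous variable
      `SM 39:1;O2` ~ "t.test"            # overridden for this one variable
    ),
    # classic two-sample t-test; leaving this out gives the Welch test
    test.args = all_tests("t.test") ~ list(var.equal = TRUE)
  )
```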
More examples of the application of the add_p() function are also presented below.
Now, let's use a different data set, where more clinical information is available. You can download it here:
Read the data set into R as tibble 'data.ccRCC', and correct variable types (if necessary):
Label should be a factor,
gender should be a factor,
Age should be a numeric variable,
BMI should be a numeric variable,
Type of tumor should be a factor,
Tumor grade should be a factor,
All lipid concentrations should be numeric.
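The type corrections listed above could look like the sketch below. Because the file itself is not reproduced here, a tiny stand-in tibble is built by hand; its values, the spelling `Gender`, and the lipid column are assumptions.

```r
library(dplyr)

# Hypothetical stand-in for the imported data: every column starts as text
data.ccRCC <- tibble(
  Label = c("N", "T", "T", "N"),
  Gender = c("F", "M", "M", "F"),
  Age = c("54", "61", "47", "66"),
  BMI = c("24.1", "29.3", "27.8", "22.5"),
  `Type of tumor` = c(NA, "type A", "type B", NA),
  `Tumor grade` = c(NA, "G2", "G3", NA),
  `SM 41:1;O2` = c("10.2", "7.9", "8.4", "11.0")
)

data.ccRCC <- data.ccRCC |>
  mutate(
    # grouping and clinical categories become factors
    across(c(Label, Gender, `Type of tumor`, `Tumor grade`), as.factor),
    # age, BMI, and all lipid concentrations become numeric
    across(c(Age, BMI, starts_with("SM ")), as.numeric)
  )
```

With real data, the selector for the lipid columns would have to match the actual column names.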
Now, we want to summarize patients' information in a gtsummary table:
The simple tbl_summary() function produces this output:
This table looks good, but it could benefit from further adjustments, e.g.:
age and BMI, provided only as mean and standard deviation, could be characterized better, e.g. by adding at least the range of values for both parameters,
the 'Characteristic' name of the first column is back,
for continuous2 variables, no statistical test was applied to compare the groups; the t-test should be applied here,
we would also like to know whether Pearson’s Chi-squared test detects differences in the numbers of male and female participants (careful: it is a dichotomous variable),
no tests should be performed for Type of tumor, Tumor grade, and Collected samples,
if a statistical test is performed, we would also like to see the p-value corrected for multiple comparisons (Bonferroni correction),
all labels should be in bold to clearly differentiate them from the rest of the entries,
let's assume that we are not interested in the number of missing entries,
for Collected samples, without a description of P, U, and T we can only guess which sample types these annotations refer to; we need to modify the footnote,
for healthy volunteers, as we do not expect any entries in Type of tumor or Tumor grade, we could remove the zero values.
Now, let's turn all these remarks into code step by step. The add_...() and modify_...() families of functions complement tbl_summary(): they extend the table and modify its content. We will use some of them here.
We begin by adding the ranges to the means and standard deviations for Age and BMI. We need to change the type of Age and BMI from continuous to continuous2, so that additional rows can be added to the table for these variables. Then, in c(), we can add the additional statistics we would like to compute for these variables and the form in which the values should appear in the table. The code after modifications:
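This step might look like the sketch below, which uses an invented clinical data frame `clin` in place of the downloaded data set.

```r
library(gtsummary)

# Hypothetical clinical toy data (all values invented for illustration)
set.seed(2)
clin <- data.frame(
  Label = factor(rep(c("N", "T"), each = 15)),
  Gender = factor(rep(c("F", "M"), length.out = 30)),
  Age = round(rnorm(30, mean = 60, sd = 8)),
  BMI = round(rnorm(30, mean = 27, sd = 3), 1)
)

tbl_range <- clin |>
  tbl_summary(
    by = Label,
    type = c(Age, BMI) ~ "continuous2",   # allow several rows per variable
    statistic = all_continuous2() ~ c(
      "{mean} ({sd})",                    # row 1: mean (standard deviation)
      "{min} - {max}"                     # row 2: range
    )
  )
```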
And the output:
In the next step, the 'Characteristic' header should be replaced with 'Information'. Here, the modify_... family of functions is useful, modify_header() to be precise. We need to access the label column, change its header, and make it bold:
Wrapping the new header title in ** (before and after) makes it bold. The output:
Next, the statistical test results should be added: for all continuous2 variables, the p-value from a t-test; for Gender, Pearson’s Chi-squared test; for the remaining variables, no statistical test. If we do not exclude variables from testing, the add_p() function selects and performs tests automatically. The obtained p-values should also be presented with the Bonferroni correction.
Adding additional columns with statistics can be achieved through add_... functions, here: add_p() and add_q(). In the add_p() function, we need to specify the t-test for all continuous2 variables and Pearson’s Chi-squared test for all dichotomous variables, and what variables should be excluded from the testing; test types are specified in the following way:
or
If we want to specify more than one test type in add_p(), we need to additionally use a test argument in the add_p() function and add all tests as a list():
for example:
We also need to specify the type of t-test. In this case, we use a classic two-sample t-test, so we add the test.args argument to the add_p() function:
Finally, to exclude variables from testing we need to use the include argument, here:
To add the corrected p-value, we pipe the output to add_q() function. In the add_q() function, we can select the correction method through the method argument. We select 'bonferroni':
In summary, the code will look like this:
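Put together, the testing step could be sketched as below. The data frame `clin` is invented, and `"chisq.test"` is gtsummary's name for the chi-squared test with R's default settings; `"chisq.test.no.correct"` is the variant without continuity correction.

```r
library(gtsummary)

# Hypothetical clinical toy data (all values invented for illustration)
set.seed(2)
clin <- data.frame(
  Label = factor(rep(c("N", "T"), each = 15)),
  Gender = factor(rep(c("F", "M"), length.out = 30)),
  Age = round(rnorm(30, mean = 60, sd = 8)),
  BMI = round(rnorm(30, mean = 27, sd = 3), 1),
  `Collected samples` = rep(c("P, U", "P, U, T"), each = 15),
  check.names = FALSE
)

tbl_tests <- clin |>
  tbl_summary(
    by = Label,
    type = c(Age, BMI) ~ "continuous2",
    statistic = all_continuous2() ~ c("{mean} ({sd})", "{min} - {max}")
  ) |>
  add_p(
    test = list(
      all_continuous2() ~ "t.test",
      Gender ~ "chisq.test"
    ),
    test.args = all_tests("t.test") ~ list(var.equal = TRUE),
    include = c(Gender, Age, BMI)   # every other variable is left untested
  ) |>
  add_q(method = "bonferroni")      # adds the corrected p-values
```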
And the output:
In one step we will bold all labels (simply pipe output to bold_labels()) and remove missing observations from the table (missing argument in tbl_summary() set to 'no'):
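Both changes are small, as the following sketch on invented data (with one BMI value deliberately missing) suggests.

```r
library(gtsummary)

# Hypothetical clinical toy data with one missing BMI value
clin <- data.frame(
  Label = factor(rep(c("N", "T"), each = 15)),
  Age = rep(c(48, 52, 55, 59, 61, 63, 66, 68, 70, 73, 45, 57, 64, 69, 75), 2),
  BMI = c(NA, seq(21, 34.5, length.out = 29))
)

tbl_bold <- clin |>
  tbl_summary(by = Label, missing = "no") |>  # no rows for missing entries
  bold_labels()                               # variable labels in bold
```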
The output:
Next, we want to modify the footnote to add the description of "P", "T", and "U". This is the first more complex task. We need to reference a specific cell of the final table so that the additional description can be attached to it as a footnote. The cell of interest is the one containing the label Collected samples. The function that helps here is modify_table_styling(). According to its documentation, the arguments of this function give access to the tibble 'table_body', which is later printed as the final, publication-ready table. Let's save our final output from tbl_summary() as 'table'. A new object appears in the global environment; it is a list of 8. Now we can type table$table_body:
The output:
To reference the cell containing the label 'Collected samples' in the table_body, we tell the modify_table_styling() function that we are interested in the column label, and that in this column we need the row containing the label 'Collected samples'. The rows argument takes a predicate expression (TRUE/FALSE) to decide whether a row is selected, i.e. label == 'Collected samples' returns TRUE or FALSE. Here is the code:
If TRUE is returned (we now reference our cell of interest), the footnote is updated with the string supplied via the footnote argument:
The updated code to modify the footnote of our final table:
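A sketch of this footnote step is shown below on invented data. The footnote text is a placeholder, since the meaning of P, U, and T has to come from the study itself. Note also that recent gtsummary releases may flag the `footnote` argument of `modify_table_styling()` as deprecated in favor of `modify_footnote_body()`.

```r
library(gtsummary)

# Hypothetical toy data with the sample-type annotation column
clin <- data.frame(
  Label = factor(rep(c("N", "T"), each = 15)),
  `Collected samples` = rep(c("P, U", "P, U, T"), each = 15),
  check.names = FALSE
)

table <- clin |>
  tbl_summary(by = Label)

# Placeholder wording: replace with the actual definitions of P, U, and T
note <- "P, U, T: abbreviations of the collected sample types"

tbl_foot <- table |>
  modify_table_styling(
    columns = label,                       # column holding the row labels
    rows = label == "Collected samples",   # predicate selecting the row
    footnote = note
  )
```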
The output of this code:
Finally, we would like to remove the zero entries in Type of tumor and Tumor grade. We will substitute these entries with a minus symbol "-", meaning no observations. This requires changes in the table body, which we can make via the modify_table_body() function. Based on the documentation, this function can be used together with dplyr functions (like arrange(), mutate(), etc.) to introduce changes in the table body in the following way:
Here, we use it together with mutate() across all columns containing statistical summaries (all_stat_cols() from gtsummary). We now need a tool that recognizes every string in these columns that starts with 0 and replaces it with "-". Such a function is gsub(). Its application is relatively simple; the basic form is gsub(pattern, replacement, x):
gsub() understands regular expressions (regex), i.e. sequences of characters describing patterns in text. The phrase "every string starting with 0, followed by any number of characters" is represented by the regex ^0.* (^ anchors the match at the start of the string, 0 matches the digit zero, and .* matches any run of characters up to the end of the string).
Therefore, our gsub function can be modified to:
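The behavior of this pattern can be checked on a few invented cell strings in base R. The second, stricter pattern is an optional alternative added here for comparison and is not part of the original tutorial.

```r
# Hypothetical cell contents as they might appear in the stat columns
cells <- c("0 (0%)", "12 (40%)", "10 (33%)", "0.5 (1.2)")

# Pattern from the text: anything that starts with the digit 0
loose <- gsub("^0.*", "-", cells)
loose
# "-" "12 (40%)" "10 (33%)" "-"

# Stricter alternative: only the exact string "0 (0%)"
strict <- gsub("^0 \\(0%\\)$", "-", cells)
strict
# "-" "12 (40%)" "10 (33%)" "0.5 (1.2)"
```

The loose pattern also replaces values such as "0.5 (1.2)", which is harmless when only count columns contain leading zeros but worth keeping in mind for continuous summaries.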
If we implement the gsub() function into mutate() and modify_table_body() we obtain:
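One way to wire these pieces together is sketched below on invented data, where the healthy group has no tumor grade at all. Plain functions are used instead of formula shortcuts so that each argument is explicit.

```r
library(gtsummary)

# Hypothetical toy data: volunteers (N) have no tumor grade recorded
tumor <- data.frame(
  Label = factor(rep(c("N", "T"), each = 15)),
  `Tumor grade` = factor(c(rep(NA, 15), rep(c("G1", "G2", "G3"), 5))),
  check.names = FALSE
)

tbl_dash <- tumor |>
  tbl_summary(by = Label, missing = "no") |>
  modify_table_body(
    function(body) {
      dplyr::mutate(
        body,
        # replace every summary cell that starts with 0 by "-"
        dplyr::across(
          all_stat_cols(),
          function(cell) gsub("^0.*", "-", cell)
        )
      )
    }
  )
```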
And merging it with our code into a final form:
Our final table looks like this:
To export the final table from RStudio, we will need to install the package called 'gt'.
We can save the table to our working directory in the following formats:
.html
.png
.tex, .rnw
.rtf
.docx
To save your table, add these lines of code:
For our table from the previous example, we would add:
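A sketch of the export step follows: the gtsummary table is converted to a gt object and written with gtsave(). The toy data and the file name `sm_table.html` are assumptions; other extensions from the list above may need additional software (for example, a headless browser for .png).

```r
library(gtsummary)

# Hypothetical toy data standing in for the final table
set.seed(1)
toy <- data.frame(
  Label = factor(rep(c("N", "T"), each = 20)),
  `SM 39:1;O2` = rlnorm(40, meanlog = 1.5, sdlog = 0.3),
  check.names = FALSE
)

final_table <- tbl_summary(toy, by = Label)

# Convert to a gt table and save it in the working directory
final_table |>
  as_gt() |>
  gt::gtsave(filename = "sm_table.html")
```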
The file will be ready in your working directory (wd). You can always check your current working directory with getwd():
We highly recommend watching this lecture by Daniel D. Sjoberg:
We also recommend reading and citing the paper published in The R Journal: