Computing descriptive statistics in R

Metabolites and lipids descriptive statistical analysis in R

R offers a variety of libraries for computing descriptive statistics like central tendency measures (mean, median, mode, ...) and uncertainty or spread measures (standard deviation, standard error of the mean, interquartile range, ...). Here, we will present six selected packages that provide a wide range of descriptive statistics parameters via simple code. We advise you to analyze all solutions and select the most suitable one to begin with.

Important: Certain parameters (e.g.: skewness and kurtosis), can be defined in a different way across different packages. Always carefully read the function documentation and vignette to understand in what way the summary statistics you are interested in are calculated! It will be necessary to explain these computations in the materials and methods of your paper, in the section concerning statistical analysis. Additionally, be sure that you also understand the meaning of the value obtained in the computations! Find additional information or check references suggested by authors of a package if the description of computations in documentation is too brief for you.

Data set used for descriptive statistics computations

For all basic computations, we rely on the PDAC data set (called in R data), which can be downloaded from the Introduction section (subchapter "Example data sets") or from the link below:

This PDAC data set is used for all computations and visualizations throughout the GitBook. Use it in R for descriptive statistics computations and introduce it as data.

dlookr package

The dlookr package was discontinued on CRAN. However, it can still be downloaded from GitHub:

# Installing dlookr package
install.packages(“devtools”)
devtools::install_github("choonghyunryu/dlookr")

# Check the vignette of this package: https://choonghyunryu.github.io/dlookr/

It can be applied for diagnosing data quality, imputation of missing entries, data transformation, and exploratory data analysis. The package overview is available here:

The dlookr package overview.

The descriptive statistics can be computed using the describe() function:

# Calling the library and tidyverse collection:
library(dlookr)
library(tidyverse)

# Documentation regarding describe() function:
?describe()

# Using the describe() function:
ds.dlookr.v1 <- describe(data)

print(ds.dlookr.v1)

This way, we use our wide tibble 'data' for computing descriptive statistics and we obtain this output:

Descriptive statistics computed using describe() function from the dlookr package.

As you see, the simple function can provide you with the total number of samples (n), the number of missing values for every feature (na), mean, median (as p50), standard deviation (sd), standard error of the mean (se_mean), interquartile range (IQR), measures of distribution asymmetry like skewness or how heavy are distribution tails ('tailedness') like kurtosis, and quantiles/percentages. The describe() function allows for modifying the content of the output tibble via statistics and quantiles:

# Using the describe() function - modifications of the output:
ds.dlookr.v2 <- describe(data, 
                         statistics = c("mean", 
                                        "sd", 
                                        "skewness", 
                                        "kurtosis",
                                        "quantiles"),
                         quantiles = c(0, 0.25, 0.5,0.75,1)
                         ) 

print(ds.dlookr.v2)

Results in:

Customization of descriptive statistics calculations in describe() function from the dlookr package.

However, all parameters we computed characterize features without considering biological groups. Usually, -omics scientists are more interested in descriptive statistics computed for each biological group separately. According to the describe() function documentation, we can also introduce a grouped tibble for computations (using group_by() function from the tidyverse collection). It results in an output tibble presenting descriptive statistics for each biological group:

# Computing descriptive statistics for each group via describe():
ds.dlookr.v3 <- 
  data %>% 
  group_by(Label) %>%
  describe()

print(ds.dlookr.v3)

The function computes now all parameters for each biological group separately:

Descriptive statistics calculations for each biological group separately via describe() function from the dlookr package.

psych package

The psych package was primarily developed for psychology-related sciences. It contains useful statistical tools, which can be applied more broadly than only to data collected within psychological studies. More information about the package can be found here:

The information about the psych package.

The psych package contains the function for computing descriptive statistics called the same as the dlookr package - describe():

# Installing psych package:
install.packages("psych")

# Calling library
library(psych)

# Investigating package vignette:
?describe()

# Computing general descriptive statistics:
ds.psych.v1 <- describe(data)

print(ds.psych.v1)

The output:

General descriptive statistics via describe() function from the psych package.

Among parameters obtained via describe() function from the psych package are: the total number of observations (n), mean, standard deviation (sd), median, trimmed mean (mean calculated after removing a part of observations at the lower and upper end of the distribution), mean absolute deviation (mad), which is a measure of variability and represents the distance between all observations and their mean, minimum and maximum values (min, max), and their difference (range), skewness, kurtosis, and standard error (se) - which is a ratio of standard deviation and the square root of the total number of observations.

The describeBy() allows computing descriptive statistics for each biological group separately:

# Checking documentation for the 'describeBy():
?describeBy()

# Application of the describeBy.
# Computing descriptive statistics for each biological group:
ds.psych.v2 <- describeBy(data, group = data$Label)

print(ds.psych.v2)

By specifying that our grouping variable is Label (group = data$Label), three separate data frames are computed for N, PAN, and T:

Descriptive statistics computed for biological groups via the describeBy() function from the psych package.

summarytools package

The summarytools is a simple package providing functions for data exploration and reporting. It can be used for a fast recheck of descriptive statistics for a selected feature. For more information, check the "Introduction to summarytools" on the CRAN:

The summarytools package at CRAN.

For computing descriptive statistics, the descr() function:

# Installation of the summarytools
install.packages("summarytools")

# Calling library
library(summarytools)

# Investigating the function of interest:
?descr()

# Computing descriptive statistics for each biological group:
ds.summarytools <- 
  data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipids', 
               values_to = 'Concentration') %>% 
  group_by(Lipids, Label) %>%
  descr()
  
print(ds.summarytools$`Lipids = CE 16:0, Label = 1`)
print(ds.summarytools$`Lipids = CE 16:0, Label = 2`)
print(ds.summarytools$`Lipids = CE 16:0, Label = 3`)

The output:

Computing descriptive statistics via descr() function from the summarytools package.

rstatix package

The rstatix package is an excellent and complete solution for a brief data inspection and computing descriptive and univariate statistics of -omics data. The library provides functions fully compatible with the tidyverse collection. To learn more about the rstatix package visit:

Introduction to the rstatix libarary.

Here, we will frequently rely on the rstatix library because it provides tidy tibbles that can be further used for data visualization.

The get_summary_stats() function provides all necessary descriptive statistics to be reported in a publication, including mean, median, standard deviation, standard error of the mean, mean absolute deviation, interquartile range, Q1 and Q3, minimum and maximum values, and a 95% confidence interval of the mean. The output from the get_summary_stats() can be adjusted via the "type" argument, to be, for example: "mean_sd" or "median_iqr" only. We will compute "full" descriptive statistics offered by this package. For computing the descriptive statistics for all features, we can either use a wide tibble or a long tibble (via pivot_longer()). In both cases, we need to group entries - in the case of the wide table by the Label column, and for the long tibble: first by feature (Lipid column), and next by biological group (Label column):

# Installation of rstatix package:
install.packages("rstatix")

# Calling the libary:
library(rstatix)

# Checking the function of interest in the vignette:
?get_summary_stats()

# Computing descriptive statistics for each biological group via wide tibble:
ds.rstatix <-
  data %>% 
  group_by(Label) %>% 
  get_summary_stats() %>% 
  print(n=400)

# Computing descriptive statistics for each biological group via long tibble:
ds.rstatix <- 
  data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipid', 
               values_to = 'Concentration') %>% 
  group_by(Lipid, Label) %>%
  get_summary_stats(type = 'full')
  
print(ds.rstatix)

The output:

Descriptive statistics computed using get_summary_stats() function from the rstatix library.

skimr package

The skimr package contains skim() function which is an effective alternative to the summary() function. summary() is a base R function that allows summarizing a vector, data frame, output from ANOVA calculations, or a model, etc.

Comarison: A/ the base R summary() function; B/ skim() from the skimr package.

The skim() function can be used for computing descriptive statistics. It quickly provides an overview of data frames, handling all data types, and applying summary functions based on the column types in a data frame subjected to skim. The summary functions can be selected and customized using skim_with() function. More information about the skimr package can be found here:

Introduction to the skimr package.

And here:

Using the skimr package (CRAN).

First, we will show how to use the skim() function to prepare general overviews:

# Installation of the skimr package:
install.packages("skimr")

# Calling library:
library(skimr)

# Checking the information about skim() in documentation:
?skim()

# Using skim() to compute fast summary statistics for the 'data' tibble:
ds.skim <-
  data %>%
  group_by(Label) %>%
  skim()
  
# Print results:
print(ds.skim)

A part of the output from the basic skim() function:

Descriptive statistics (summary statistics) computed via basic skim() function from the skimr package.

The skimr package also enables generating a fully customized skim via skim_with() function. The skim_with() returns a function we will use instead of the default skim(). We will store it as 'new_skim()'. First, we must define what data types will be subjected to what skim. In our case, we will define a skim to be performed on numeric columns only. Then, for these columns, using sfl() helper - what statistical parameters should be computed and via what functions. In the sfl() helper, we can use default functions (e.g., mean, median, sd, iqr, ...), functions from other packages (e.g., moments::kurtosis), or functions defined by us in R (e.g., ~range(.)). This way, our skim can be fully customized to compute any descriptive statistics we need. Using 'append' set to TRUE - we can add new parameters from sfl() to default ones computed via skim(), or by setting FALSE - we can create a skim containing only our new parameters from sfl(). The summary of skim_with function is presented below:

The summary presenting application of the skim_with() function for creating new_skim() function.

The application of skim_with() is presented below:

# Customizing skim - first read about the skim_with() function:
?skim_with()

# Install library 'parameters' to compute skewness and kurtosis:
install.packages("parameters")

# Install library 'moments' to compute sample moment:
install.packages("moments")

# Define an additional function for computing range:
range <- function(x) {max(x)-min(x)}

# Customize your skim:
new_skim <- skim_with(
  numeric = sfl(iqr = IQR, # *
                mad = mad, # *
                Q1 = ~quantile(., 0.25), # *
                median = median, # *
                Q3 = ~quantile(., 0.75), # *
                range = ~range(.), # **
                skewness = parameters::skewness,  # ***
                kurtosis.comp = parameters::kurtosis, # ***
                moment = moments::moment), # ***
  append = FALSE        
) 

# Explanations:
# * 
# iqr, mad, Q1, median, Q3 will be computed using base R functions.
# **
# range will be computed using a function defined in R and stored in the global environment.
# ***
# skewness, kurtosis, moment will be computed using libraries parameters (s,k), and moments (m).

# Apply new_skim() to skim out 'data' tibble:
data %>%
  group_by(Label) %>%
  new_skim()

The output:

Using customized skim - new_skim() - created via skim_with() function (from skimr() package).

dplyr package

The dplyr package provides a similar function to select(), filer(), mutate(), arrange(), slice(), and group_by(). It is called summarise(). You should be familiar with all of them at the beginning of the R application for -omics data analysis and visualization. Using summarise(), we can quickly prepare a tibble with descriptive statistics. More about summarise() here:

The statistical parameters are computed based on the functions supplied to summarise() in its arguments. To compute complete descriptive statistics, we will first transform a wide 'data' tibble into a long tibble with 4 columns (Sample Name, Label, Lipid, Concentration), then, our data will be grouped by Lipid and Label (biological group), and finally, we will pipe the data to the summarise() function to compute descriptive statistics using the Concentration column. Functions for computing descriptive statistics in summarise() can be R base functions, functions defined and stored in the global environment, or functions obtained from other packages (similarly to skim_with):

# Calling tidyverse collection:
library(tidyverse)

# Defining an additional function for computing a range of concentrations:
range <- function(x) {max(x)-min(x)}

# The application of summarise() function to compute descriptive statistics:
data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipid', 
               values_to = 'Concentration') %>% 
  group_by(Lipid, Label) %>%
  summarise(mean = mean(Concentration),
            sd = sd(Concentration),
            min = min(Concentration),
            max = max(Concentration),
            range = range(Concentration),
            mad = mad (Concentration), 
            median = median(Concentration),
            iqr = IQR(Concentration),
            Q1 = quantile(Concentration, 0.25),
            Q2 = quantile(Concentration, 0.5),
            Q3 = quantile(Concentration, 0.75),
            skewness = parameters::skewness(Concentration),
            kurtosis.comp = parameters::kurtosis(Concentration),
            moment = moments::moment(Concentration))

The output:

A tibble filled with descriptive statistics computed for each biological group - obtained via summarise() from the dplyr library, base R functions, range() function defined by us in R, skewness() and kurtosis() functions from the parameters library, and moment() function from the moments package.

A tibble with descriptive statistics can be exported from R, e.g. as a .csv file:

# Saving the tibble with descriptive statistics in the working directory.
# Storing output from the summarise() function as 'ds.summarise':
data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipid', 
               values_to = 'Concentration') %>% 
  group_by(Lipid, Label) %>%
  summarise(mean = mean(Concentration),
            sd = sd(Concentration),
            min = min(Concentration),
            max = max(Concentration),
            range = range(Concentration),
            mad = mad (Concentration), 
            median = median(Concentration),
            iqr = IQR(Concentration),
            Q1 = quantile(Concentration, 0.25),
            Q2 = quantile(Concentration, 0.5),
            Q3 = quantile(Concentration, 0.75),
            skewness = parameters::skewness(Concentration),
            kurtosis.comp = parameters::kurtosis(Concentration),
            moment = moments::moment(Concentration)) -> ds.summarise

# Saving 'ds.summarise' as a .csv file in the working directory:
write.csv(ds.summarise, "Descriptive_statistics_lipidomics_PDAC.csv")

Last updated