💪
Omics data visualization in R and Python
  • Introduction
    • From Authors
    • Virtual environments - let's begin
    • Getting started with Python
    • Getting started with R
    • Example data sets
  • PERFORMING FUNDAMENTAL OPERATIONS ON OMICs DATA USING R
    • Fundamental data structures
    • Loading data into R
    • Preferred formats in metabolomics and lipidomics analysis
    • Preprocess data type using Tidyverse package
    • Useful R tricks and features in OMICs mining
      • Application of pipe (%>%) functions
      • Changing data frames format with pivot_longer()
      • Data wrangling syntaxes useful in OMICs mining
      • Writing functions in R
      • The 'for' loop in R (advanced)
  • PERFORMING FUNDAMENTAL OPERATIONS ON OMICs DATA USING PYTHON
    • Fundamental data structures
    • Loading data into Python
  • Missing values handling in R
    • Missing values – Introduction
    • Detecting missing values (DataExplorer R package)
    • Filtering out columns containing mostly NAs
    • Data imputation by different available R libraries
      • Basic data imputation in R with dplyr and tidyr (tidyverse)
      • Data imputation using recipes library (tidymodels)
      • Replacing NAs via k-nearest neighbor (kNN) model (VIM library)
      • Replacing NAs via random forest (RF) model (randomForest library)
  • Missing values handling in Python
    • Detecting missing values
    • Filtering out columns containing mostly NAs
    • Data imputation
  • Data transformation, scaling, and normalization in R
    • Data normalization in R - fundamentals
    • Data normalization to the internal standards (advanced)
    • Batch effect corrections in R (advanced)
    • Data transformation and scaling - introduction
    • Data transformation and scaling using different available R packages
      • Data transformation and scaling using mutate()
      • Data transformation and scaling using recipes R package
      • Data Normalization – bestNormalize R package
  • Data transformation, scaling, and normalization in Python
    • Data Transformation and scaling in Python
  • Metabolites and lipids descriptive statistical analysis in R
    • Computing descriptive statistics in R
    • Using gtsummary to create publication-ready tables
    • Basic plotting in R
      • Bar charts
      • Box plots
      • Histograms
      • Density plots
      • Scatter plots
      • Dot plots with ggplot2 and tidyplots
      • Correlation heat maps
    • Customizing ggpubr and ggplot2 charts in R
    • Creating interactive plots with ggplotly
    • GGally for quick overviews
  • Metabolites and lipids descriptive statistics analysis in Python
    • Basic plotting
    • Scatter plots and linear regression
    • Correlation analysis
  • Metabolites and lipids univariate statistics in R
    • Two sample comparisons in R
    • Multi sample comparisons in R
    • Adjustments of p-values for multiple comparisons
    • Effect size computation and interpretation
    • Graphical representation of univariate statistics
      • Results of tests as annotations in the charts
      • Volcano plots
      • Lipid maps and acyl-chain plots
  • Metabolites and lipids univariate statistical analysis in Python
    • Two sample comparisons in Python
    • Multi-sample comparisons in Python
    • Statistical annotations on plots
  • Metabolites and lipids multivariate statistical analysis in R
    • Principal Component Analysis (PCA)
    • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    • Uniform Manifold Approximation and Projection (UMAP)
    • Partial Least Squares (PLS)
    • Orthogonal Partial Least Squares (OPLS)
    • Hierarchical Clustering (HC)
      • Dendrograms
      • Heat maps with clustering
      • Interactive heat maps
  • Metabolites and lipids multivariate statistical analysis in Python
    • Principal Component Analysis
    • t-Distributed Stochastic Neighbor Embedding
    • Uniform Manifold Approximation and Projection
    • PLS Discriminant Analysis
    • Clustered heatmaps
  • OMICS IN MACHINE LEARNING APPROACHES IN R AND PYTHON
    • Application of selected models to OMICs data
    • OMICs machine learning – Examples
  • References
    • Library versions
Powered by GitBook
On this page
  • Data set used for descriptive statistics computations
  • dlookr package
  • psych package
  • summarytools package
  • rstatix package
  • skimr package
  • dplyr package
  1. Metabolites and lipids descriptive statistical analysis in R

Computing descriptive statistics in R

Metabolites and lipids descriptive statistical analysis in R

PreviousData Transformation and scaling in PythonNextUsing gtsummary to create publication-ready tables

Last updated 2 months ago

R offers a variety of libraries for computing descriptive statistics like central tendency measures (mean, median, mode, ...) and uncertainty or spread measures (standard deviation, standard error of the mean, interquartile range, ...). Here, we will present six selected packages that provide a wide range of descriptive statistics parameters via simple code. We advise you to analyze all solutions and select the most suitable one to begin with.

Important: Certain parameters (e.g.: skewness and kurtosis), can be defined in a different way across different packages. Always carefully read the function documentation and vignette to understand in what way the summary statistics you are interested in are calculated! It will be necessary to explain these computations in the materials and methods of your paper, in the section concerning statistical analysis. Additionally, be sure that you also understand the meaning of the value obtained in the computations! Find additional information or check references suggested by authors of a package if the description of computations in documentation is too brief for you.

Data set used for descriptive statistics computations

For all basic computations, we rely on the PDAC data set (called in R data), which can be downloaded from the Introduction section (subchapter "Example data sets") or from the link below:

dlookr package

The dlookr package was discontinued on CRAN. However, it can still be downloaded from GitHub:

# Installing dlookr package
install.packages(“devtools”)
devtools::install_github("choonghyunryu/dlookr")

# Check the vignette of this package: https://choonghyunryu.github.io/dlookr/

It can be applied for diagnosing data quality, imputation of missing entries, data transformation, and exploratory data analysis. The package overview is available here:

The descriptive statistics can be computed using the describe() function:

# Calling the library and tidyverse collection:
library(dlookr)
library(tidyverse)

# Documentation regarding describe() function:
?describe()

# Using the describe() function:
ds.dlookr.v1 <- describe(data)

print(ds.dlookr.v1)

This way, we use our wide tibble 'data' for computing descriptive statistics and we obtain this output:

As you see, the simple function can provide you with the total number of samples (n), the number of missing values for every feature (na), mean, median (as p50), standard deviation (sd), standard error of the mean (se_mean), interquartile range (IQR), measures of distribution asymmetry like skewness or how heavy are distribution tails ('tailedness') like kurtosis, and quantiles/percentages. The describe() function allows for modifying the content of the output tibble via statistics and quantiles:

# Using the describe() function - modifications of the output:
ds.dlookr.v2 <- describe(data, 
                         statistics = c("mean", 
                                        "sd", 
                                        "skewness", 
                                        "kurtosis",
                                        "quantiles"),
                         quantiles = c(0, 0.25, 0.5,0.75,1)
                         ) 

print(ds.dlookr.v2)

Results in:

However, all parameters we computed characterize features without considering biological groups. Usually, -omics scientists are more interested in descriptive statistics computed for each biological group separately. According to the describe() function documentation, we can also introduce a grouped tibble for computations (using group_by() function from the tidyverse collection). It results in an output tibble presenting descriptive statistics for each biological group:

# Computing descriptive statistics for each group via describe():
ds.dlookr.v3 <- 
  data %>% 
  group_by(Label) %>%
  describe()

print(ds.dlookr.v3)

The function computes now all parameters for each biological group separately:

psych package

The psych package was primarily developed for psychology-related sciences. It contains useful statistical tools, which can be applied more broadly than only to data collected within psychological studies. More information about the package can be found here:

The psych package contains the function for computing descriptive statistics called the same as the dlookr package - describe():

# Installing psych package:
install.packages("psych")

# Calling library
library(psych)

# Investigating package vignette:
?describe()

# Computing general descriptive statistics:
ds.psych.v1 <- describe(data)

print(ds.psych.v1)

The output:

Among parameters obtained via describe() function from the psych package are: the total number of observations (n), mean, standard deviation (sd), median, trimmed mean (mean calculated after removing a part of observations at the lower and upper end of the distribution), mean absolute deviation (mad), which is a measure of variability and represents the distance between all observations and their mean, minimum and maximum values (min, max), and their difference (range), skewness, kurtosis, and standard error (se) - which is a ratio of standard deviation and the square root of the total number of observations.

The describeBy() allows computing descriptive statistics for each biological group separately:

# Checking documentation for the 'describeBy():
?describeBy()

# Application of the describeBy.
# Computing descriptive statistics for each biological group:
ds.psych.v2 <- describeBy(data, group = data$Label)

print(ds.psych.v2)

By specifying that our grouping variable is Label (group = data$Label), three separate data frames are computed for N, PAN, and T:

summarytools package

The summarytools is a simple package providing functions for data exploration and reporting. It can be used for a fast recheck of descriptive statistics for a selected feature. For more information, check the "Introduction to summarytools" on the CRAN:

For computing descriptive statistics, the descr() function:

# Installation of the summarytools
install.packages("summarytools")

# Calling library
library(summarytools)

# Investigating the function of interest:
?descr()

# Computing descriptive statistics for each biological group:
ds.summarytools <- 
  data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipids', 
               values_to = 'Concentration') %>% 
  group_by(Lipids, Label) %>%
  descr()
  
print(ds.summarytools$`Lipids = CE 16:0, Label = 1`)
print(ds.summarytools$`Lipids = CE 16:0, Label = 2`)
print(ds.summarytools$`Lipids = CE 16:0, Label = 3`)

The output:

rstatix package

The rstatix package is an excellent and complete solution for a brief data inspection and computing descriptive and univariate statistics of -omics data. The library provides functions fully compatible with the tidyverse collection. To learn more about the rstatix package visit:

Here, we will frequently rely on the rstatix library because it provides tidy tibbles that can be further used for data visualization.

The get_summary_stats() function provides all necessary descriptive statistics to be reported in a publication, including mean, median, standard deviation, standard error of the mean, mean absolute deviation, interquartile range, Q1 and Q3, minimum and maximum values, and a 95% confidence interval of the mean. The output from the get_summary_stats() can be adjusted via the "type" argument, to be, for example: "mean_sd" or "median_iqr" only. We will compute "full" descriptive statistics offered by this package. For computing the descriptive statistics for all features, we can either use a wide tibble or a long tibble (via pivot_longer()). In both cases, we need to group entries - in the case of the wide table by the Label column, and for the long tibble: first by feature (Lipid column), and next by biological group (Label column):

# Installation of rstatix package:
install.packages("rstatix")

# Calling the libary:
library(rstatix)

# Checking the function of interest in the vignette:
?get_summary_stats()

# Computing descriptive statistics for each biological group via wide tibble:
ds.rstatix <-
  data %>% 
  group_by(Label) %>% 
  get_summary_stats() %>% 
  print(n=400)

# Computing descriptive statistics for each biological group via long tibble:
ds.rstatix <- 
  data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipid', 
               values_to = 'Concentration') %>% 
  group_by(Lipid, Label) %>%
  get_summary_stats(type = 'full')
  
print(ds.rstatix)

The output:

skimr package

The skimr package contains skim() function which is an effective alternative to the summary() function. summary() is a base R function that allows summarizing a vector, data frame, output from ANOVA calculations, or a model, etc.

The skim() function can be used for computing descriptive statistics. It quickly provides an overview of data frames, handling all data types, and applying summary functions based on the column types in a data frame subjected to skim. The summary functions can be selected and customized using skim_with() function. More information about the skimr package can be found here:

And here:

First, we will show how to use the skim() function to prepare general overviews:

# Installation of the skimr package:
install.packages("skimr")

# Calling library:
library(skimr)

# Checking the information about skim() in documentation:
?skim()

# Using skim() to compute fast summary statistics for the 'data' tibble:
ds.skim <-
  data %>%
  group_by(Label) %>%
  skim()
  
# Print results:
print(ds.skim)

A part of the output from the basic skim() function:

The skimr package also enables generating a fully customized skim via skim_with() function. The skim_with() returns a function we will use instead of the default skim(). We will store it as 'new_skim()'. First, we must define what data types will be subjected to what skim. In our case, we will define a skim to be performed on numeric columns only. Then, for these columns, using sfl() helper - what statistical parameters should be computed and via what functions. In the sfl() helper, we can use default functions (e.g., mean, median, sd, iqr, ...), functions from other packages (e.g., moments::kurtosis), or functions defined by us in R (e.g., ~range(.)). This way, our skim can be fully customized to compute any descriptive statistics we need. Using 'append' set to TRUE - we can add new parameters from sfl() to default ones computed via skim(), or by setting FALSE - we can create a skim containing only our new parameters from sfl(). The summary of skim_with function is presented below:

The application of skim_with() is presented below:

# Customizing skim - first read about the skim_with() function:
?skim_with()

# Install library 'parameters' to compute skewness and kurtosis:
install.packages("parameters")

# Install library 'moments' to compute sample moment:
install.packages("moments")

# Define an additional function for computing range:
range <- function(x) {max(x)-min(x)}

# Customize your skim:
new_skim <- skim_with(
  numeric = sfl(iqr = IQR, # *
                mad = mad, # *
                Q1 = ~quantile(., 0.25), # *
                median = median, # *
                Q3 = ~quantile(., 0.75), # *
                range = ~range(.), # **
                skewness = parameters::skewness,  # ***
                kurtosis.comp = parameters::kurtosis, # ***
                moment = moments::moment), # ***
  append = FALSE        
) 

# Explanations:
# * 
# iqr, mad, Q1, median, Q3 will be computed using base R functions.
# **
# range will be computed using a function defined in R and stored in the global environment.
# ***
# skewness, kurtosis, moment will be computed using libraries parameters (s,k), and moments (m).

# Apply new_skim() to skim out 'data' tibble:
data %>%
  group_by(Label) %>%
  new_skim()

The output:

dplyr package

The dplyr package provides a similar function to select(), filer(), mutate(), arrange(), slice(), and group_by(). It is called summarise(). You should be familiar with all of them at the beginning of the R application for -omics data analysis and visualization. Using summarise(), we can quickly prepare a tibble with descriptive statistics. More about summarise() here:

The statistical parameters are computed based on the functions supplied to summarise() in its arguments. To compute complete descriptive statistics, we will first transform a wide 'data' tibble into a long tibble with 4 columns (Sample Name, Label, Lipid, Concentration), then, our data will be grouped by Lipid and Label (biological group), and finally, we will pipe the data to the summarise() function to compute descriptive statistics using the Concentration column. Functions for computing descriptive statistics in summarise() can be R base functions, functions defined and stored in the global environment, or functions obtained from other packages (similarly to skim_with):

# Calling tidyverse collection:
library(tidyverse)

# Defining an additional function for computing a range of concentrations:
range <- function(x) {max(x)-min(x)}

# The application of summarise() function to compute descriptive statistics:
data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipid', 
               values_to = 'Concentration') %>% 
  group_by(Lipid, Label) %>%
  summarise(mean = mean(Concentration),
            sd = sd(Concentration),
            min = min(Concentration),
            max = max(Concentration),
            range = range(Concentration),
            mad = mad (Concentration), 
            median = median(Concentration),
            iqr = IQR(Concentration),
            Q1 = quantile(Concentration, 0.25),
            Q2 = quantile(Concentration, 0.5),
            Q3 = quantile(Concentration, 0.75),
            skewness = parameters::skewness(Concentration),
            kurtosis.comp = parameters::kurtosis(Concentration),
            moment = moments::moment(Concentration))

The output:

A tibble with descriptive statistics can be exported from R, e.g. as a .csv file:

# Saving the tibble with descriptive statistics in the working directory.
# Storing output from the summarise() function as 'ds.summarise':
data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipid', 
               values_to = 'Concentration') %>% 
  group_by(Lipid, Label) %>%
  summarise(mean = mean(Concentration),
            sd = sd(Concentration),
            min = min(Concentration),
            max = max(Concentration),
            range = range(Concentration),
            mad = mad (Concentration), 
            median = median(Concentration),
            iqr = IQR(Concentration),
            Q1 = quantile(Concentration, 0.25),
            Q2 = quantile(Concentration, 0.5),
            Q3 = quantile(Concentration, 0.75),
            skewness = parameters::skewness(Concentration),
            kurtosis.comp = parameters::kurtosis(Concentration),
            moment = moments::moment(Concentration)) -> ds.summarise

# Saving 'ds.summarise' as a .csv file in the working directory:
write.csv(ds.summarise, "Descriptive_statistics_lipidomics_PDAC.csv")
Tools for Data Diagnosis, Exploration, Transformation
The dlookr package overview.
The Personality Project's Guide to R
The information about the psych package.
Introduction to summarytools
The summarytools package at CRAN.
Logo
Pipe-Friendly Framework for Basic Statistical Tests
Introduction to the rstatix libarary.
Skim a data frame, getting useful summary statistics — skim
Introduction to the skimr package.
Using Skimr
Using the skimr package (CRAN).
Logo
348KB
Lipidomics_dataset.xlsx
This PDAC data set is used for all computations and visualizations throughout the GitBook. Use it in R for descriptive statistics computations and introduce it as data.
Descriptive statistics computed using describe() function from the dlookr package.
Customization of descriptive statistics calculations in describe() function from the dlookr package.
Descriptive statistics calculations for each biological group separately via describe() function from the dlookr package.
General descriptive statistics via describe() function from the psych package.
Descriptive statistics computed for biological groups via the describeBy() function from the psych package.
Computing descriptive statistics via descr() function from the summarytools package.
Descriptive statistics computed using get_summary_stats() function from the rstatix library.
Comarison: A/ the base R summary() function; B/ skim() from the skimr package.
Descriptive statistics (summary statistics) computed via basic skim() function from the skimr package.
The summary presenting application of the skim_with() function for creating new_skim() function.
Using customized skim - new_skim() - created via skim_with() function (from skimr() package).
A tibble filled with descriptive statistics computed for each biological group - obtained via summarise() from the dplyr library, base R functions, range() function defined by us in R, skewness() and kurtosis() functions from the parameters library, and moment() function from the moments package.
Logo