> For the complete documentation index, see [llms.txt](https://laboratory-of-lipid-metabolism-a.gitbook.io/omics-data-visualization-in-r-and-python/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://laboratory-of-lipid-metabolism-a.gitbook.io/omics-data-visualization-in-r-and-python/metabolites-and-lipids-descriptive-statistical-analysis-in-r/computing-descriptive-statistics-in-r.md).

# Computing descriptive statistics in R

R offers a variety of libraries for computing descriptive statistics like *central tendency measures (mean, median, mode, ...)* and *uncertainty* or *spread measures (standard deviation, standard error of the mean, interquartile range, ...)*. Here, we will present six selected packages that provide a wide range of descriptive statistics parameters via simple code. We advise you to analyze all solutions and select the most suitable one to begin with.

**Important:** Certain parameters (e.g.: *skewness* and *kurtosis*), can be defined in a different way across different packages. **Always carefully read** **the function documentation and vignette to understand in what way the summary statistics you are interested in are calculated!** It will be necessary to explain these computations in the materials and methods of your paper, in the section concerning statistical analysis. **Additionally, be sure that you also understand the meaning of the value obtained in the computations! Find additional information or check references suggested by authors of a package if the description of computations in documentation is too brief for you.**&#x20;

## Data set used for descriptive statistics computations

For all basic computations, we rely on the PDAC data set (called in R `data`), **which can be downloaded from the Introduction section (subchapter "Example data sets") or from the link below:**

{% file src="/files/zzjMEQue7WRa4M1DvL7t" %}
This PDAC data set is used for all computations and visualizations throughout the GitBook. Use it in R for descriptive statistics computations and introduce it as `data`.
{% endfile %}

## dlookr package

The dlookr package was discontinued on CRAN. However, it can still be downloaded from GitHub:

```r
# Installing dlookr package
install.packages(“devtools”)
devtools::install_github("choonghyunryu/dlookr")

# Check the vignette of this package: https://choonghyunryu.github.io/dlookr/
```

It can be applied for diagnosing data quality, imputation of missing entries, data transformation, and exploratory data analysis. The package overview is available here:

{% embed url="<https://choonghyunryu.github.io/dlookr/>" %}
The dlookr package overview.
{% endembed %}

The descriptive statistics can be computed using the describe() function:

```r
# Calling the library and tidyverse collection:
library(dlookr)
library(tidyverse)

# Documentation regarding describe() function:
?describe()

# Using the describe() function:
ds.dlookr.v1 <- describe(data)

print(ds.dlookr.v1)
```

This way, we use our wide tibble 'data' for computing descriptive statistics and we obtain this output:

<figure><img src="/files/RNoeDOwFV6a2OSrf8Var" alt=""><figcaption><p>Descriptive statistics computed using describe() function from the dlookr package.</p></figcaption></figure>

As you see, the simple function can provide you with the total number of samples (n), the number of missing values for every feature (na), mean, median (as p50), standard deviation (sd), standard error of the mean (se\_mean), interquartile range (IQR), measures of distribution asymmetry like skewness or how heavy are distribution tails ('tailedness') like kurtosis, and quantiles/percentages. The describe() function allows for modifying the content of the output tibble via statistics and quantiles:

```r
# Using the describe() function - modifications of the output:
ds.dlookr.v2 <- describe(data, 
                         statistics = c("mean", 
                                        "sd", 
                                        "skewness", 
                                        "kurtosis",
                                        "quantiles"),
                         quantiles = c(0, 0.25, 0.5,0.75,1)
                         ) 

print(ds.dlookr.v2)
```

Results in:

<figure><img src="/files/rQxZusUq0rlKNdCtX7cV" alt=""><figcaption><p>Customization of descriptive statistics calculations in describe() function from the dlookr package.</p></figcaption></figure>

However, all parameters we computed characterize features without considering biological groups. Usually, -omics scientists are more interested in descriptive statistics computed for each biological group separately. According to the describe() function documentation, we can also introduce a grouped tibble for computations (using group\_by() function from the tidyverse collection). It results in an output tibble presenting descriptive statistics for each biological group:

```r
# Computing descriptive statistics for each group via describe():
ds.dlookr.v3 <- 
  data %>% 
  group_by(Label) %>%
  describe()

print(ds.dlookr.v3)
```

The function computes now all parameters for each biological group separately:

<figure><img src="/files/tRP5bFBR7yqoITTDnZn0" alt=""><figcaption><p>Descriptive statistics calculations for each biological group separately via describe() function from the dlookr package.</p></figcaption></figure>

## psych package

The psych package was primarily developed for psychology-related sciences. It contains useful statistical tools, which can be applied more broadly than only to data collected within psychological studies. More information about the package can be found here:

{% embed url="<https://personality-project.org/r/psych/>" %}
The information about the psych package.
{% endembed %}

The psych package contains the function for computing descriptive statistics called the same as the dlookr package - describe():

```r
# Installing psych package:
install.packages("psych")

# Calling library
library(psych)

# Investigating package vignette:
?describe()

# Computing general descriptive statistics:
ds.psych.v1 <- describe(data)

print(ds.psych.v1)
```

The output:

<figure><img src="/files/vWGEIY7yq2fCwWccH8EB" alt=""><figcaption><p>General descriptive statistics via describe() function from the psych package.</p></figcaption></figure>

Among parameters obtained via describe() function from the psych package are: the total number of observations (n), mean, standard deviation (sd), median, trimmed mean (mean calculated after removing a part of observations at the lower and upper end of the distribution), mean absolute deviation (mad), which is a measure of variability and represents the distance between all observations and their mean, minimum and maximum values (min, max), and their difference (range), skewness, kurtosis, and standard error (se) - which is a ratio of standard deviation and the square root of the total number of observations.

The describeBy() allows computing descriptive statistics for each biological group separately:

```r
# Checking documentation for the 'describeBy():
?describeBy()

# Application of the describeBy.
# Computing descriptive statistics for each biological group:
ds.psych.v2 <- describeBy(data, group = data$Label)

print(ds.psych.v2)
```

By specifying that our grouping variable is `Label` (group = data$Label), three separate data frames are computed for N, PAN, and T:

<figure><img src="/files/h0fkf7pi7VB7p6W1vwNz" alt=""><figcaption><p>Descriptive statistics computed for biological groups via the describeBy() function from the psych package.</p></figcaption></figure>

## summarytools package

The summarytools is a simple package providing functions for data exploration and reporting. It can be used for a fast recheck of descriptive statistics for a selected feature. For more information, check the "Introduction to summarytools" on the CRAN:

{% embed url="<https://cran.r-project.org/web/packages/summarytools/vignettes/introduction.html>" %}
The summarytools package at CRAN.
{% endembed %}

For computing descriptive statistics, the descr() function:&#x20;

```r
# Installation of the summarytools
install.packages("summarytools")

# Calling library
library(summarytools)

# Investigating the function of interest:
?descr()

# Computing descriptive statistics for each biological group:
ds.summarytools <- 
  data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipids', 
               values_to = 'Concentration') %>% 
  group_by(Lipids, Label) %>%
  descr()
  
print(ds.summarytools$`Lipids = CE 16:0, Label = 1`)
print(ds.summarytools$`Lipids = CE 16:0, Label = 2`)
print(ds.summarytools$`Lipids = CE 16:0, Label = 3`)
```

The output:

<figure><img src="/files/STi4PyPsgB3hR1AQYLXO" alt=""><figcaption><p>Computing descriptive statistics via descr() function from the summarytools package.</p></figcaption></figure>

## rstatix package

The rstatix package is an excellent and complete solution for a brief data inspection and computing descriptive and univariate statistics of -omics data. The library provides functions fully compatible with the tidyverse collection. To learn more about the rstatix package visit:

{% embed url="<https://rpkgs.datanovia.com/rstatix/>" %}
Introduction to the rstatix libarary.
{% endembed %}

Here, we will frequently rely on the rstatix library because it provides tidy tibbles that can be further used for data visualization.

The get\_summary\_stats() function provides all necessary descriptive statistics to be reported in a publication, including mean, median, standard deviation, standard error of the mean, mean absolute deviation, interquartile range, Q1 and Q3, minimum and maximum values, and a 95% confidence interval of the mean. The output from the get\_summary\_stats() can be adjusted via the "type" argument, to be, for example: "mean\_sd" or "median\_iqr" only. We will compute "full" descriptive statistics offered by this package. For computing the descriptive statistics for all features, we can either use a wide tibble or a long tibble (via pivot\_longer()). In both cases, we need to group entries - in the case of the wide table by the `Label` column, and for the long tibble: first by feature (`Lipid` column), and next by biological group (`Label` column):

```r
# Installation of rstatix package:
install.packages("rstatix")

# Calling the libary:
library(rstatix)

# Checking the function of interest in the vignette:
?get_summary_stats()

# Computing descriptive statistics for each biological group via wide tibble:
ds.rstatix <-
  data %>% 
  group_by(Label) %>% 
  get_summary_stats() %>% 
  print(n=400)

# Computing descriptive statistics for each biological group via long tibble:
ds.rstatix <- 
  data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipid', 
               values_to = 'Concentration') %>% 
  group_by(Lipid, Label) %>%
  get_summary_stats(type = 'full')
  
print(ds.rstatix)
```

The output:

<figure><img src="/files/ssi5wXGLYHPv3K543hoE" alt=""><figcaption><p>Descriptive statistics computed using get_summary_stats() function from the rstatix library.</p></figcaption></figure>

## skimr package

The skimr package contains skim() function which is an effective alternative to the summary() function. summary() is a base R function that allows summarizing a vector, data frame, output from ANOVA calculations, or a model, etc.&#x20;

<figure><img src="/files/gvDKWF5KlaQxHXgtAQUy" alt=""><figcaption><p>Comarison: A/ the base R summary() function; B/ skim() from the skimr package.</p></figcaption></figure>

The skim() function can be used for computing descriptive statistics. It quickly provides an overview of data frames, handling all data types, and applying summary functions based on the column types in a data frame subjected to skim. The summary functions can be selected and customized using skim\_with() function. More information about the skimr package can be found here:

{% embed url="<https://docs.ropensci.org/skimr/reference/skim.html>" %}
Introduction to the skimr package.
{% endembed %}

And here:

{% embed url="<https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html>" %}
Using the skimr package (CRAN).
{% endembed %}

First, we will show how to use the skim() function to prepare general overviews:

```r
# Installation of the skimr package:
install.packages("skimr")

# Calling library:
library(skimr)

# Checking the information about skim() in documentation:
?skim()

# Using skim() to compute fast summary statistics for the 'data' tibble:
ds.skim <-
  data %>%
  group_by(Label) %>%
  skim()
  
# Print results:
print(ds.skim)
```

A part of the output from the basic skim() function:

<figure><img src="/files/1XXPOEFzFwHButNuugft" alt=""><figcaption><p>Descriptive statistics (summary statistics) computed via basic skim() function from the skimr package.</p></figcaption></figure>

The skimr package also enables generating a fully customized skim via skim\_with() function. The skim\_with() returns a function we will use instead of the default skim(). We will store it as 'new\_skim()'. First, we must define what data types will be subjected to what skim. In our case, we will define a skim to be performed on numeric columns only. Then, for these columns, using sfl() helper - what statistical parameters should be computed and via what functions. In the sfl() helper, we can use default functions (e.g., mean, median, sd, iqr, ...), functions from other packages (e.g., moments::kurtosis), or functions defined by us in R (e.g., \~range(.)). This way, our skim can be fully customized to compute any descriptive statistics we need. Using 'append' set to TRUE - we can add new parameters from sfl() to default ones computed via skim(), or by setting FALSE - we can create a skim containing only our new parameters from sfl(). The summary of skim\_with function is presented below:

<figure><img src="/files/hbJdYBwo7CXShIYPI8VB" alt=""><figcaption><p>The summary presenting application of the skim_with() function for creating new_skim() function.</p></figcaption></figure>

The application of skim\_with() is presented below:

```r
# Customizing skim - first read about the skim_with() function:
?skim_with()

# Install library 'parameters' to compute skewness and kurtosis:
install.packages("parameters")

# Install library 'moments' to compute sample moment:
install.packages("moments")

# Define an additional function for computing range:
range <- function(x) {max(x)-min(x)}

# Customize your skim:
new_skim <- skim_with(
  numeric = sfl(iqr = IQR, # *
                mad = mad, # *
                Q1 = ~quantile(., 0.25), # *
                median = median, # *
                Q3 = ~quantile(., 0.75), # *
                range = ~range(.), # **
                skewness = parameters::skewness,  # ***
                kurtosis.comp = parameters::kurtosis, # ***
                moment = moments::moment), # ***
  append = FALSE        
) 

# Explanations:
# * 
# iqr, mad, Q1, median, Q3 will be computed using base R functions.
# **
# range will be computed using a function defined in R and stored in the global environment.
# ***
# skewness, kurtosis, moment will be computed using libraries parameters (s,k), and moments (m).

# Apply new_skim() to skim out 'data' tibble:
data %>%
  group_by(Label) %>%
  new_skim()
```

The output:

<figure><img src="/files/tPBYLHO6Ko7c6fKZXCeW" alt=""><figcaption><p>Using customized skim - new_skim() - created via skim_with() function (from skimr() package).</p></figcaption></figure>

## dplyr package

The dplyr package provides a similar function to select(), filer(), mutate(), arrange(), slice(), and group\_by(). It is called summarise(). You should be familiar with all of them at the beginning of the R application for -omics data analysis and visualization. Using summarise(), we can quickly prepare a tibble with descriptive statistics. More about summarise() here:

The statistical parameters are computed based on the functions supplied to summarise() in its arguments. To compute complete descriptive statistics, we will first transform a wide 'data' tibble into a long tibble with 4 columns (`Sample Name`, `Label`, `Lipid`, `Concentration`), then, our data will be grouped by `Lipid` and `Label` (biological group), and finally, we will pipe the data to the summarise() function to compute descriptive statistics using the `Concentration` column. Functions for computing descriptive statistics in summarise() can be R base functions, functions defined and stored in the global environment, or functions obtained from other packages (similarly to skim\_with):

```r
# Calling tidyverse collection:
library(tidyverse)

# Defining an additional function for computing a range of concentrations:
range <- function(x) {max(x)-min(x)}

# The application of summarise() function to compute descriptive statistics:
data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipid', 
               values_to = 'Concentration') %>% 
  group_by(Lipid, Label) %>%
  summarise(mean = mean(Concentration),
            sd = sd(Concentration),
            min = min(Concentration),
            max = max(Concentration),
            range = range(Concentration),
            mad = mad (Concentration), 
            median = median(Concentration),
            iqr = IQR(Concentration),
            Q1 = quantile(Concentration, 0.25),
            Q2 = quantile(Concentration, 0.5),
            Q3 = quantile(Concentration, 0.75),
            skewness = parameters::skewness(Concentration),
            kurtosis.comp = parameters::kurtosis(Concentration),
            moment = moments::moment(Concentration))
```

The output:

<figure><img src="/files/r0duipNm4j1UwAEOEv62" alt=""><figcaption><p>A tibble filled with descriptive statistics computed for each biological group - obtained via summarise() from the dplyr library, base R functions, range() function defined by us in R, skewness() and kurtosis() functions from the parameters library, and moment() function from the moments package.</p></figcaption></figure>

A tibble with descriptive statistics can be exported from R, e.g. as a .csv file:

```r
# Saving the tibble with descriptive statistics in the working directory.
# Storing output from the summarise() function as 'ds.summarise':
data %>% 
  pivot_longer(cols = `CE 16:1`:`SM 42:1;O2`, 
               names_to = 'Lipid', 
               values_to = 'Concentration') %>% 
  group_by(Lipid, Label) %>%
  summarise(mean = mean(Concentration),
            sd = sd(Concentration),
            min = min(Concentration),
            max = max(Concentration),
            range = range(Concentration),
            mad = mad (Concentration), 
            median = median(Concentration),
            iqr = IQR(Concentration),
            Q1 = quantile(Concentration, 0.25),
            Q2 = quantile(Concentration, 0.5),
            Q3 = quantile(Concentration, 0.75),
            skewness = parameters::skewness(Concentration),
            kurtosis.comp = parameters::kurtosis(Concentration),
            moment = moments::moment(Concentration)) -> ds.summarise

# Saving 'ds.summarise' as a .csv file in the working directory:
write.csv(ds.summarise, "Descriptive_statistics_lipidomics_PDAC.csv")
```
