Computing descriptive statistics in R
Metabolites and lipids descriptive statistical analysis in R
Last updated
Metabolites and lipids descriptive statistical analysis in R
Last updated
R offers a variety of libraries for computing descriptive statistics like central tendency measures (mean, median, mode, ...) and uncertainty or spread measures (standard deviation, standard error of the mean, interquartile range, ...). Here, we will present six selected packages that provide a wide range of descriptive statistics parameters via simple code. We advise you to analyze all solutions and select the most suitable one to begin with.
Important: Certain parameters (e.g.: skewness and kurtosis), can be defined in a different way across different packages. Always carefully read the function documentation and vignette to understand in what way the summary statistics you are interested in are calculated! It will be necessary to explain these computations in the materials and methods of your paper, in the section concerning statistical analysis. Additionally, be sure that you also understand the meaning of the value obtained in the computations! Find additional information or check references suggested by authors of a package if the description of computations in documentation is too brief for you.
For all basic computations, we rely on the PDAC data set (called in R data
), which can be downloaded from the Introduction section (subchapter "Example data sets") or from the link below:
The dlookr package was discontinued on CRAN. However, it can still be downloaded from GitHub:
It can be applied for diagnosing data quality, imputation of missing entries, data transformation, and exploratory data analysis. The package overview is available here:
The descriptive statistics can be computed using the describe() function:
This way, we use our wide tibble 'data' for computing descriptive statistics and we obtain this output:
As you see, the simple function can provide you with the total number of samples (n), the number of missing values for every feature (na), mean, median (as p50), standard deviation (sd), standard error of the mean (se_mean), interquartile range (IQR), measures of distribution asymmetry like skewness or how heavy are distribution tails ('tailedness') like kurtosis, and quantiles/percentages. The describe() function allows for modifying the content of the output tibble via statistics and quantiles:
Results in:
However, all parameters we computed characterize features without considering biological groups. Usually, -omics scientists are more interested in descriptive statistics computed for each biological group separately. According to the describe() function documentation, we can also introduce a grouped tibble for computations (using group_by() function from the tidyverse collection). It results in an output tibble presenting descriptive statistics for each biological group:
The function computes now all parameters for each biological group separately:
The psych package was primarily developed for psychology-related sciences. It contains useful statistical tools, which can be applied more broadly than only to data collected within psychological studies. More information about the package can be found here:
The psych package contains the function for computing descriptive statistics called the same as the dlookr package - describe():
The output:
Among parameters obtained via describe() function from the psych package are: the total number of observations (n), mean, standard deviation (sd), median, trimmed mean (mean calculated after removing a part of observations at the lower and upper end of the distribution), mean absolute deviation (mad), which is a measure of variability and represents the distance between all observations and their mean, minimum and maximum values (min, max), and their difference (range), skewness, kurtosis, and standard error (se) - which is a ratio of standard deviation and the square root of the total number of observations.
The describeBy() allows computing descriptive statistics for each biological group separately:
By specifying that our grouping variable is Label
(group = data$Label), three separate data frames are computed for N, PAN, and T:
The summarytools is a simple package providing functions for data exploration and reporting. It can be used for a fast recheck of descriptive statistics for a selected feature. For more information, check the "Introduction to summarytools" on the CRAN:
For computing descriptive statistics, the descr() function:
The output:
The rstatix package is an excellent and complete solution for a brief data inspection and computing descriptive and univariate statistics of -omics data. The library provides functions fully compatible with the tidyverse collection. To learn more about the rstatix package visit:
Here, we will frequently rely on the rstatix library because it provides tidy tibbles that can be further used for data visualization.
The get_summary_stats() function provides all necessary descriptive statistics to be reported in a publication, including mean, median, standard deviation, standard error of the mean, mean absolute deviation, interquartile range, Q1 and Q3, minimum and maximum values, and a 95% confidence interval of the mean. The output from the get_summary_stats() can be adjusted via the "type" argument, to be, for example: "mean_sd" or "median_iqr" only. We will compute "full" descriptive statistics offered by this package. For computing the descriptive statistics for all features, we can either use a wide tibble or a long tibble (via pivot_longer()). In both cases, we need to group entries - in the case of the wide table by the Label
column, and for the long tibble: first by feature (Lipid
column), and next by biological group (Label
column):
The output:
The skimr package contains skim() function which is an effective alternative to the summary() function. summary() is a base R function that allows summarizing a vector, data frame, output from ANOVA calculations, or a model, etc.
The skim() function can be used for computing descriptive statistics. It quickly provides an overview of data frames, handling all data types, and applying summary functions based on the column types in a data frame subjected to skim. The summary functions can be selected and customized using skim_with() function. More information about the skimr package can be found here:
And here:
First, we will show how to use the skim() function to prepare general overviews:
A part of the output from the basic skim() function:
The skimr package also enables generating a fully customized skim via skim_with() function. The skim_with() returns a function we will use instead of the default skim(). We will store it as 'new_skim()'. First, we must define what data types will be subjected to what skim. In our case, we will define a skim to be performed on numeric columns only. Then, for these columns, using sfl() helper - what statistical parameters should be computed and via what functions. In the sfl() helper, we can use default functions (e.g., mean, median, sd, iqr, ...), functions from other packages (e.g., moments::kurtosis), or functions defined by us in R (e.g., ~range(.)). This way, our skim can be fully customized to compute any descriptive statistics we need. Using 'append' set to TRUE - we can add new parameters from sfl() to default ones computed via skim(), or by setting FALSE - we can create a skim containing only our new parameters from sfl(). The summary of skim_with function is presented below:
The application of skim_with() is presented below:
The output:
The dplyr package provides a similar function to select(), filer(), mutate(), arrange(), slice(), and group_by(). It is called summarise(). You should be familiar with all of them at the beginning of the R application for -omics data analysis and visualization. Using summarise(), we can quickly prepare a tibble with descriptive statistics. More about summarise() here:
The statistical parameters are computed based on the functions supplied to summarise() in its arguments. To compute complete descriptive statistics, we will first transform a wide 'data' tibble into a long tibble with 4 columns (Sample Name
, Label
, Lipid
, Concentration
), then, our data will be grouped by Lipid
and Label
(biological group), and finally, we will pipe the data to the summarise() function to compute descriptive statistics using the Concentration
column. Functions for computing descriptive statistics in summarise() can be R base functions, functions defined and stored in the global environment, or functions obtained from other packages (similarly to skim_with):
The output:
A tibble with descriptive statistics can be exported from R, e.g. as a .csv file:
data
.