Omics data visualization in R and Python

Preferred formats in metabolomics and lipidomics analysis

A part of preparing data for analysis and visualization in OMICs analysis



When lipidomics/metabolomics data are loaded into R, they can be stored as one of two object types: classic data frames or tibbles. Data frames were partially explained in the first subchapter. Here, you will find the rest of the basic information.

IMPORTANT: In tidy data frames, one column represents one variable (a feature, lipid concentration, metabolite concentration, gender, age, tumor grade, smoking status, etc.), and every row represents one observation (one patient for whom all variables are collected in columns). Values are stored in cells.
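
The tidy layout described above can be sketched with a small, made-up table. The column names and values below (patient_id, gender, age, and two lipid concentrations) are hypothetical and serve only as an illustration; they are not taken from the example data set used in this GitBook.

```r
# A minimal sketch of a tidy table built with base R.
# All names and values are invented for illustration.
tidy_example <- data.frame(
  patient_id = c("P01", "P02", "P03"),
  gender     = c("F", "M", "F"),
  age        = c(54, 61, 47),
  lipid_A    = c(12.4, 10.9, 15.2),   # hypothetical lipid concentration
  lipid_B    = c(0.84, 1.10, 0.79),   # hypothetical lipid concentration
  stringsAsFactors = FALSE
)

# Each row is one observation (one patient) ...
nrow(tidy_example)

# ... and each column is one variable
names(tidy_example)

# A single cell holds a single value: lipid_A for the second patient
tidy_example[2, "lipid_A"]
```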

Tibbles are simply modern data frames in R. They retain the features of data frames that have proven useful over time and drop those that are now redundant or irritating. Tibbles are handled in R via the tidyverse collection, more precisely via the tibble package. Both will be introduced here. All functionalities of tibbles are summarized on the CRAN repository and the tidyverse project website. You will also find a simple comparison of tibbles and data frames with examples in the links below:

As tibbles are more user-friendly than data frames and offer important new functionalities, we will use them whenever possible in this GitBook. However, some plotting libraries still do not accept tibbles and require data frames.

To use tibbles, you first need to install the tidyverse collection. Remember that each package is installed only once, but the library must be loaded again every time you start a new R session (e.g., after restarting RStudio). We will also install the tidymodels collection in the same line of code. Here are the commands we can use:

# Install tidyverse and tidymodels packages at once
install.packages(c("tidyverse", "tidymodels"))

# or create an object containing the name of packages
packages <- c("tidyverse", "tidymodels")
install.packages(packages)

# Install tidyverse and tidymodels one by one if you don't feel confident enough
# First install tidyverse
install.packages("tidyverse")

# Next, install tidymodels
install.packages("tidymodels")

# Call tidyverse library
library(tidyverse)
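
Because installation is needed only once, it can be convenient to check which packages are already present before calling install.packages(). The helper below, find_missing_packages(), is a hypothetical name introduced for this sketch; it relies only on base R's requireNamespace() and is one possible approach, not a required step.

```r
# Return the names of packages that are not yet installed.
# 'find_missing_packages' is a helper made up for this example.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

packages <- c("tidyverse", "tidymodels")
to_install <- find_missing_packages(packages)

# Show what is missing; an empty result means everything is installed
print(to_install)

# Uncomment to install only the missing packages:
# if (length(to_install) > 0) install.packages(to_install)
```

The install.packages() call is left commented out so that nothing is downloaded unless you decide to run it.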

Now we can test whether the object created by the read_xlsx() function while reading data into R is a classic data frame or a tibble. The tibble package provides the function is_tibble(). It returns TRUE if the object is a tibble and FALSE for classic data frames and all other objects.

# Testing if an object is a classic data frame or tibble
is_tibble(data)

As the function returns TRUE, read_xlsx() created a tibble. We can print it. Type:

# Printing tibble (with default 10 rows)
print(data)

# or printing tibble with a selected number of rows
print(data, n = 20)

The console then shows the following output (figure: Output of print() for a 'data' tibble):

Now, let's convert the 'data' object into a classic data frame and apply the is_tibble() function:

# Changing object type from tibble into data frame
data <- as.data.frame(data)

# Checking if 'data' is still a tibble
is_tibble(data)

Now, the is_tibble() function returns FALSE, as 'data' is no longer a tibble. We can print it using:

# Print data frame
print(data)

A classic data frame can be turned back into a tibble using the as_tibble() function:

# Changing data frame into a tibble
data <- as_tibble(data)

# Testing if the 'data' object is the tibble again
is_tibble(data)
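
The round trip can also be tried without the example spreadsheet. The sketch below uses a small made-up tibble (the object name demo and its columns are assumptions for illustration) and inspects class() alongside is_tibble(). It requires the tibble package, which is installed together with tidyverse.

```r
library(tibble)

# A small made-up tibble; names and values are for illustration only
demo <- tibble(
  sample_id = c("S1", "S2", "S3"),
  lipid_A   = c(12.4, 10.9, 15.2)
)

is_tibble(demo)      # TRUE
class(demo)          # a tibble also inherits from "data.frame"

# Tibble -> classic data frame
demo_df <- as.data.frame(demo)
is_tibble(demo_df)   # FALSE
class(demo_df)

# Classic data frame -> tibble
demo_tbl <- as_tibble(demo_df)
is_tibble(demo_tbl)  # TRUE
```

Because a tibble inherits from the data.frame class, is.data.frame() returns TRUE for both object types, so is_tibble() is the function that tells them apart.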

As the examples above show, a printed tibble is neatly arranged, which makes data inspection simpler than with a classic data frame. The printout also reports the object type and its size, as well as the data type of each column (right below the column name). This information is missing when a data frame is printed, so additional functions have to be used in this case, e.g., str():

# Checking the data type of all columns in a data frame.
# The argument list.len sets the maximum number of variables to display.
# As we have 129 columns (variables) overall, we want to check them all.
str(data, list.len = 129)

Alternatively, the glimpse() function from the pillar package, which is re-exported by dplyr, can be used:

# Checking data type for every column via glimpse() from pillar package
glimpse(data)
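
If only the column types are needed, they can also be collected into a named vector. This is a sketch on a made-up tibble (demo and its columns are assumptions for illustration); with the real 'data' object, the same call would return one entry per column.

```r
library(tibble)

# A small made-up tibble with three different column types
demo <- tibble(
  sample_id = c("S1", "S2", "S3"),
  group     = factor(c("control", "case", "case")),
  lipid_A   = c(12.4, 10.9, 15.2)
)

# First class of every column, returned as a named character vector
col_types <- vapply(demo, function(column) class(column)[1], character(1))
print(col_types)

# Count how many columns there are of each type
table(col_types)
```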

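The behavior of tibbles, often summarized as 'laziness' (they do less implicit work, e.g., they do not change column names or types and do not partially match column names) and 'certainty' (they complain earlier, e.g., when a non-existent column is accessed), makes them a safer and more suitable solution for beginners than the classic data frames used previously in R.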
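
Two differences that are commonly cited when comparing tibbles with classic data frames are partial name matching with $ and the result of selecting a single column with [ ]. The sketch below illustrates both on made-up data; the object and column names are assumptions for illustration.

```r
library(tibble)

df <- data.frame(concentration = c(1.2, 3.4, 5.6))
tb <- as_tibble(df)

# 1) Partial matching of column names with $
df$conc                       # data frame: silently matches 'concentration'
suppressWarnings(tb$conc)     # tibble: NULL (a warning is raised if not suppressed)

# 2) Selecting one column with [ ]
class(df[, "concentration"])  # data frame: simplified to a numeric vector
class(tb[, "concentration"])  # tibble: still a tibble

# To extract a vector from a tibble, use [[ ]] or $ with the full name
tb[["concentration"]]
```

In practice, a mistyped or abbreviated column name is reported early with a tibble instead of silently returning a different column.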

The new information for you here is probably the so-called 'data type stored in columns', which our tibble indicates in the form of <chr> or <dbl>. The next subchapters will briefly present the data types in R.

The script containing all commands from this subchapter can be found here:

In summary, the tibble package provides tibbles: a format that gives a good overview of the data structure and is easily processed by functions for hypothesis testing, visualization, and machine learning. These features make tibbles particularly useful for metabolomics and lipidomics analysis.

Tibbles: Information about tibbles - the CRAN repository.
Simple Data Frames: Information about tibbles and comparison of tibbles and classic data frames - the tidyverse project website.
Data frames vs tibbles.R (994B): Script containing all commands from this subchapter.
Figure: Output of print() for a 'data' tibble.
Figure: Output of print() for 'data' data frame.