Omics data visualization in R and Python

Preprocess data type using Tidyverse package

A part of preparing data for analysis and visualization in OMICs analysis


Last updated 4 months ago

As mentioned in 'Loading data into R', the read_xlsx() function has the argument 'col_types', which is set to NULL by default, so the data type of every column is guessed. Before performing any analysis or visualization, it is necessary to check the data type of every column and adjust it where needed: sample names should be characters (<chr>), grouping variables should be factors (<fct>), and concentrations should be numeric variables (<dbl> or <num>).
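If the column types are known in advance, they can also be declared at import instead of being guessed. The following is a minimal sketch, assuming the readxl package is installed. It uses the example workbook that ships with readxl as a stand-in (not the lipidomics data set used in this GitBook), so the sheet and column names are those of that example file.

```r
library(readxl)

# Path to an example workbook bundled with readxl (stand-in for your own file)
path <- readxl_example("datasets.xlsx")

# Declare one type per column instead of relying on guessing:
# four numeric columns followed by one text column
iris_xl <- read_xlsx(
  path,
  sheet     = "iris",
  col_types = c("numeric", "numeric", "numeric", "numeric", "text")
)

# read_xlsx() has no "factor" column type, so a grouping column
# imported as text still has to be converted afterwards
iris_xl$Species <- as.factor(iris_xl$Species)

# Column classes after import and conversion
sapply(iris_xl, class)
```

Declaring the types up front makes the import reproducible, but the conversion of grouping variables to factors remains a separate step.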

For a single column, the data type can be examined using the following set of base R functions:

# Checking data type in a single column
is.character(data$`Sample Name`)

# TRUE means the column is a character vector; FALSE means it has a different type

# Other examples:
is.factor(data$`Label`)
is.integer(data$`CE 16:0`)
is.numeric(data$`SM 41:1;O2`)
is.logical(data$`SM 41:1;O2`)
# etc.
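The is.*() checks above test one column and one type at a time. As a base R sketch of the same idea applied to all columns, the example below builds a small toy data frame (the values and the group labels "N" and "T" are hypothetical and only imitate the structure of the 'data' tibble) and reports the class of every column.

```r
# Toy data frame with hypothetical values; check.names = FALSE keeps
# column names with spaces, colons, and semicolons unchanged
toy <- data.frame(
  `Sample Name` = c("S1", "S2", "S3"),
  Label         = c("N", "T", "N"),
  `CE 16:0`     = c(101.2, 98.7, 110.4),
  `SM 41:1;O2`  = c(5.1, 4.8, 6.0),
  check.names      = FALSE,
  stringsAsFactors = FALSE
)

# class() of every column at once, returned as a named character vector
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns that are still character vectors
char_cols <- names(toy)[sapply(toy, is.character)]
print(char_cols)
```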

The glimpse() function from the pillar package (part of the tidyverse collection and available after loading tidyverse) checks all column types in your lipidomics or metabolomics data at once. We already mentioned it in the 'Data frame or tibble?' subchapter. Take a look at the code below:

# Calling library
library(tidyverse)

# Inspection of data type for every column of our tibble:
glimpse(data)

# Alternatively:
# 1) you can run: str(data, list.len = 129)  # list.len sets the maximum number of columns shown
# 2) if you work with tibble - print(tibble_name), e.g. print(data)

The function generates the following output:

In the output, a black frame highlights the column types. The first column, Sample Name, contains strings of letters and numbers and was correctly recognized as a character vector <chr>. The Label column (red frame) contains the grouping variable, which was guessed as a character vector, so its type has to be changed to factor <fct>. All lipid concentrations were correctly recognized as numeric vectors <dbl>.

To adjust the data type stored in a column, we will initially use the base R functions as.character(), as.factor(), as.numeric(), as.integer(), and as.logical(). Later, we will also apply the mutate() function from the dplyr package (see chapter Useful R tricks and features in OMICs mining - Data wrangling syntaxes). Changing the variable type also requires introducing a new symbol, the dollar sign ($). With $, it is possible to access, add, delete, and update elements of a list or columns of a data frame. Changing the column type from character to factor can be achieved using the following line of code:

# Changing character into a factor (no library needed - base R function)
data$`Label` <- as.factor(data$`Label`)

In this line of code, we accessed the Label column and converted it into a factor column. The ` characters surrounding the column name are called backticks. In the case of the Label column, the backticks are not necessary. However, every column name that contains spaces, colons, semicolons, or other characters that are not allowed in R names has to be wrapped in backticks. Because the updated shorthand notation of lipids uses such characters, backticks are necessary whenever referring to columns containing lipid concentrations.
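The uses of $ and backticks described above can be sketched on a small toy data frame. The values, the group labels, and the derived column name `log10 CE 16:0` are hypothetical and serve only as an illustration.

```r
# Toy data frame with hypothetical values
toy <- data.frame(
  `Sample Name` = c("S1", "S2", "S3"),
  Label         = c("N", "T", "N"),
  `CE 16:0`     = c(101.2, 98.7, 110.4),
  check.names   = FALSE
)

# Access: backticks are required because the name contains a space and a colon
ce <- toy$`CE 16:0`

# Equivalent access with [[ ]] and a quoted string (no backticks needed)
ce_alt <- toy[["CE 16:0"]]
identical(ce, ce_alt)

# Add: assigning to a name that does not exist yet creates a new column
toy$`log10 CE 16:0` <- log10(toy$`CE 16:0`)

# Update: overwrite an existing column with a converted version
toy$Label <- as.factor(toy$Label)

# Delete: assigning NULL removes the column
toy$`log10 CE 16:0` <- NULL

names(toy)
```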

More examples will be shown in the next chapters of the GitBook, but other columns are handled in the same way, e.g.:

# Changing `CE 16:0` data type from numeric to integer:
data$`CE 16:0` <- as.integer(data$`CE 16:0`)

# Changing `Label` column back into the character vector
data$`Label` <- as.character(data$`Label`)

# etc. 
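Type conversions can silently change values, so it is worth checking the result after each one. The base R sketch below illustrates three common cases on hypothetical vectors (the "<LOD" entry is an invented example of a non-numeric cell).

```r
# as.integer() drops the decimal part (truncation, not rounding)
conc <- c(101.9, 98.2, 110.5)
conc_int <- as.integer(conc)
conc_int

# Numbers stored as text convert cleanly, but non-numeric entries become NA
# (R also issues a warning, which is suppressed here)
raw <- c("12.5", "8.1", "<LOD")
num <- suppressWarnings(as.numeric(raw))
num

# A factor with number-like labels must go through as.character() first;
# as.numeric() applied to the factor itself returns the internal level codes
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)
values <- as.numeric(as.character(f))
codes
values
```

For concentration data this means as.integer() should be used only when the loss of decimals is intended.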

After executing the as.factor() line shown earlier in this subchapter (data$`Label` <- as.factor(data$`Label`)), the Label column type changes from character to factor:

tidyverse website: https://www.tidyverse.org/
Figure: The output of the glimpse() function for the 'data' tibble. All column types are in the black frame; the Label column is highlighted in the red frame.
Figure: The 'data' tibble with the updated column type for Label.
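The same conversion can be written with mutate() from dplyr, which this subchapter mentions as the approach used later in the GitBook. The following is a minimal sketch, assuming dplyr is installed. The toy data frame and the group labels "N" and "T" are hypothetical.

```r
library(dplyr)

# Toy data frame with hypothetical values
toy <- data.frame(
  `Sample Name` = c("S1", "S2", "S3", "S4"),
  Label         = c("N", "T", "N", "T"),
  `CE 16:0`     = c(101.2, 98.7, 110.4, 95.3),
  check.names      = FALSE,
  stringsAsFactors = FALSE
)

# Convert the grouping column to a factor; all other columns are left untouched
toy <- mutate(toy, Label = as.factor(Label))

# Confirm the new type and inspect the factor levels
class(toy$Label)
levels(toy$Label)
```

By default, as.factor() orders the levels alphabetically, so it is worth checking levels() before running group comparisons.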