Omics data visualization in R and Python

Data transformation and scaling using mutate()

A part of data transformation & normalization – using available R packages



The mutate() function offers high flexibility in modifying the content of tibbles. We will use it here for:

  1. Log-transformation,

  2. Square-root-transformation of the lipidomics data set.

Moreover, we will show you how to center and scale the data set in the next step. We will present the following scaling methods here:

  1. Autoscaling (also known as Unit-Variance-Scaling or UV-Scaling),

  2. Pareto scaling,

  3. Range scaling,

  4. Vast scaling,

  5. Level scaling.

We again strongly recommend reading the manuscript by Robert A. van den Berg et al., 'Centering, scaling, and transformations: improving the biological information content of metabolomics data' (BMC Genomics).

The manuscript presents data centering, scaling, and transformation for metabolomics data, including theoretical aspects and consequences of these operations. Here, we will rely on this work while preparing the functions, which enable data transformation, centering, and scaling.

We need to load the tidyverse collection to use the mutate() function and pipes:

# Calling library
library(tidyverse)

Logarithmic transformation

Let's begin with the popular logarithmic transformation. Load the complete data set into R again as 'data', check whether the created object is a tibble, and adjust column types if necessary.

The log transformation can be performed in one of the two following ways:

# Log-transformation (with log10) of all numeric columns in the data set:
# Option no. 1 - using mutate():
data.log10.transformed <-
  data %>%
  mutate(across(where(is.numeric), log10))
  
# Explanations:
# Take 'data' tibble from the global environment,
# Push it through the pipe to the mutate() function,
# Mutate using log10() across all columns which return TRUE for is.numeric,
# Store the output as 'data.log10.transformed'.

  
# Option no. 2 - using mutate_if():
data.log10.transformed.2 <-
  data %>%
  mutate_if(is.numeric, log10)
  
# Explanations:
# Take 'data' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using log10() across all columns returning TRUE for is.numeric,
# Store the output as 'data.log10.transformed.2'.

print(data.log10.transformed.2)

Suppose one would like to use a different logarithm base for this transformation. It can be achieved through an easy modification of the code above:

# Log-transformation of all numeric columns in the data
# Version 1: Using the natural logarithm:
data.ln.transformed <-
  data %>%
  mutate(across(where(is.numeric), log))

print(data.ln.transformed)
  
# By default, log() in R uses Euler's number e as the base.
# Euler's number is obtained in R via exp(1).

# Version 2: Using the logarithm with a base of 2:
data.log2.transformed <-
  data %>%
  mutate(across(where(is.numeric), log2))

print(data.log2.transformed)
  
# Version 3: Using the logarithm with a selected base, e.g. 5:
data.log5.transformed <-
  data %>%
  mutate(across(where(is.numeric), ~log(., 5)))
  
# OR 
data.log5.transformed <-
  data %>%
  mutate(across(where(is.numeric), ~log(.x, 5)))
  
# OR
data.log5.transformed <-
  data %>%
  mutate(across(where(is.numeric), ~log(..1, 5)))

print(data.log5.transformed)

And the output:

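For readers following the parallel Python chapters, the same column-wise log-transformations can be sketched with pandas; note that a base-5 logarithm is obtained via the change-of-base formula. The data frame and column names below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame mimicking a lipidomics tibble:
# two metadata columns plus numeric intensity columns.
data = pd.DataFrame({
    "Sample Name": ["S1", "S2", "S3"],
    "Label": ["A", "A", "B"],
    "LPC 16:0": [100.0, 1000.0, 10000.0],
    "PC 34:1": [10.0, 100.0, 1000.0],
})

# Analogue of across(where(is.numeric), ...): select numeric columns only.
num = data.select_dtypes(include="number").columns

data_log10 = data.copy()
data_log10[num] = np.log10(data_log10[num])          # log base 10

data_ln = data.copy()
data_ln[num] = np.log(data_ln[num])                  # natural log, base e

data_log5 = data.copy()
data_log5[num] = np.log(data_log5[num]) / np.log(5)  # change of base: log base 5

print(data_log10)
```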
Square-root-transformation

A simple change to the code above enables the square-root transformation. We can apply the sqrt() function or define a function raising each value to the power 0.5:

# Square-root-transformation of all numeric columns in the data set:
data.sqrt.transformed <-
  data %>%
  mutate(across(where(is.numeric), sqrt))
  
# OR

data.sqrt.transformed.2 <-
  data %>%
  mutate_if(is.numeric, sqrt)
  
# OR
data.sqrt.transformed.3 <-
  data %>%
  mutate_if(is.numeric, ~.^0.5)
  
# OR 

data.sqrt.transformed.4 <-
  data %>%
  mutate_if(is.numeric, ~.x^0.5)
  
# OR

data.sqrt.transformed.5 <-
  data %>%
  mutate_if(is.numeric, ~..1^0.5)

print(data.sqrt.transformed)

All these variants lead to the same output:

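The equivalence between sqrt() and raising to the power 0.5, which underlies the variants above, holds in any language; a quick Python check with hypothetical values:

```python
import math

values = [4.0, 9.0, 2.5]  # hypothetical intensities

# sqrt() and **0.5 are two spellings of the same square-root transformation
via_sqrt = [math.sqrt(v) for v in values]
via_power = [v ** 0.5 for v in values]

print(via_sqrt)
```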
Mean-centering data in R

Centering subtracts the column mean from every entry in that column, so centered columns have a mean equal to 0. It is worth knowing that data centering is built into almost every regular scaling method. Centering alone can easily be performed via the mutate_if() function:

# Centering lipidomics data set:
data.centered <-
  data %>%
  mutate_if(is.numeric, ~.-mean(.))
  
# Explanations:
# (.-mean(.)) or (.x-mean(.x)) or (..1-mean(..1)) is our centering function.
# Take 'data' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the centering function across all columns returning TRUE for is.numeric,
# Store the output as 'data.centered'.

# OR

data.centered <-
  data %>%
  mutate_if(is.numeric, ~.x-mean(.x))
  
# OR

data.centered <-
  data %>%
  mutate_if(is.numeric, ~..1-mean(..1))
  
print(data.centered)

In this way, we obtain the following output:

Now, we can test whether the centering of our data worked correctly. In the 'Missing values handling in R' chapter, we introduced the sapply() function, which applies a function to every column of a tibble and returns a vector. We will now recalculate the mean of every column and round it to 10 decimal places using the following line of code:

# Recheck of the data centering:
centering.recheck <- sapply(data.centered[,-c(1:2)], function(x) round(mean(x),10)) 

# Explanations:
# Apply to every numeric column of the 'data.centered' the function: round(mean(x),10).
# To avoid applying the function to `Sample Name` and `Label`, we removed them.
# Store the obtained vector as 'centering.recheck' in the global environment.

print(centering.recheck)

And the output:

Additional remark: if we did not round the result, we would see that each mean is a very small number that is almost, but not exactly, 0. This is because floating-point arithmetic has limited precision, and tiny numerical errors are normal. These values are so small that rounding to 10 or even 15 decimal places still yields 0.
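This limited-precision behavior is easy to reproduce in any language. A minimal Python sketch (hypothetical values): the mean of a centered column comes out as a tiny floating-point number that rounds to exactly 0, just as in the R recheck above:

```python
# Mean-center a small vector and recompute its mean,
# illustrating floating-point round-off (values are hypothetical).
values = [0.1, 0.2, 0.3, 1.7]
m = sum(values) / len(values)
centered = [v - m for v in values]

residual_mean = sum(centered) / len(centered)
print(residual_mean)             # a tiny number, not necessarily exactly 0.0
print(round(residual_mean, 10))  # rounds to 0.0, as in the R recheck
```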

Data scaling in R

We will again apply the mutate_if() function for the data scaling. Additionally, we will define the scaling functions separately:

# Data scaling functions:
# 1. Autoscaling (UV-scaling):
Autoscaling <- function(x) {(x-mean(x))/sd(x)}

# 2. Pareto scaling:
Pareto.scaling <- function(x) {(x-mean(x))/sqrt(sd(x))}

# 3. Range scaling:
Range.scaling <- function(x) {(x-mean(x))/(max(x)-min(x))}

# 4. Level scaling:
Level.scaling <- function(x) {(x-mean(x))/mean(x)}

# 5. Vast scaling:
Vast.scaling <- function(x) {mean(x)*(x-mean(x))/sd(x)^2}
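As a numerical cross-check of the five formulas above, here is a minimal NumPy sketch applying each one to a single hypothetical vector (ddof=1 matches R's sample-based sd()). After autoscaling, the column has mean 0 and standard deviation 1; after range scaling, its spread is exactly 1:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical feature column

mean, sd = x.mean(), x.std(ddof=1)  # ddof=1: sample sd, like R's sd()

autoscaled    = (x - mean) / sd                   # 1. Autoscaling (UV-scaling)
pareto_scaled = (x - mean) / np.sqrt(sd)          # 2. Pareto scaling
range_scaled  = (x - mean) / (x.max() - x.min())  # 3. Range scaling
level_scaled  = (x - mean) / mean                 # 4. Level scaling
vast_scaled   = mean * (x - mean) / sd**2         # 5. Vast scaling

print(autoscaled.mean(), autoscaled.std(ddof=1))
```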

After executing these lines of code, the scaling functions will appear in the global environment under 'Functions'. We are ready to scale the data set. We will scale the previously log-transformed data:

# Data scaling in R:
# 1. Autoscaling (UV-scaling):
data.UV.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Autoscaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the UV-scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.UV.scaled'.
  
print(data.UV.scaled)

The output:

# Data scaling in R:
# 2. Pareto scaling:
data.Pareto.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Pareto.scaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the Pareto scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.Pareto.scaled'.
  
print(data.Pareto.scaled)

The output:

# Data scaling in R:
# 3. Range scaling:
data.range.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Range.scaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the range scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.range.scaled'.
  
print(data.range.scaled)

The output:

# Data scaling in R:
# 4. Level scaling:
data.level.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Level.scaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the level scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.level.scaled'.
  
print(data.level.scaled)

The output:

# Data scaling in R:
# 5. Vast scaling:
data.vast.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Vast.scaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the vast scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.vast.scaled'.
  
print(data.vast.scaled)

The output:

Reference: van den Berg, R. A., et al. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142 (2006).