Omics data visualization in R and Python

Data transformation and scaling using mutate()

A part of data transformation & normalization – using available R packages



The mutate() function offers high flexibility in modifying the content of tibbles. We will use it here for:

  1. Log-transformation,

  2. Square-root-transformation of the lipidomics data set.

Moreover, we will show you how to center and scale the data set in the next step. We will present the following scaling methods here:

  1. Autoscaling (also known as Unit-Variance-Scaling or UV-Scaling),

  2. Pareto scaling,

  3. Range scaling,

  4. Vast scaling,

  5. Level scaling.

We again strongly recommend reading the manuscript by Robert A. van den Berg et al., 'Centering, scaling, and transformations: improving the biological information content of metabolomics data' (BMC Genomics).

The manuscript presents data centering, scaling, and transformation for metabolomics data, including theoretical aspects and consequences of these operations. Here, we will rely on this work while preparing the functions, which enable data transformation, centering, and scaling.

We need to load the tidyverse collection to use the mutate() function and pipes:

# Calling library
library(tidyverse)

Logarithmic transformation

Let's begin with the popular logarithmic transformation. Load the complete data set into R again as 'data', check whether the created object is a tibble, and adjust column types if necessary.

The log transformation can be performed in one of the two following ways:

# Log-transformation (with log10) of all numeric columns in the data set:
# Option no. 1 - using mutate():
data.log10.transformed <-
  data %>%
  mutate(across(where(is.numeric), log10))
  
# Explanations:
# Take 'data' tibble from the global environment,
# Push it through the pipe to the mutate() function,
# Mutate using log10() across all columns which return TRUE for is.numeric,
# Store the output as 'data.log10.transformed'.

  
# Option no. 2 - using mutate_if():
data.log10.transformed.2 <-
  data %>%
  mutate_if(is.numeric, log10)
  
# Explanations:
# Take 'data' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using log10() across all columns returning TRUE for is.numeric,
# Store the output as 'data.log10.transformed.2'.

print(data.log10.transformed.2)

Suppose one would like to use a different logarithm base for this transformation. It can be achieved through an easy modification of the code above:

# Log-transformation of all numeric columns in the data
# Version 1: Using the natural logarithm:
data.ln.transformed <-
  data %>%
  mutate(across(where(is.numeric), log))

print(data.ln.transformed)
  
# By default, log() in R uses Euler's number e as the base.
# Euler's number is obtained in R via exp(1).

# Version 2: Using the logarithm with a base of 2:
data.log2.transformed <-
  data %>%
  mutate(across(where(is.numeric), log2))

print(data.log2.transformed)
  
# Version 3: Using the logarithm with a selected base, e.g. 5:
data.log5.transformed <-
  data %>%
  mutate(across(where(is.numeric), ~log(., 5)))
  
# OR 
data.log5.transformed <-
  data %>%
  mutate(across(where(is.numeric), ~log(.x, 5)))
  
# OR
data.log5.transformed <-
  data %>%
  mutate(across(where(is.numeric), ~log(..1, 5)))

print(data.log5.transformed)

And the output:

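For readers following the parallel Python chapters, the same column-wise log-transformations can be sketched with pandas; note that a base-5 logarithm is obtained via the change-of-base formula. The data frame and column names below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame mimicking a lipidomics tibble:
# two metadata columns plus numeric intensity columns.
data = pd.DataFrame({
    "Sample Name": ["S1", "S2", "S3"],
    "Label": ["A", "A", "B"],
    "LPC 16:0": [100.0, 1000.0, 10000.0],
    "PC 34:1": [10.0, 100.0, 1000.0],
})

# Analogue of across(where(is.numeric), ...): select numeric columns only.
num = data.select_dtypes(include="number").columns

data_log10 = data.copy()
data_log10[num] = np.log10(data_log10[num])          # log base 10

data_ln = data.copy()
data_ln[num] = np.log(data_ln[num])                  # natural log, base e

data_log5 = data.copy()
data_log5[num] = np.log(data_log5[num]) / np.log(5)  # change of base: log base 5

print(data_log10)
```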
Square-root-transformation

A simple change to the code above enables the square-root transformation. We can apply the sqrt() function or define a function raising each value to the power 0.5:

# Square-root-transformation of all numeric columns in the data set:
data.sqrt.transformed <-
  data %>%
  mutate(across(where(is.numeric), sqrt))
  
# OR

data.sqrt.transformed.2 <-
  data %>%
  mutate_if(is.numeric, sqrt)
  
# OR
data.sqrt.transformed.3 <-
  data %>%
  mutate_if(is.numeric, ~.^0.5)
  
# OR 

data.sqrt.transformed.4 <-
  data %>%
  mutate_if(is.numeric, ~.x^0.5)
  
# OR

data.sqrt.transformed.5 <-
  data %>%
  mutate_if(is.numeric, ~..1^0.5)

print(data.sqrt.transformed)

All these variants lead to the same output:

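The equivalence between sqrt() and raising to the power 0.5, which underlies the variants above, holds in any language; a quick Python check with hypothetical values:

```python
import math

values = [4.0, 9.0, 2.5]  # hypothetical intensities

# sqrt() and **0.5 are two spellings of the same square-root transformation
via_sqrt = [math.sqrt(v) for v in values]
via_power = [v ** 0.5 for v in values]

print(via_sqrt)
```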
Mean-centering data in R

Centering subtracts the column mean from every entry in that column, so centered columns have a mean equal to 0. It is worth knowing that data centering is built into almost every regular scaling method. Centering alone can easily be performed via the mutate_if() function:

# Centering lipidomics data set:
data.centered <-
  data %>%
  mutate_if(is.numeric, ~.-mean(.))
  
# Explanations:
# (.-mean(.)) or (.x-mean(.x)) or (..1-mean(..1)) is our centering function.
# Take 'data' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the centering function across all columns returning TRUE for is.numeric,
# Store the output as 'data.centered'.

# OR

data.centered <-
  data %>%
  mutate_if(is.numeric, ~.x-mean(.x))
  
# OR

data.centered <-
  data %>%
  mutate_if(is.numeric, ~..1-mean(..1))
  
print(data.centered)

In this way, we obtain the following output:

Now, we can test whether the centering of our data worked correctly. In the 'Missing values handling in R' chapter, we introduced the sapply() function, which applies a function to every column of a tibble and returns a vector. We will now recalculate the mean of every column and round it to 10 decimal places using the following line of code:

# Recheck of the data centering:
centering.recheck <- sapply(data.centered[,-c(1:2)], function(x) round(mean(x),10)) 

# Explanations:
# Apply to every numeric column of the 'data.centered' the function: round(mean(x),10).
# To avoid applying the function to `Sample Name` and `Label`, we removed them.
# Store the obtained vector as 'centering.recheck' in the global environment.

print(centering.recheck)

And the output:

Additional remark: if we did not round the result, we would see that each mean is a very small number that is almost, but not exactly, 0. This is because floating-point arithmetic has limited precision, and tiny numerical errors are normal. These values are so small that rounding to 10 or even 15 decimal places still yields 0.
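This limited-precision behavior is easy to reproduce in any language. A minimal Python sketch (hypothetical values): the mean of a centered column comes out as a tiny floating-point number that rounds to exactly 0, just as in the R recheck above:

```python
# Mean-center a small vector and recompute its mean,
# illustrating floating-point round-off (values are hypothetical).
values = [0.1, 0.2, 0.3, 1.7]
m = sum(values) / len(values)
centered = [v - m for v in values]

residual_mean = sum(centered) / len(centered)
print(residual_mean)             # a tiny number, not necessarily exactly 0.0
print(round(residual_mean, 10))  # rounds to 0.0, as in the R recheck
```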

Data scaling in R

We will again apply the mutate_if() function for the data scaling. Additionally, we will define the scaling functions separately:

# Data scaling functions:
# 1. Autoscaling (UV-scaling):
Autoscaling <- function(x) {(x-mean(x))/sd(x)}

# 2. Pareto scaling:
Pareto.scaling <- function(x) {(x-mean(x))/sqrt(sd(x))}

# 3. Range scaling:
Range.scaling <- function(x) {(x-mean(x))/(max(x)-min(x))}

# 4. Level scaling:
Level.scaling <- function(x) {(x-mean(x))/mean(x)}

# 5. Vast scaling:
Vast.scaling <- function(x) {mean(x)*(x-mean(x))/sd(x)^2}
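As a numerical cross-check of the five formulas above, here is a minimal NumPy sketch applying each one to a single hypothetical vector (ddof=1 matches R's sample-based sd()). After autoscaling, the column has mean 0 and standard deviation 1; after range scaling, its spread is exactly 1:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical feature column

mean, sd = x.mean(), x.std(ddof=1)  # ddof=1: sample sd, like R's sd()

autoscaled    = (x - mean) / sd                   # 1. Autoscaling (UV-scaling)
pareto_scaled = (x - mean) / np.sqrt(sd)          # 2. Pareto scaling
range_scaled  = (x - mean) / (x.max() - x.min())  # 3. Range scaling
level_scaled  = (x - mean) / mean                 # 4. Level scaling
vast_scaled   = mean * (x - mean) / sd**2         # 5. Vast scaling

print(autoscaled.mean(), autoscaled.std(ddof=1))
```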

After executing these lines of code, the scaling functions will appear in the global environment under 'Functions'. We are ready to scale the data set. We will scale the previously log-transformed data:

# Data scaling in R:
# 1. Autoscaling (UV-scaling):
data.UV.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Autoscaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the UV-scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.UV.scaled'.
  
print(data.UV.scaled)

The output:

# Data scaling in R:
# 2. Pareto scaling:
data.Pareto.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Pareto.scaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the Pareto scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.Pareto.scaled'.
  
print(data.Pareto.scaled)

The output:

# Data scaling in R:
# 3. Range scaling:
data.range.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Range.scaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the range scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.range.scaled'.
  
print(data.range.scaled)

The output:

# Data scaling in R:
# 4. Level scaling:
data.level.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Level.scaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the level scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.level.scaled'.
  
print(data.level.scaled)

The output:

# Data scaling in R:
# 5. Vast scaling:
data.vast.scaled <-
  data.log10.transformed %>%
  mutate_if(is.numeric, ~Vast.scaling(.))
  
# Explanations:
# Take 'data.log10.transformed' tibble from the global environment,
# Push it through the pipe to the mutate_if() function,
# Mutate using the vast scaling function across all columns returning TRUE for is.numeric,
# Store the output as 'data.vast.scaled'.
  
print(data.vast.scaled)

The output:

Reference: van den Berg, R. A., et al. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142 (2006).