Data transformation and scaling using mutate()

A part of data transformation & normalization – using available R packages

The mutate() function offers high flexibility in modifying the content of tibbles. We will use it here for:

  1. Log-transformation,

  2. And square-root-transformation of the lipidomics data set.

Moreover, we will show you how to center and scale the data set in the next step. We will present the following scaling methods here:

  1. Autoscaling (also known as Unit-Variance-Scaling or UV-Scaling),

  2. Pareto scaling,

  3. Range scaling,

  4. Vast scaling,

  5. Level scaling.

We again strongly recommend reading the following manuscript by Robert A. van den Berg et al.:

The manuscript presents data centering, scaling, and transformation for metabolomics data, including theoretical aspects and consequences of these operations. Here, we will rely on this work while preparing the functions, which enable data transformation, centering, and scaling.

We need to load the tidyverse collection to use the mutate() function and pipes:
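The call itself is a single line. A minimal sketch, assuming the tidyverse package is already installed:

```r
# Load the tidyverse collection; its dplyr package provides mutate(),
# mutate_if(), and the pipe operator %>%
library(tidyverse)
```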

Logarithmic transformation

Let's begin with the popular logarithmic transformation. Load the complete data set into R again as 'data', check that the created object is a tibble, and adjust the column types if necessary.
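The preparation step could look roughly like the sketch below. The lipidomics file is not shown here, so a small made-up tibble stands in for it; the column names (sample, group, lipid_a, lipid_b) and all values are hypothetical and serve only as an illustration.

```r
library(tidyverse)

# Hypothetical stand-in for the lipidomics data set (made-up values).
# With a real file, this step would typically be a call such as read_csv().
data <- tibble(
  sample  = c("S1", "S2", "S3", "S4"),
  group   = c("N", "N", "T", "T"),
  lipid_a = c(1, 10, 100, 1000),
  lipid_b = c(4, 9, 16, 25)
)

# Check that the created object is a tibble
is_tibble(data)

# Adjust column types if necessary, e.g. store the group label as a factor
data <- data %>%
  mutate(group = as.factor(group))

# Inspect the column types
glimpse(data)
```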

The log transformation can be performed in one of the two following ways:
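Both variants can be sketched as follows, again on a small made-up tibble with hypothetical column names. Note that in recent dplyr versions mutate_if() is superseded by mutate() combined with across(), but it still works.

```r
library(tidyverse)

# Hypothetical toy data (made-up values, not the lipidomics data set)
data <- tibble(
  group   = factor(c("N", "N", "T", "T")),
  lipid_a = c(1, 10, 100, 1000),
  lipid_b = c(4, 9, 16, 25)
)

# Way 1: mutate() with across(), selecting the numeric columns
data_log10_a <- data %>%
  mutate(across(where(is.numeric), log10))

# Way 2: mutate_if() applies log10() to every column for which
# is.numeric() returns TRUE
data_log10_b <- data %>%
  mutate_if(is.numeric, log10)

data_log10_b
```

In both cases the non-numeric group column is left untouched.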

Log10-transformation of the 'data' tibble using the mutate() or mutate_if() function.

Suppose one would like to use a different logarithm base for this transformation. It can be achieved through an easy modification of the code above:
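A sketch of this modification for the natural logarithm, base 2, and base 5 is given below. The toy tibble and the object names are assumptions made for illustration.

```r
library(tidyverse)

# Hypothetical toy data (made-up values)
data <- tibble(
  group   = factor(c("N", "N", "T", "T")),
  lipid_a = c(1, 5, 25, 125),
  lipid_b = c(2, 4, 8, 16)
)

# Natural logarithm (base e): log() uses base e by default
data_ln <- data %>%
  mutate_if(is.numeric, log)

# Base 2: either log2() or log() with base = 2
data_log2 <- data %>%
  mutate_if(is.numeric, log2)

# Any other base, here 5, through the base argument of log()
data_log5 <- data %>%
  mutate_if(is.numeric, ~ log(., base = 5))
```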

And the output:

Transformation of data using different logarithm base: Euler's number, 2, and 5.

Square-root-transformation

A small change to the code above performs the square-root transformation. We can either apply the sqrt() function or define a function that computes x^0.5:
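The equivalent variants might look like this; the toy tibble and object names are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data (made-up values)
data <- tibble(
  group   = factor(c("N", "N", "T", "T")),
  lipid_a = c(1, 10, 100, 1000),
  lipid_b = c(4, 9, 16, 25)
)

# Variant 1: pass sqrt() directly
data_sqrt_a <- data %>% mutate_if(is.numeric, sqrt)

# Variant 2: formula shorthand calling sqrt()
data_sqrt_b <- data %>% mutate_if(is.numeric, ~ sqrt(.))

# Variant 3: formula shorthand raising to the power 0.5
data_sqrt_c <- data %>% mutate_if(is.numeric, ~ .^0.5)

# Variant 4: an explicit anonymous function
data_sqrt_d <- data %>% mutate_if(is.numeric, function(x) x^0.5)
```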

All these lines lead to one output:

Square-root transformation via mutate_if() and sqrt().

Mean-centering data in R

Centering means subtracting the column mean from every entry in that column, so a centered column has a mean equal to 0. It is worth knowing that centering is built into almost every common scaling method. Centering alone is easily performed with the mutate_if() function:
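A minimal sketch of the centering step, using a made-up tibble with hypothetical column names and assuming the numeric columns contain no missing values:

```r
library(tidyverse)

# Hypothetical toy data (made-up values)
data <- tibble(
  group   = factor(c("N", "N", "T", "T")),
  lipid_a = c(0.1, 0.2, 0.4, 0.7),
  lipid_b = c(4, 9, 16, 25)
)

# Subtract the column mean from every entry of each numeric column
data_centered <- data %>%
  mutate_if(is.numeric, ~ . - mean(.))

data_centered
```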

In this way, we obtain the following output:

Data centering via mutate_if() and centering function: ~.-mean(.).

Now we can test whether the centering worked correctly. In the 'Missing values handling in R' chapter, we showed you the sapply() function, which applies a function to every column of a tibble and returns a vector. We will now recalculate the mean of every column and round it to 10 decimal places using the following line of code:
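One way to write this check is sketched below. The toy tibble is hypothetical, and restricting the check to numeric columns with select() is an assumption about how the non-numeric columns are handled.

```r
library(tidyverse)

# Hypothetical toy data (made-up values), centered as described above
data <- tibble(
  group   = factor(c("N", "N", "T", "T")),
  lipid_a = c(0.1, 0.2, 0.4, 0.7),
  lipid_b = c(4, 9, 16, 25)
)
data_centered <- data %>%
  mutate_if(is.numeric, ~ . - mean(.))

# Recalculate the mean of every numeric column and round to 10 decimal places
col_means <- round(sapply(select(data_centered, where(is.numeric)), mean), 10)
col_means
```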

And the output:

Data centering recheck. As you can see, the mean of every centered column is equal to 0.

Additional remark: if we did not round the result, you would find that the means are tiny nonzero values rather than exactly 0. That is because floating-point arithmetic has limited precision, in R as in other programming languages, so small numerical errors are normal. These values are so small that rounding to 10 or even 15 decimal places still gives 0.

Data scaling in R

We will again apply the mutate_if() function for data scaling. Additionally, we will define the scaling functions separately:
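The five scaling functions could be defined as in the sketch below, which follows the formulas given by van den Berg et al. The function names are hypothetical, and the sketch assumes numeric vectors without missing values.

```r
# Autoscaling (UV scaling): center, then divide by the standard deviation
autoscale <- function(x) (x - mean(x)) / sd(x)

# Pareto scaling: center, then divide by the square root of the standard deviation
pareto_scale <- function(x) (x - mean(x)) / sqrt(sd(x))

# Range scaling: center, then divide by the range (max - min)
range_scale <- function(x) (x - mean(x)) / (max(x) - min(x))

# Level scaling: center, then divide by the mean
level_scale <- function(x) (x - mean(x)) / mean(x)

# Vast scaling: autoscaling multiplied by mean / standard deviation
vast_scale <- function(x) ((x - mean(x)) / sd(x)) * (mean(x) / sd(x))

# Quick check on a small vector: autoscaled values have unit standard deviation
x <- c(1, 2, 3, 4, 5)
sd(autoscale(x))
```

Every function centers the data first, which is why a separate centering step is not needed before scaling.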

After executing these lines of code, the scaling functions will appear in the global environment under 'Functions'. We are ready to scale the data set. We will scale the previously log-transformed data:
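Applying the scaling functions to log-transformed data could then look as follows. This is a sketch on a made-up tibble; the function and object names are hypothetical, and the definitions are repeated in compact form so that the block runs on its own.

```r
library(tidyverse)

# Hypothetical toy data (made-up values), log10-transformed first
data <- tibble(
  group   = factor(c("N", "N", "T", "T")),
  lipid_a = c(10, 100, 1000, 10000),
  lipid_b = c(20, 40, 80, 160)
)
data_log10 <- data %>% mutate_if(is.numeric, log10)

# Scaling functions (assumed names; no missing values expected)
autoscale    <- function(x) (x - mean(x)) / sd(x)
pareto_scale <- function(x) (x - mean(x)) / sqrt(sd(x))
range_scale  <- function(x) (x - mean(x)) / (max(x) - min(x))
level_scale  <- function(x) (x - mean(x)) / mean(x)
vast_scale   <- function(x) ((x - mean(x)) / sd(x)) * (mean(x) / sd(x))

# Apply each scaling method to every numeric column
data_auto   <- data_log10 %>% mutate_if(is.numeric, autoscale)
data_pareto <- data_log10 %>% mutate_if(is.numeric, pareto_scale)
data_range  <- data_log10 %>% mutate_if(is.numeric, range_scale)
data_level  <- data_log10 %>% mutate_if(is.numeric, level_scale)
data_vast   <- data_log10 %>% mutate_if(is.numeric, vast_scale)
```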

The output:

Data autoscaling in R with mutate_if() and autoscaling function.

The output:

Data Pareto scaling in R with mutate_if() and Pareto scaling function.

The output:

Data Range scaling in R with mutate_if() and Range scaling function.

The output:

Data Level scaling in R with mutate_if() and Level scaling function.

The output:

Data Vast scaling in R with mutate_if() and Vast scaling function.
