Data Normalization – bestNormalize R package

A part of data transformation & normalization – using available R packages

One of the most important parts of metabolomic analysis is the normalization of the data and the selection of the most suitable method for our type of data. The bestNormalize R package (https://cran.r-project.org/web/packages/bestNormalize/vignettes/bestNormalize.html) was created to facilitate the selection of an appropriate normalization. An unsuitable normalization increases the risk of detecting false-positive results.

The bestNormalize package provides a set of candidate transformations and a procedure for comparing them, so that researchers can select the normalization method best suited to their specific type of measured data. It is designed to be user-friendly and to simplify the often intricate task of normalization.

NOTE: Let's follow these steps, which also outline this GitBook:

1) first, install packages,

2) load the required libraries,

3) set the working directory (wd),

4) load the data into R,

5) and perform your analysis.

First, install the package:

# Installation (once only)
install.packages('bestNormalize')

Second, load the library, set the working directory, load the data (see the chapter ‘Loading data into R’), and perform the analysis:

# Library:
library(bestNormalize)

# Set the working directory:
setwd("...")

# Load the data into R:
data <- read.csv(file.choose())

# Find the optimal transformation using repeated cross-validation for our input dataset:
bestNormalize(data$CE.16.1, allow_lambert_s = TRUE)

The argument allow_lambert_s = TRUE tells the bestNormalize function to include Lambert's W transformation of type 's' among the candidate methods it evaluates when determining the best-fit normalization for the given data. This candidate is excluded by default (allow_lambert_s = FALSE).
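Because the comparison relies on repeated cross-validation with randomly drawn folds, the reported statistics can differ slightly between runs. The following is a minimal sketch of one way to make a run repeatable. It assumes a simulated vector x (hypothetical, standing in for data$CE.16.1) and spells out the k (folds) and r (repeats) arguments at their default values; allow_lambert_s is left at its default here to keep the sketch quick.

```r
library(bestNormalize)

# Hypothetical stand-in for data$CE.16.1: simulated right-skewed concentrations
set.seed(42)
x <- rlnorm(80, meanlog = 2, sdlog = 0.5)

# Fix the seed right before each call so the cross-validation folds are the same
set.seed(2024)
fit_a <- bestNormalize(x, k = 10, r = 5, warn = FALSE, quiet = TRUE)

set.seed(2024)
fit_b <- bestNormalize(x, k = 10, r = 5, warn = FALSE, quiet = TRUE)

# Compare the estimated normality statistics of the two runs
all.equal(fit_a$norm_stats, fit_b$norm_stats)
```

Setting the seed immediately before the call is a design choice: any other random operation in between would shift the folds.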

Keep in mind that the transformation is performed individually for each metabolite/lipid, and the input must be a vector rather than the entire matrix.
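Since each metabolite/lipid needs its own call, one possible approach is to loop over the columns of the data frame. The sketch below assumes a small simulated data frame named lipids (hypothetical, with made-up column names and values) in place of the real data, and uses fewer folds and repeats than the defaults only to keep the example fast.

```r
library(bestNormalize)

# Hypothetical data: two simulated lipid columns, 60 samples each
set.seed(123)
lipids <- data.frame(
  CE.16.1 = rlnorm(60, meanlog = 2, sdlog = 0.6),
  CE.18.1 = rlnorm(60, meanlog = 3, sdlog = 0.4)
)

# Fit bestNormalize separately on every column (each column is a vector)
fits <- lapply(lipids, function(x) {
  bestNormalize(x, k = 5, r = 1, warn = FALSE, quiet = TRUE)
})

# Collect the transformed values into a data frame with the same column names
normalized <- as.data.frame(lapply(fits, function(f) f$x.t))

# Record which transformation was chosen for each column
chosen <- vapply(fits, function(f) class(f$chosen_transform)[1], character(1))
chosen
```

Note that different columns may end up with different transformations, which is worth keeping in mind when comparing features afterwards.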

The output:

The output from bestNormalize(data$CE.16.1, allow_lambert_s = TRUE) provides information about the best normalization transformation chosen for the specified variable (data$CE.16.1). Here's a breakdown of the output:

1. Estimated Normality Statistics:

The output lists the candidate transformation methods along with their estimated normality statistics (Pearson P / df). Lower values indicate a distribution closer to normal, and the method with the lowest value is selected. The methods include arcsinh(x), Box-Cox, Center+scale, Double Reversed Log_b(x+a), Lambert's W (type s), Log_b(x+a), orderNorm (ORQ), sqrt(x + a), and Yeo-Johnson.

2. Estimation Method:

The estimation method used is mentioned: Out-of-sample via Cross-Validation (CV) with 10 folds and 5 repeats.

3. Best Chosen Transformation:

The final selection by bestNormalize is a Standardized arcsinh(x) transformation.

4. Transformation:

The output also reports the statistics used for standardization, computed on the transformed values before they are standardized: mean = 6.980617 and standard deviation (sd) = 0.5271941.
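The parts of the printed output described above can also be read from the fitted object directly. This is a minimal sketch on simulated data (the vector x is hypothetical, and the reduced k and r are chosen only for speed), so the numbers and the chosen method will differ from those shown for data$CE.16.1.

```r
library(bestNormalize)

# Hypothetical stand-in for one lipid: simulated right-skewed values
set.seed(1)
x <- rlnorm(80, meanlog = 2, sdlog = 0.5)

fit <- bestNormalize(x, k = 5, r = 2, warn = FALSE, quiet = TRUE)

# Estimated normality statistics (Pearson P / df), smallest first
sort(fit$norm_stats)

# Name of the class of the chosen transformation object
class(fit$chosen_transform)[1]

# Transformed values of the original input
head(fit$x.t)
```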

This package helps confirm that we have selected a suitable method for normalizing our metabolomics/lipidomics data.
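To make the "Standardized arcsinh(x)" result more concrete, the sketch below fits that single transformation with arcsinh_x() and reproduces it by hand as (asinh(x) - mean) / sd. It assumes simulated input (the vector x is hypothetical) and that arcsinh was the selected method, as in the output above; for other methods the manual formula would differ.

```r
library(bestNormalize)

# Hypothetical stand-in for data$CE.16.1: simulated right-skewed values
set.seed(7)
x <- rlnorm(100, meanlog = 6, sdlog = 0.5)

# Fit only the arcsinh transformation, with standardization
fit <- arcsinh_x(x, standardize = TRUE)

# Manual version: arcsinh, then center and scale with the stored statistics
manual <- (asinh(x) - fit$mean) / fit$sd
all.equal(manual, fit$x.t)

# predict() applies the fitted transformation; inverse = TRUE maps values back
transformed <- predict(fit)
back <- predict(fit, newdata = transformed, inverse = TRUE)
all.equal(back, x)
```

The inverse step is useful when results need to be reported on the original measurement scale.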
