Data transformation and scaling - introduction
Data transformation
Two methods of data transformation are commonly used in lipidomics and metabolomics:
logarithm transformation (log transformation),
square root transformation (sqrt transformation).
Logarithm transformation
Log transformation is frequently applied when preprocessing metabolomics/lipidomics data. It is popular because many assume that it normalizes the distribution of variables, or 'adjusts' the shape of a distribution to be more normal-like. Indeed, log transformation decreases the data spread and minimizes the influence of outliers. In some cases, this leads to more symmetric distribution shapes and corrects heteroscedasticity (differences in variance between biological groups). However, it should be stressed that the distribution does not automatically become Gaussian after log transformation; whether the shape is actually more symmetrical requires additional verification. The log transformation changes multiplicative relationships into additive ones and acts as a pseudo-scaling, as large differences between concentrations shrink after the transformation. Log transformation is also problematic when 'NA' values have been substituted simply by '0', because log(0) is undefined.
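As a minimal illustration in R (assuming a numeric matrix `conc` of concentrations with samples in rows and lipids in columns; the lipid names and the pseudo-count choice are purely illustrative), adding a small offset before the transformation avoids the undefined log(0):

```r
# Toy concentration matrix: 6 samples (rows) x 3 lipids (columns),
# with one zero that would break a plain log transformation
set.seed(1)
conc <- matrix(rlnorm(18, meanlog = 2, sdlog = 1), nrow = 6,
               dimnames = list(paste0("S", 1:6), c("PC 34:1", "PC 36:2", "TG 52:3")))
conc[1, 1] <- 0                      # e.g., an NA replaced by 0 during preprocessing

log2(conc)                           # returns -Inf for the zero entry
offset   <- min(conc[conc > 0]) / 2  # one common (arbitrary) pseudo-count choice
log_conc <- log2(conc + offset)      # defined for all entries

# Compare distribution shapes before and after the transformation
par(mfrow = c(1, 2))
hist(conc,     main = "Raw",              xlab = "Concentration")
hist(log_conc, main = "Log2-transformed", xlab = "log2(concentration + offset)")
```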
Square root transformation
Besides log transformation, square root transformation (sqrt transformation) may alternatively be selected. The sqrt transformation, similarly to log transformation, corrects heteroscedasticity (differences in variance between biological groups) and acts as a pseudo-scaling, as large differences between lipid/metabolite concentrations shrink as a result of the square root transformation.
Data centering
Centering is a simple operation that relies on subtracting from every concentration in a column a mean of all concentrations measured for a feature:
$$\tilde{x}_{ij} = x_{ij} - \bar{x}_i$$

where $\tilde{x}_{ij}$ is the normalized concentration of the i-th metabolite in the j-th patient, $x_{ij}$ is the measured concentration of the i-th metabolite in the j-th patient, and $\bar{x}_i$ is the mean concentration of the i-th metabolite across all samples.
Centering is used when the focus is on extracting differences hidden in the data set. After centering, the mean of each feature's concentrations equals 0.
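A minimal sketch in R, assuming a numeric matrix `conc` with samples in rows and features in columns (a hypothetical example); `scale()` with `scale = FALSE` performs exactly this column-wise mean subtraction:

```r
# Mean-center each feature (column): subtract the column mean from every value
conc <- matrix(c(10, 12, 14, 100, 110, 120), nrow = 3,
               dimnames = list(paste0("S", 1:3), c("Lipid A", "Lipid B")))
centered <- scale(conc, center = TRUE, scale = FALSE)

round(colMeans(centered), 10)   # after centering, every column mean is 0
```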
Data scaling
The purpose of scaling is to normalize the scale of all features. Auto- and Pareto scaling are often applied to lipidomics and metabolomics data.
Autoscaling (also known as Uni-Variate scaling or UV-scaling) is given by the following formula:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_i}{s_i}$$

where $\tilde{x}_{ij}$ is the normalized concentration of the i-th metabolite in the j-th patient, $x_{ij}$ is the measured concentration of the i-th metabolite in the j-th patient, $\bar{x}_i$ is the mean concentration of the i-th metabolite across all samples, and $s_i$ is the standard deviation of the i-th metabolite across all samples.
Each centered value is divided by the standard deviation of the feature's concentrations in its column. After Autoscaling, the standard deviation of each column equals 1.
Pareto scaling is similar to Autoscaling, and it is given by this formula:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_i}{\sqrt{s_i}}$$

As above, $\tilde{x}_{ij}$ is the normalized concentration of the i-th metabolite in the j-th patient, $x_{ij}$ is the measured concentration, $\bar{x}_i$ is the mean concentration across all samples, and $s_i$ is the standard deviation of the i-th metabolite across all samples.
Pareto scaling is recommended for metabolomics and lipidomics data because it reduces the relative importance of high values while keeping the data structure largely intact.
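A minimal sketch in R of both methods (column-wise, samples in rows; the matrix and lipid names are illustrative): `scale()` covers Autoscaling, while Pareto scaling divides the centered values by the square root of the standard deviation:

```r
set.seed(2)
conc <- matrix(rlnorm(30, meanlog = 2), nrow = 10,
               dimnames = list(paste0("S", 1:10), c("PC 34:1", "PE 36:2", "TG 52:3")))

# Autoscaling (UV-scaling): (x - mean) / sd, per column
auto_scaled <- scale(conc, center = TRUE, scale = TRUE)
apply(auto_scaled, 2, sd)            # each column now has standard deviation 1

# Pareto scaling: (x - mean) / sqrt(sd), per column
pareto_scale <- function(x) {
  centered <- sweep(x, 2, colMeans(x), "-")
  sweep(centered, 2, sqrt(apply(x, 2, sd)), "/")
}
pareto_scaled <- pareto_scale(conc)
```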
Additional scaling methods involve:
Range scaling, given by the formula:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_i}{x_{i,\max} - x_{i,\min}}$$

where $\tilde{x}_{ij}$ is the normalized concentration of the i-th metabolite in the j-th patient, $x_{ij}$ is the measured concentration, $\bar{x}_i$ is the mean concentration across all samples, and $x_{i,\max}$ and $x_{i,\min}$ are the maximum and minimum concentrations of the i-th metabolite across all samples.
Level scaling:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_i}{\bar{x}_i}$$

where $\tilde{x}_{ij}$ is the normalized concentration of the i-th metabolite in the j-th patient, $x_{ij}$ is the measured concentration, and $\bar{x}_i$ is the mean concentration of the i-th metabolite across all samples.
Vast scaling:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_i}{s_i} \cdot \frac{\bar{x}_i}{s_i}$$

where $\tilde{x}_{ij}$ is the normalized concentration of the i-th metabolite in the j-th patient, $x_{ij}$ is the measured concentration, $\bar{x}_i$ is the mean concentration across all samples, and $s_i$ is the standard deviation of the i-th metabolite across all samples.
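For completeness, minimal column-wise R implementations of these three variants (a sketch assuming the same sample-by-feature matrix layout as above):

```r
# Range scaling: (x - mean) / (max - min), per column
range_scale <- function(x) {
  rng <- apply(x, 2, function(col) max(col) - min(col))
  sweep(sweep(x, 2, colMeans(x), "-"), 2, rng, "/")
}

# Level scaling: (x - mean) / mean, per column
level_scale <- function(x) {
  sweep(sweep(x, 2, colMeans(x), "-"), 2, colMeans(x), "/")
}

# Vast scaling: autoscaling additionally weighted by mean / sd
vast_scale <- function(x) {
  m <- colMeans(x)
  s <- apply(x, 2, sd)
  sweep(sweep(sweep(x, 2, m, "-"), 2, s, "/"), 2, m / s, "*")
}
```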
This information was prepared based on the following manuscript:
Transforming and Normalizing Your Data: Sequence of Events
As a beginner, one might wonder how to prepare a data frame for statistical analysis. Here, we will present an example of the sequence of events, starting from raw peak intensities or areas.
After processing raw mass spectrometry data, a table of intensities (or areas) is generated. Suppose the resulting data frame contains signal intensities (areas) with metabolites in the columns and the corresponding samples (patients) in the rows. The figure below presents an example of a sequence of events leading to a data frame ready for statistical analysis and visualization. We explain it below.
Steps 1 through 3 (from raw MS data to lipid/metabolite concentration tables)
The article and GitBook do not discuss processing raw mass spectrometry data or assigning lipid/metabolite identifications, as these topics are covered by previously published reviews.
Please refer to the following manuscripts as examples covering the steps from raw mass spectrometry data processing to obtaining peak intensity (area) tables, analyte identification, lipid (metabolite) concentration calculations, and quality checks.
The previous chapters also discuss the fundamentals of initial data normalization, including the example of signals' normalization against standards in R (please check Data normalization to the internal standards).
1) Lipidomics/Metabolomics data processing reviews and tools (selected tools/manuscripts):
2) Identification and annotation of lipids (metabolites):
3) Quality control, validation, lipid (metabolite) analysis/quantitation, analytical-variation-related normalizations (selected examples):
We also have the following recommendations if you are a beginner to lipid/metabolite MS data processing, normalization, and initial analysis:
DOs:
1) Always carefully read the manuals and select the correct processing settings for your chromatography and/or MS method.
2) Follow appropriate guidelines and recommendations for lipid and metabolite annotations, e.g., Lipidomics Standards Initiative Consortium, LipidMaps, Metabolomics Standards Initiative, and Metabolomics Society. Use the currently recommended shorthand notations for lipids and metabolites.
Support the identifications with retention patterns, e.g.:
ECN models in RP chromatography or standard retention pattern for lipid class separation methods. For more information, please see the references above.
Rely on mass spectrometry information, e.g.:
Behavior in ion mobility,
Accurate m/z annotation (FTICR, e.g., < 1-3 ppm, Q-Orbitrap or Q-TOF < 5 ppm, etc.)
Thorough MS/MS spectra analysis for specific fatty acyl or headgroup identifications.
Always verify software-based annotations. If in doubt, see the references above.
3) Remember to perform deisotoping (isotopic correction). It is essential for obtaining more accurate lipid/metabolite concentrations in shotgun lipidomics, direct infusion approaches, and lipid-class separation methods like HILIC-MS or SFC-MS.
4) Always use analytical standards for quantification.
5) If you chose the pre-acquisition sample amount normalization method and it failed (evident from significantly varying MS responses across samples) or opted not to use it, you can always apply statistical normalization methods post-acquisition.
6) Consider applying response factors to obtain more accurate lipid/metabolite concentrations, particularly for methods relying on one-point calibration.
7) Carefully examine the data for potential isobaric overlaps, such as sodium-isobaric overlap in shotgun lipidomics and lipid-class-separation approaches.
8) Always carefully inspect the processed data and compare them with raw MS data, e.g., chromatograms, MS, and MS/MS spectra for inconsistencies.
9) Always consider submitting your raw data to the appropriate repository for transparency, e.g.:
10) Always consider placing the processed data in the supplementary materials of your article.
11) Perform at least a simple validation of your analytical methods to ensure reproducibility, accuracy, and selectivity, and learn about carry-over, limits of quantitation (and detection), response linearity ranges, matrix effects, etc.
12) If you have doubts, contact more experienced units and labs and use networking events to learn, train, and grow. Networking is an essential part of growing in lipidomics/metabolomics.
DON'Ts:
1) Never entirely rely on your data processing software. Double-check results after every step of MS data processing, and if necessary, correct software settings, repeat the data processing or introduce manual corrections.
2) Don't entirely trust software-based identifications. Always ensure that they are reasonable from a biological point of view. If in doubt, we encourage you to read the articles referenced above. Use the maximum available analytical information to assign annotations, including chromatography patterns, comparison to analytical standards, high-resolution MS1 information, MS/MS spectra, ion mobility information, etc.
3) Don't avoid the application of standards in lipidomics and metabolomics. Skipping standards lowers the cost of your analysis but, if no normalization to standards is performed, it also lowers the reliability of your findings. Carefully select analytical standards, considering, e.g., unsaturation level, fatty acyl chain lengths, polarity, etc.
4) Don't make impulsive decisions regarding sample amount normalization! Carefully evaluate whether pre-acquisition normalization (which requires at least brief optimization) is feasible, or whether only post-acquisition normalization can be applied, e.g., when pre-acquisition normalization is likely to fail or no suitable sample amount measure is available, such as volume or mass (because of strongly varying sample dilution) or creatinine (because of accompanying disorders).
5) Don't use non-validated (tested) analytical methods for lipid/metabolite quantitation.
6) Don't publish articles without access to raw and/or processed data. Consider submitting raw data to a repository (if possible) and publishing all processed data in the supplementary materials.
Steps 3 through 4 (batch effect normalizations)
The data frames obtained from processing mass spectrometry data are rarely ready for statistical analysis. Several issues have to be addressed before data mining and visualization are performed. One such issue is a difference in mass spectrometer response across the entire analysis sequence, also known as the batch effect. Please see the figure below showing the simplified concept of batch effect correction:
Notably, the GitBook contains a chapter dedicated to batch effect correction (please see the previous chapter entitled "Batch effect corrections in R").
Here are some recommendations concerning batch effect corrections:
DOs:
1) Always inspect your data for batch effects, e.g., by plotting all samples in an x-y scatter plot, where the y-axis reflects instrument response or concentration of a lipid/metabolite while the x-axis presents the order of samples measured in the sequence. Alternatively, use PCA, color samples by batches, and check if the main sources of variance (PC1, PC2) separate your data set according to analytical batches (a minimal sketch of both checks follows this list).
2) Always use at least one type of QC sample, as most batch effects can be readily noticed and mitigated based on repeatedly injected QC samples. This could be, e.g.:
Extracts of the sample obtained by pooling biological material aliquots from all participants,
Certified reference materials extracts, e.g., SRM NIST 1950 for plasma/serum,
System suitability standards (SSS) which are composed of analytical standards covering all lipid/metabolite classes detectable within an optimized MS method.
3) Apply one of the corrections from the GitBook if you confirm batch-related effects. A simple, NIST-based batch effect correction proposed by Chocholouskova et al. (DOI: https://doi.org/10.1016/j.talanta.2021.122367) can also be used for plasma/serum samples.
4) Make sure correct computations are performed, and batch-related effects are mitigated. For instance, reinspect scatter plots or PCA score plots.
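A minimal sketch in R of the two checks from point 1, assuming a data frame `df` with an injection-order column, a batch column, and numeric feature columns (all names and the simulated drift are hypothetical):

```r
# Simulated example: 40 injections, 2 batches, one lipid drifting with injection order
set.seed(3)
df <- data.frame(order  = 1:40,
                 batch  = rep(c("B1", "B2"), each = 20),
                 lipid1 = rlnorm(40) + (1:40) * 0.05,   # drift across the sequence
                 lipid2 = rlnorm(40))

# 1) Response vs. injection order, colored by batch
plot(df$order, df$lipid1, col = factor(df$batch), pch = 19,
     xlab = "Injection order", ylab = "Response (lipid1)")
legend("topleft", legend = levels(factor(df$batch)), col = 1:2, pch = 19)

# 2) PCA colored by batch: separation along PC1/PC2 suggests a batch effect
pca <- prcomp(scale(df[, c("lipid1", "lipid2")]))
plot(pca$x[, 1], pca$x[, 2], col = factor(df$batch), pch = 19,
     xlab = "PC1", ylab = "PC2")
```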
DON'Ts:
1) Don't skip inspecting your data for batch effects and correcting them when confirmed, as batch-related variations can impact data interpretation and obscure critical insights.
2) Don't measure samples chaotically; instead, carefully plan your lipidomics/metabolomics sequence before approaching the mass spectrometer, including batch size, QC, SSS, and blank sample injections. Keep a record of the measured sequence together with all collected data.
Steps 4 through 5 (data filtrations)
Once batch-related effects are removed, the data should be filtered. The filtration can be performed for both lipid/metabolite concentrations and entire samples.
DOs:
1) Filter out all variables with predominant missing values, e.g., more than 35-40%. Any missing value imputation method will artificially 'create' a fair part of your data in such a case. This can hardly represent biological information of interest.
2) Consider filtering out features whose determined concentrations in QC samples show a high coefficient of variation (CV, calculated as the standard deviation divided by the mean), e.g., exceeding 30-40%. In such cases, the reliability of these concentrations is uncertain, especially for low-abundance lipids. If your analytical method is validated, check the limits of quantitation (LOQ) and detection (LOD) for selected analytes or analyte classes. Using these criteria, you can eliminate noisy, low-abundance features that are unlikely to reflect meaningful biological variation, as their concentrations cannot be reliably estimated with your method (a minimal filtering sketch follows this list).
3) There must be a good reason to exclude whole samples from any dataset, e.g., they are strong outliers in the PCA score plot, most lipids or metabolites are missing for these samples, or their values are unusually elevated, far above the mean.
4) Inspect your data after filtration, i.e., how much information was cropped and why.
5) Always present the details of data filtration in the materials and methods section for both lipids/metabolites and samples.
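A minimal sketch in R of the missing-value and QC-CV filters from points 1-2, assuming a feature table `conc` (samples in rows) and a logical vector `is_qc` marking pooled QC injections (all names and thresholds are illustrative):

```r
set.seed(4)
conc <- matrix(rlnorm(100), nrow = 10,
               dimnames = list(paste0("S", 1:10), paste0("Lipid", 1:10)))
conc[sample(length(conc), 15)] <- NA          # introduce some missing values
is_qc <- c(TRUE, TRUE, TRUE, rep(FALSE, 7))   # first three injections are pooled QCs

# Fraction of missing values per feature
miss_frac <- colMeans(is.na(conc))

# Coefficient of variation (%) per feature in the QC samples
qc_cv <- apply(conc[is_qc, , drop = FALSE], 2, function(x)
  100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))

# Keep features with at most 40% missing values and a QC CV of at most 30%
keep <- miss_frac <= 0.40 & !is.na(qc_cv) & qc_cv <= 30
filtered <- conc[, keep, drop = FALSE]
```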
DON'Ts:
1) Don't remove/eliminate samples that don't support your hypothesis to benefit it.
2) In general, don't remove samples from your datasets without careful consideration or a good reason to do so.
3) Don't use filtration methods that discard all variables with missing values. You will likely lose valuable insights.
4) Don't apply overly rigorous filtration criteria for outlying concentrations, e.g., trimming all outlying values based on box plots. As mentioned above, you can accidentally remove valuable insights. If your chosen statistical method or algorithm is sensitive to strong outliers, consider data transformation methods rather than excluding measurements (or entire samples).
Alternatively, in some cases, you can consider methods like Winsorization, which trims extreme values by substituting them with less extreme percentiles. This approach is commonly used in lipidomics and metabolomics, e.g., in mass spectrometry imaging, to better represent lipid or metabolite distribution maps.
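A minimal Winsorization sketch in base R, clipping to the 5th and 95th percentiles (the chosen percentiles are an assumption for illustration):

```r
# Replace values below/above the chosen percentiles with those percentile values
winsorize <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

x <- c(1.1, 1.3, 1.2, 1.4, 15.0)   # one extreme value
winsorize(x)
```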
Steps 5 through 6 (missing values imputation)
This GitBook contains a dedicated chapter on missing values - see "Missing Values Handling in R" and "Missing Values Handling in Python".
Here, we will summarize the key DOs and DON'Ts.
DOs:
1) Compare your processed data frames with the raw chromatograms and/or spectra to confirm whether the signals are truly missing. If signals are absent in the data frames but present in the raw data, consider adjusting the data processing parameters or reprocessing the dataset. If your method is not well-optimized and validated, leading to many randomly distributed missing values, additional adjustments, and dataset remeasurement may be necessary. Alternatively, if only a few inconsistencies (pitfalls) are identified, manual corrections based on the raw data can be applied.
2) Try to identify the reasons behind missing values in your dataset so that a suitable imputation method can be applied, i.e., determine whether you are dealing with MCAR, MAR, or MNAR. This may not be easy in all cases. Left-censored missing values (signals below the detection limit) are typical for most well-optimized and validated methods; however, occasionally the signals are MAR, e.g., suppressed by other analytes, which may not be immediately apparent. MCAR values are rare and, by definition, unrelated to observed or unobserved values in the dataset, i.e., they arise from a purely random event.
3) Consider imputation by random forest, k-nearest neighbors (kNN), or column means/medians for MCAR or MAR.
4) Consider kNN, QRILC, or imputation with a percentage of each feature's lowest observed concentration (e.g., half-minimum imputation) for MNAR (left-censored) values; a minimal half-minimum sketch follows this list.
5) Review your data after imputing missing values to ensure the overall data structure remains relatively intact.
6) Be aware that data imputation always creates the risk of changing the "data architecture". To avoid the pitfalls of data imputation, consider applying algorithms and statistical tests that are robust to missing values, such as PCA implementations that tolerate them. For specific examples, please see the chapter "Missing Values Handling in R", subchapter "Missing values - Introduction".
7) Always describe missing values imputation protocol in your article's materials and methods section.
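A minimal base-R sketch of a simple left-censored (MNAR-oriented) imputation, replacing each feature's missing values with half of its lowest observed concentration; kNN, QRILC, or random-forest imputation would typically come from dedicated packages, as discussed in the missing-values chapters:

```r
# Half-minimum imputation per feature (column), a common choice for left-censored values
half_min_impute <- function(x) {
  apply(x, 2, function(col) {
    col[is.na(col)] <- min(col, na.rm = TRUE) / 2
    col
  })
}

set.seed(5)
conc <- matrix(rlnorm(30), nrow = 6)
conc[sample(length(conc), 4)] <- NA
imputed <- half_min_impute(conc)
sum(is.na(imputed))   # 0 - all missing values were filled in
```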
DON'Ts:
1) Don't impute a data frame if you have identified a fixable cause behind frequently and randomly occurring missing values; remedy the situation and address the specific issue instead. If your data processing settings are incorrect, reprocess your raw data. If missing values result from insufficient method optimization/validation, consider remeasuring your samples.
2) Don't jump straight into statistical analysis. Always examine the effects of data imputation to ensure that the chosen method hasn't overly influenced the trends.
Steps 6 through 7 (data inspection before transformation)
At this step, your data are processed, and it is time to investigate their properties. You can also compute all descriptive statistics here. After data inspection, perform univariate testing; you can also apply machine learning approaches for data classification/regression, e.g., random forest.
DOs:
1) Always familiarize yourself with the properties of your dataset, e.g.:
Plot histograms or density plots. Analyze the shapes of the distribution and the presence of extreme values. Compare variances of each collected sample of populations (experimental groups) visually;
Compute central tendency measures for every experimental group, i.e., mean (symmetric distributions) or median, mode (skewed distributions);
Compute data dispersion or range measures, i.e., standard deviation or variance (symmetric distribution, mean calculated), or range, interquartile range (skewed distributions, median computed);
Plot box plots with dot plots to learn more about the collected data visually, e.g., Q1-3 locations, total range, fences, and analyze outliers carefully;
Recheck the coefficient of variation (CV, standard deviation divided by mean, in %) for all analytes in pooled QC samples;
Check the sample size (n) for each group and the total number of identified biomolecules in each class. Recheck for proper formatting (correct assignment of samples to groups, sample codes, etc.);
Recheck Principal Component Analysis (PCA) for outliers and grouping of QC samples. If your QC samples cluster in the PCA score plot, this points toward the integrity of the overall applied methodology.
Plot correlation heat maps, scatter plots, or compute correlations between lipids/metabolites. Analyze these relationships carefully and consider how they may influence the preparation of, e.g., classification models.
Ensure that all variables are expressed in proper units.
Look for further potential biases.
Such data inspection can be facilitated through, e.g., the GGally library in R; a minimal base-R sketch follows this list.
2) Based on your data's properties, make decisions about further data transformation and scaling and select appropriate methods for data mining.
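A minimal base-R sketch of a few of these checks (distribution shape, group-wise spread, QC grouping in PCA, and feature correlations), assuming a numeric feature matrix `conc` and a `group` factor (hypothetical names); `GGally::ggpairs()` offers a richer one-line overview:

```r
set.seed(6)
conc  <- matrix(rlnorm(60, meanlog = 2), nrow = 20,
                dimnames = list(NULL, c("PC 34:1", "PE 36:2", "TG 52:3")))
group <- factor(rep(c("Control", "Case", "QC"), length.out = 20))

hist(conc[, "TG 52:3"], main = "TG 52:3", xlab = "Concentration")  # distribution shape
boxplot(conc[, "TG 52:3"] ~ group, ylab = "TG 52:3")               # spread per group

pca <- prcomp(conc, center = TRUE, scale. = TRUE)
plot(pca$x[, 1:2], col = group, pch = 19)       # QC samples should cluster together

round(cor(conc, method = "pearson"), 2)         # pairwise feature correlations
```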
DON'Ts:
1) Never proceed to data transformation and scaling without familiarizing yourself with data properties. Your data may not need further transformations.
2) Never blindly choose data mining methods without learning the data properties first. This is particularly valid for machine learning approaches. Always choose suitable algorithms. Define your hypothesis and goals clearly.
Steps 7 through 8 (data transformation and scaling)
Data transformation and scaling are the final steps before actual data mining. One needs to remember that in the case of machine learning, data are first split into train and test sets and then transformed and scaled, if necessary.
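A minimal sketch of that order of operations in R: split first, then derive the transformation/scaling parameters from the training set only and reuse them for the test set (the split proportion and variable names are illustrative):

```r
set.seed(7)
conc <- matrix(rlnorm(100, meanlog = 2), nrow = 20)   # 20 samples x 5 features

train_idx <- sample(nrow(conc), 14)        # split BEFORE any transformation/scaling
train <- log2(conc[train_idx, ])
test  <- log2(conc[-train_idx, ])

# Autoscale the training set and store its means and standard deviations
train_scaled <- scale(train)
train_means  <- attr(train_scaled, "scaled:center")
train_sds    <- attr(train_scaled, "scaled:scale")

# Apply the SAME training-derived parameters to the test set (never its own)
test_scaled <- scale(test, center = train_means, scale = train_sds)
```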
Here are some recommendations and points to be considered before transformation and scaling.
DOs:
1) Inspect your data and transform them only if necessary, e.g.:
If you use statistical methods sensitive to skewness, outlier influence, or heteroskedasticity (unequal variance), transform to stabilize the mean and variance. This improves the performance of algorithms such as PCA, (O)PLS, Hierarchical Clustering (HC) relying on Euclidean distances, calculation of Pearson correlation coefficients, etc.
If the transformation of relationships from multiplicative to additive simplifies interpretations, e.g., for fold change analysis in metabolomics and lipidomics.
If the transformed data form simple linear relationships, e.g., the log-transformed instrument response as a linear function of the log of concentrations, etc.
If visualization of transformed data better exposes the insights, e.g., presenting data in a heat map form or reducing the influence of strong outliers in box plots connected with dot plots.
2) Select transformations that address specific issues in your data. For example, consider square root transformation for mild skewness, or logarithm transformation to address strong (right) skewness caused by high-value outliers or to reduce heteroskedasticity. Log transformation is among the most often used, as it addresses common issues shared by most lipidomics and metabolomics data sets.
3) Use mean-centering before algorithms like PCA to facilitate the interpretability of results (here, eigenvalues are to be interpreted in terms of variance captured by each PC). Remember that mean-centering is performed, e.g., within Auto- and Pareto-scaling.
4) Scale lipidomics/metabolomics data using algorithms sensitive to different concentration ranges, like PCA, (O)PLS, Support Vector Machine (SVM), Hierarchical Clustering (HC) - particularly relying on Euclidean distances, etc. Among the most often applied methods are Pareto- and Auto-scaling.
5) Remember: first transform your data, then scale them (see the end-to-end sketch after this list).
6) Always try to understand what happens to the data during transformation and scaling and interpret the newly obtained values! Choose transformations that preserve meaningful biological differences. Reinspect the data after transformation and scaling. Assess the influence on their shape, variance, outliers, etc.
7) Always describe how the data were transformed and scaled in your article's materials and methods section.
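A minimal end-to-end sketch in R of this order (transform, then scale, then model), using log transformation, Pareto scaling, and PCA; the `pareto_scale()` helper is the same illustrative function sketched earlier:

```r
set.seed(8)
conc <- matrix(rlnorm(200, meanlog = 3), nrow = 20)   # 20 samples x 10 features

# 1) Transform (log), 2) scale (Pareto, which also mean-centers), 3) model (PCA)
pareto_scale <- function(x) {
  centered <- sweep(x, 2, colMeans(x), "-")
  sweep(centered, 2, sqrt(apply(x, 2, sd)), "/")
}
log_conc    <- log2(conc)
scaled_conc <- pareto_scale(log_conc)

# The data are already centered and scaled, so prcomp() must not repeat it
pca <- prcomp(scaled_conc, center = FALSE, scale. = FALSE)
summary(pca)$importance[, 1:3]   # variance captured by the first PCs
```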
For more information, please refer to the earlier-mentioned article. Look also at the following note on R-bloggers, summarizing the most essential information:
DON'Ts:
1) Don't scale the data before performing the transformation.
2) Don't use more than one transformation and one scaling method. Multiple transformation and/or scaling methods complicate biological interpretation of the data (or make it impossible). Typically, one transformation is applied first, followed by one scaling method.
3) Pay attention to the computations performed by R and Python. For example, mean-centering is performed in the numerator when Pareto and Auto-scaling are applied. Don't double mean-center your data.
4) Don't blindly assume that one transformation and scaling method is always suitable. Inspect the data after transformation and scaling and select the best-performing techniques for each new data set you produce.
5) Don't transform data that still contain missing values or zeros without an appropriate correction, e.g., log(x + C) with a small constant C.
6) Don't assume that data transformation and scaling solve the problem of batch effects.
7) Don't apply transformation and/or scaling before splitting the data in machine learning. This may artificially increase your model's performance.
8) In machine learning-based approaches, don't apply different transformations/scaling to training and test sets after the data are split.
Remember, lipidomics/metabolomics datasets often require case-specific preprocessing. Now, let's move to data transformation and scaling in R.