
Data Transformation and scaling in Python



For an introduction to the principles of transforming and scaling data, we refer to the R section of this tutorial.

Required packages

The required packages for this section are pandas, scikit-learn, and NumPy. They can be installed with the following command in the Command Prompt (Windows) or the terminal (macOS). NumPy is installed automatically as a dependency of pandas.

pip install pandas scikit-learn
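
Note that pandas does not ship with an Excel reader by default; reading .xlsx files with pd.read_excel requires the openpyxl engine. If it is missing from your environment, it can be installed in the same way:

pip install openpyxl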

Loading the data

In this tutorial we will use the lipidomics demo dataset in Excel format, which you can download via the link below.

Place the downloaded Lipidomics_dataset.xlsx file in the same folder as your JupyterLab notebook. Then run the following code in Jupyter:

import pandas as pd

# Read the Excel file; decimal="," tells pandas that commas are used as decimal separators
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
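
As a quick sanity check (an addition to the original workflow), you can preview the loaded data and confirm which columns pandas recognised as numeric; this numeric/non-numeric distinction matters for the transformations below:

# Preview the first rows of the dataset
print(df.head())

# List the columns pandas parsed as numeric - only these will be transformed later
print(df.select_dtypes(include=["number"]).columns.tolist())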

Transformation

Data transformations help normalize distributions, improve model performance, and handle skewed data. The most common transformation in lipidomics data processing is the log transformation. Applying a log transformation to a DataFrame can be as simple as:

import numpy as np

df.apply(np.log)  # this will give an error on the lipidomics demo dataset

On DataFrames that contain non-numerical columns, however, such as ours, the code above will raise an error, so we have to make sure the transformation is applied only to the numerical columns. Instead of the code above we can use:

# Apply log transformation to numeric columns
df_numeric = df.select_dtypes(include=["number"]).apply(lambda x: np.log(x))

# Replace original numeric columns with transformed values
df.update(df_numeric)

To apply a different kind of transformation, we simply replace the call to np.log with a different NumPy function. For example, for a square root transformation we can use np.sqrt:

# Apply square root transformation to numeric columns
df_numeric = df.select_dtypes(include=["number"]).apply(lambda x: np.sqrt(x))
df.update(df_numeric)
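
One practical caveat worth adding here: np.log is undefined for zeros and negative values, which can occur in intensity data, and produces -inf or NaN in those cells. A common workaround, sketched below, is np.log1p, which computes log(1 + x) and therefore tolerates zeros:

# log(1 + x) tolerates zero values, unlike plain np.log
df_numeric = df.select_dtypes(include=["number"]).apply(np.log1p)
df.update(df_numeric)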

Scaling

For scaling the data in a Pandas DataFrame, we have created the following utility function:

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
import numpy as np

def apply_scaling(data, method):
    # Select numeric columns only
    data_numeric = data.select_dtypes(include=["number"]).copy()

    if method == "Centering":
        # Centering (subtract the mean for each column)
        data_numeric = data_numeric.apply(lambda x: x - x.mean(), axis=0)

    elif method == "Autoscaling":
        # Standardization (subtract the mean and divide by the standard deviation)
        scaler = StandardScaler()
        data_numeric = pd.DataFrame(
            scaler.fit_transform(data_numeric),
            columns=data_numeric.columns,
            index=data_numeric.index
        )

    elif method == "Range Scaling":
        # Min-Max scaling (scale to the range [0, 1])
        scaler = MinMaxScaler()
        data_numeric = pd.DataFrame(
            scaler.fit_transform(data_numeric),
            columns=data_numeric.columns,
            index=data_numeric.index
        )

    elif method == "Pareto Scaling":
        # Pareto scaling: (x - mean) / sqrt(std)
        data_numeric = (data_numeric - data_numeric.mean()) / np.sqrt(data_numeric.std())

    elif method == "Vast Scaling":
        # Vast scaling (subtract the mean, divide by std, multiply by a factor of 10)
        data_numeric = (data_numeric - data_numeric.mean()) / (data_numeric.std() * 10)

    elif method == "Level Scaling":
        # Level scaling (divide by the mean)
        data_numeric = data_numeric / data_numeric.mean()

    # Write the scaled numeric columns back into a copy of the original DataFrame
    data_return = data.copy()
    data_return.update(data_numeric)

    return data_return

To use this on our data we can simply call:

df_centered = apply_scaling(df, "Centering")
df_autoscaled = apply_scaling(df, "Autoscaling")
df_range_scaled = apply_scaling(df, "Range Scaling")
df_pareto_scaled = apply_scaling(df, "Pareto Scaling")
df_vast_scaled = apply_scaling(df, "Vast Scaling")
df_level_scaled = apply_scaling(df, "Level Scaling")
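
As a quick verification (not part of the original tutorial), you can check that autoscaling behaved as expected: every numeric column of the autoscaled DataFrame should have a mean of approximately 0 and a standard deviation of approximately 1:

# Means should be ~0 and standard deviations ~1 for every numeric column
scaled_numeric = df_autoscaled.select_dtypes(include=["number"])
print(scaled_numeric.mean().round(3))
print(scaled_numeric.std().round(3))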
See also: Data transformation and scaling - introduction (R section)
Download: Lipidomics_dataset.xlsx (348KB)