Data Transformation and Scaling in Python
For an introduction to the principles of transforming and scaling data, we refer to the R section of this tutorial.
Data transformation and scaling - introduction

Required packages
The required packages for this section are pandas, scikit-learn and numpy. They can be installed with the following command in the command window (Windows) or terminal (Mac); numpy is installed automatically as a dependency of pandas.
pip install pandas scikit-learn
Loading the data
In this tutorial we will use the lipidomics demo dataset in Excel format, which you can download with the link below.
Place the downloaded Lipidomics_dataset.xlsx file in the same folder as your JupyterLab script. Then run the following code in Jupyter:
import pandas as pd
import numpy as np

df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
Transformation
Data transformations help in normalising distributions, improving model performance, and handling skewed data. The most common transformation in lipidomics data processing is the log transformation. Applying a log transformation to a DataFrame can be as simple as:
df.apply(np.log)  # this will raise an error on the lipidomics demo dataset
On DataFrames that contain non-numerical columns, however, such as ours, the code above raises an error, so we have to make sure the transformation is only applied to the numerical columns. Instead of the code above we can use:
# Apply log transformation to numeric columns
df_numeric = df.select_dtypes(include=["number"]).apply(np.log)
# Replace original numeric columns with transformed values
df.update(df_numeric)
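One caveat: np.log is undefined for zero (it returns -inf) and for negative values. If your dataset can contain zero intensities, np.log1p, which computes log(1 + x), is a common workaround. The small DataFrame below is invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with a zero measurement (column names are made up)
df_demo = pd.DataFrame({"Sample": ["A", "B"], "PC_34_1": [0.0, 99.0]})

plain_log = df_demo.select_dtypes(include=["number"]).apply(np.log)
print(plain_log)  # np.log(0) yields -inf

shifted_log = df_demo.select_dtypes(include=["number"]).apply(np.log1p)
print(shifted_log)  # np.log1p(0) = 0, np.log1p(99) = log(100)
```

Whether a shifted log is appropriate depends on your data and downstream analysis, so treat this as an option rather than a default.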
To apply a different kind of transformation, we simply replace the call to np.log with a different numpy function. For example, if we want to use a square root transformation, we can use np.sqrt:
# Apply square root transformation to numeric columns
df_numeric = df.select_dtypes(include=["number"]).apply(np.sqrt)
df.update(df_numeric)
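As a quick sanity check on a small, made-up DataFrame (the sample and lipid column names below are invented for illustration), you can verify that this pattern transforms only the numeric columns and leaves text columns untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset: one label column plus two numeric "lipid" columns
df_demo = pd.DataFrame({
    "Sample": ["A", "B", "C"],
    "PC_34_1": [100.0, 400.0, 900.0],
    "TG_52_2": [1.0, 4.0, 9.0],
})

# Square root transformation of the numeric columns only
df_numeric = df_demo.select_dtypes(include=["number"]).apply(np.sqrt)
df_demo.update(df_numeric)

print(df_demo)
# "Sample" is unchanged; PC_34_1 becomes 10, 20, 30 and TG_52_2 becomes 1, 2, 3
```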
Scaling
For scaling the data in a Pandas DataFrame, we have created the following utility function:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
import numpy as np
def apply_scaling(data, method):
    # Select numeric columns only
    data_numeric = data.select_dtypes(include=["number"]).copy()
    if method == "Centering":
        # Centering (subtract the mean of each column)
        data_numeric = data_numeric - data_numeric.mean()
    elif method == "Autoscaling":
        # Standardisation (subtract the mean and divide by the standard deviation)
        scaler = StandardScaler()
        data_numeric = pd.DataFrame(
            scaler.fit_transform(data_numeric),
            columns=data_numeric.columns,
            index=data_numeric.index
        )
    elif method == "Range Scaling":
        # Min-max scaling (scale each column to the range [0, 1])
        scaler = MinMaxScaler()
        data_numeric = pd.DataFrame(
            scaler.fit_transform(data_numeric),
            columns=data_numeric.columns,
            index=data_numeric.index
        )
    elif method == "Pareto Scaling":
        # Pareto scaling: (x - mean) / sqrt(std)
        data_numeric = (data_numeric - data_numeric.mean()) / np.sqrt(data_numeric.std())
    elif method == "Vast Scaling":
        # Vast scaling: autoscaling weighted by each column's mean/std ratio,
        # i.e. ((x - mean) / std) * (mean / std)
        data_numeric = ((data_numeric - data_numeric.mean()) / data_numeric.std()) \
            * (data_numeric.mean() / data_numeric.std())
    elif method == "Level Scaling":
        # Level scaling (divide by the mean)
        data_numeric = data_numeric / data_numeric.mean()
    # Copy the original DataFrame and overwrite its numeric columns
    data_return = data.copy()
    data_return.update(data_numeric)
    return data_return
To use this on our data we can simply call:
df_centered = apply_scaling(df, "Centering")
df_autoscaled = apply_scaling(df, "Autoscaling")
df_range_scaled = apply_scaling(df, "Range Scaling")
df_pareto_scaled = apply_scaling(df, "Pareto Scaling")
df_vast_scaled = apply_scaling(df, "Vast Scaling")
df_level_scaled = apply_scaling(df, "Level Scaling")
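As a sanity check, autoscaled columns should end up with mean (approximately) 0, and range-scaled columns should span exactly [0, 1]. The sketch below verifies this with pandas only, on a made-up numeric DataFrame, mirroring what StandardScaler (population standard deviation, ddof=0) and MinMaxScaler compute:

```python
import pandas as pd

# Hypothetical numeric data for the check
df_demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})

# Autoscaling by hand (ddof=0 matches StandardScaler)
autoscaled = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

# Range scaling by hand (matches MinMaxScaler)
range_scaled = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())

print(autoscaled.mean())                        # ~0 for every column
print(range_scaled.min(), range_scaled.max())   # 0 and 1 for every column
```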