Data Transformation and Scaling in Python
For an introduction to the principles of transforming and scaling data, we refer to the R section of this tutorial.
Data transformation and scaling - introduction

Required packages
The required packages for this section are pandas, scikit-learn and numpy. They can be installed with the following command in the command window (Windows) or terminal (Mac); numpy is installed automatically as a dependency of pandas.
pip install pandas scikit-learn
Loading the data
In this tutorial we will use the lipidomics demo dataset in Excel format, which you can download with the link below.
Place the downloaded Lipidomics_dataset.xlsx file in the same folder as your JupyterLab script. Then run the following code in Jupyter:
import pandas as pd
import numpy as np

df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
Transformation
Data transformations help in normalising distributions, improving model performance, and handling skewed data. The most common transformation in lipidomics data processing is the log transformation. Applying a log transformation to a DataFrame can be as simple as:
df.apply(np.log)  # this will raise an error on the lipidomics demo dataset
On DataFrames that contain non-numerical columns, however, such as ours, the code above raises an error, so we have to make sure the transformation is only applied to the numerical columns. Instead of the code above we can use:
# Apply log transformation to numeric columns
df_numeric = df.select_dtypes(include=["number"]).apply(np.log)
# Replace original numeric columns with transformed values
df.update(df_numeric)
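One caveat: np.log is undefined for zero (it returns -inf) and for negative values. If your dataset can contain zero intensities, np.log1p, which computes log(1 + x), is a common workaround. The small DataFrame below is invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with a zero measurement (column names are made up)
df_demo = pd.DataFrame({"Sample": ["A", "B"], "PC_34_1": [0.0, 99.0]})

plain_log = df_demo.select_dtypes(include=["number"]).apply(np.log)
print(plain_log)  # np.log(0) yields -inf

shifted_log = df_demo.select_dtypes(include=["number"]).apply(np.log1p)
print(shifted_log)  # np.log1p(0) = 0, np.log1p(99) = log(100)
```

Whether a shifted log is appropriate depends on your data and downstream analysis, so treat this as an option rather than a default.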
To apply a different kind of transformation, we simply replace the call to np.log with a different numpy function. For example, if we want to use a square root transformation, we can use np.sqrt:
# Apply square root transformation to numeric columns
df_numeric = df.select_dtypes(include=["number"]).apply(np.sqrt)
df.update(df_numeric)
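As a quick sanity check on a small, made-up DataFrame (the sample and lipid column names below are invented for illustration), you can verify that this pattern transforms only the numeric columns and leaves text columns untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset: one label column plus two numeric "lipid" columns
df_demo = pd.DataFrame({
    "Sample": ["A", "B", "C"],
    "PC_34_1": [100.0, 400.0, 900.0],
    "TG_52_2": [1.0, 4.0, 9.0],
})

# Square root transformation of the numeric columns only
df_numeric = df_demo.select_dtypes(include=["number"]).apply(np.sqrt)
df_demo.update(df_numeric)

print(df_demo)
# "Sample" is unchanged; PC_34_1 becomes 10, 20, 30 and TG_52_2 becomes 1, 2, 3
```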
Scaling
For scaling the data in a Pandas DataFrame, we have created the following utility function:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
import numpy as np
def apply_scaling(data, method):
    # Select numeric columns only
    data_numeric = data.select_dtypes(include=["number"]).copy()
    if method == "Centering":
        # Centering (subtract the mean of each column)
        data_numeric = data_numeric - data_numeric.mean()
    elif method == "Autoscaling":
        # Standardisation (subtract the mean and divide by the standard deviation)
        scaler = StandardScaler()
        data_numeric = pd.DataFrame(
            scaler.fit_transform(data_numeric),
            columns=data_numeric.columns,
            index=data_numeric.index
        )
    elif method == "Range Scaling":
        # Min-max scaling (scale each column to the range [0, 1])
        scaler = MinMaxScaler()
        data_numeric = pd.DataFrame(
            scaler.fit_transform(data_numeric),
            columns=data_numeric.columns,
            index=data_numeric.index
        )
    elif method == "Pareto Scaling":
        # Pareto scaling: (x - mean) / sqrt(std)
        data_numeric = (data_numeric - data_numeric.mean()) / np.sqrt(data_numeric.std())
    elif method == "Vast Scaling":
        # Vast scaling: autoscaling weighted by each column's mean/std ratio,
        # i.e. ((x - mean) / std) * (mean / std)
        data_numeric = ((data_numeric - data_numeric.mean()) / data_numeric.std()) \
            * (data_numeric.mean() / data_numeric.std())
    elif method == "Level Scaling":
        # Level scaling (divide by the mean)
        data_numeric = data_numeric / data_numeric.mean()
    # Copy the original DataFrame and overwrite its numeric columns
    data_return = data.copy()
    data_return.update(data_numeric)
    return data_return
To use this on our data we can simply call:
df_centered = apply_scaling(df, "Centering")
df_autoscaled = apply_scaling(df, "Autoscaling")
df_range_scaled = apply_scaling(df, "Range Scaling")
df_pareto_scaled = apply_scaling(df, "Pareto Scaling")
df_vast_scaled = apply_scaling(df, "Vast Scaling")
df_level_scaled = apply_scaling(df, "Level Scaling")
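As a sanity check, autoscaled columns should end up with mean (approximately) 0, and range-scaled columns should span exactly [0, 1]. The sketch below verifies this with pandas only, on a made-up numeric DataFrame, mirroring what StandardScaler (population standard deviation, ddof=0) and MinMaxScaler compute:

```python
import pandas as pd

# Hypothetical numeric data for the check
df_demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})

# Autoscaling by hand (ddof=0 matches StandardScaler)
autoscaled = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

# Range scaling by hand (matches MinMaxScaler)
range_scaled = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())

print(autoscaled.mean())                        # ~0 for every column
print(range_scaled.min(), range_scaled.max())   # 0 and 1 for every column
```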