Data transformation and scaling in Python

For an introduction to the principles of transforming and scaling data, we refer to the R section of this tutorial.

Data transformation and scaling - introduction

Required packages

The required packages for this section are pandas, scikit-learn and numpy. These can be installed with the following command in the command prompt (Windows) or terminal (Mac). Numpy is installed automatically as a dependency of pandas.

pip install pandas scikit-learn

Loading the data

In this tutorial we will use the lipidomics demo dataset in Excel format, which you can download with the link below.

Place the downloaded Lipidomics_dataset.xlsx file in the same folder as your JupyterLab notebook. Then run the following code in Jupyter:

import pandas as pd
# decimal="," tells pandas that the file uses a comma as the decimal separator
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")

Transformation

Data transformations help in normalising distributions, improving model performance, and handling skewed data. The most common transformation in lipidomics data processing is the log transformation. Applying a log transformation to a DataFrame can be as simple as:
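A minimal sketch, assuming the DataFrame contains only numerical values and numpy is imported as np:

import numpy as np
df_log = np.log(df)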

On DataFrames that contain non-numerical columns, however, such as ours, the code above will raise an error, so we have to make sure we only apply the transformation to the numerical columns. Instead of the code above we can use something like the following:
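This sketch selects the numerical columns with select_dtypes (an assumption about how the selection is done) and transforms only those, leaving the annotation columns untouched:

import numpy as np
df_log = df.copy()
num_cols = df_log.select_dtypes(include="number").columns
df_log[num_cols] = np.log(df_log[num_cols])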

To apply a different kind of transformation, we simply need to replace the call to np.log with a different numpy function. For example, if we want to apply a square root transformation, we can use np.sqrt:
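For instance, swapping np.log for np.sqrt in the sketch above:

import numpy as np
df_sqrt = df.copy()
num_cols = df_sqrt.select_dtypes(include="number").columns
df_sqrt[num_cols] = np.sqrt(df_sqrt[num_cols])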

Scaling

For scaling the data in a pandas DataFrame, we have created the following utility function:
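A minimal sketch of such a utility function, using scikit-learn's StandardScaler (autoscaling to mean 0 and standard deviation 1) on the numerical columns only; the function name scale_dataframe and the default choice of scaler are assumptions:

from sklearn.preprocessing import StandardScaler

def scale_dataframe(df, scaler=None):
    # Scale the numerical columns of a DataFrame, leaving the other columns untouched.
    if scaler is None:
        scaler = StandardScaler()
    scaled = df.copy()
    num_cols = scaled.select_dtypes(include="number").columns
    scaled[num_cols] = scaler.fit_transform(scaled[num_cols])
    return scaled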

To use this on our data we can simply call:
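Assuming the scale_dataframe sketch above:

df_scaled = scale_dataframe(df)

A different scikit-learn scaler, such as MinMaxScaler or RobustScaler from sklearn.preprocessing, can be passed via the scaler argument instead.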
