For an introduction to the principles of transforming and scaling data, we refer to the R section of this tutorial.
Required packages
The required packages for this section are pandas, scikit-learn and NumPy. In addition, pandas needs the openpyxl package to read .xlsx files. All of these can be installed with the following command in the command window (Windows) or terminal (Mac). NumPy is installed automatically as a dependency of pandas.
pip install pandas scikit-learn openpyxl
Loading the data
In this tutorial we will use the lipidomics demo dataset in Excel format, which you can download with the link below.
Place the downloaded Lipidomics_dataset.xlsx file in the same folder as your JupyterLab notebook. Then run the following code in Jupyter:
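import numpy as np
import pandas as pd
# decimal="," makes pandas treat commas as the decimal separator when parsing text values as numbers
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")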
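Before transforming anything, it can help to check which columns pandas parsed as numbers. The sketch below uses a small made-up DataFrame in place of the Excel file so that it runs on its own; the column names and values are hypothetical and are not those of the demo dataset. On the real data you would call the same methods on df.

```python
import pandas as pd

# Toy stand-in for the loaded dataset; column names and values are hypothetical
toy = pd.DataFrame({
    "Sample": ["S1", "S2", "S3"],
    "Group": ["control", "control", "treated"],
    "Lipid_A": [1.2, 3.4, 5.6],
    "Lipid_B": [10.0, 20.0, 40.0],
})

# Split the column names by whether pandas treats them as numeric
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()
text_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

print(toy.dtypes)
print("numeric:", numeric_cols)
print("non-numeric:", text_cols)
```

If a column that should hold measurements shows up in the non-numeric list, its values were probably read as text and need to be converted before any transformation.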
Transformation
Data transformations help in normalising distributions, improving model performance, and handling skewed data. The most common transformation in lipidomics data processing is the log transformation. Applying a log transformation to a DataFrame can be as simple as:
df.apply(np.log)  # this raises an error on the lipidomics demo dataset
On DataFrames that contain non-numerical columns, such as ours, the above code raises an error, so we have to make sure the transformation is applied to the numerical columns only. Instead of the above code we can use:
# Apply the log transformation to the numeric columns only
df_numeric = df.select_dtypes(include=["number"]).apply(np.log)
# Replace the original numeric columns with the transformed values
df.update(df_numeric)
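To see the effect on a small scale, here is a sketch on a made-up DataFrame (the names and values are purely illustrative, not taken from the demo dataset). It compares the skewness of a right-skewed column before and after the log transformation. As a variant of the code above, it assigns the transformed columns back directly instead of calling df.update, because update only copies non-missing values and would leave an original value in place wherever the transformation produced NaN.

```python
import numpy as np
import pandas as pd

# Made-up example data; names and values are purely illustrative
toy = pd.DataFrame({
    "Sample": ["S1", "S2", "S3", "S4"],
    "Lipid_A": [1.0, 10.0, 100.0, 1000.0],  # strongly right-skewed
})

# Names of the numeric columns only
numeric_cols = toy.select_dtypes(include=["number"]).columns

skew_before = toy["Lipid_A"].skew()

# Assign the log-transformed values back to the numeric columns
toy[numeric_cols] = np.log(toy[numeric_cols])

skew_after = toy["Lipid_A"].skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
print(toy)
```

Note that np.log returns -inf for zeros and NaN for negative values, so it is worth checking the data for such values before applying the transformation.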
To apply a different kind of transformation, we simply replace the call to np.log with a different NumPy function. For example, to apply a square root transformation, we can use np.sqrt: