# Data transformation and scaling in Python

For an introduction to the principles of transforming and scaling data, we refer to the R section of this tutorial.

{% content-ref url="/pages/WEi1PiDi3UbY6UX4gTgl" %}
[Data transformation and scaling - introduction](/omics-data-visualization-in-r-and-python/data-transformation-scaling-and-normalization-in-r/data-transformation-and-scaling-introduction.md)
{% endcontent-ref %}

## Required packages

This section requires pandas, scikit-learn and NumPy, plus openpyxl, which pandas uses to read .xlsx files. Install them with the following command in the Command Prompt (Windows) or terminal (Mac). NumPy is installed automatically as a dependency of pandas.

```
pip install pandas scikit-learn openpyxl
```

## Loading the data

In this tutorial we will use the lipidomics demo dataset in Excel format, which you can download with the link below.

{% file src="/files/O8GcvqGYOzpXcfU7NayN" %}

Place the downloaded Lipidomics\_dataset.xlsx file in the same folder as your Jupyter notebook. Then run the following code in JupyterLab:

```python
import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
```
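After loading, it can help to inspect `df.dtypes` to confirm that the lipid columns were read as numbers, since the later steps only touch numeric columns. The sketch below uses a small made-up DataFrame (the column names `Sample`, `Lipid_A` and `Lipid_B` are hypothetical) to show what a column with decimal commas left as text looks like and one possible way to convert it. It is an illustration under those assumptions, not a required step.

```python
import pandas as pd

# Toy stand-in for a freshly loaded table; names and values are made up
toy = pd.DataFrame({
    "Sample": ["S1", "S2", "S3"],
    "Lipid_A": ["1,5", "2,25", "3,0"],  # decimal commas that stayed as text
    "Lipid_B": [0.4, 0.8, 1.2],
})

# Text columns show up with a non-numeric dtype here
print(toy.dtypes)

# Find non-numeric measurement columns (everything except the sample label)
text_cols = [
    col for col in toy.columns
    if col != "Sample" and not pd.api.types.is_numeric_dtype(toy[col])
]

# Swap the decimal comma for a point and convert to numbers
for col in text_cols:
    toy[col] = pd.to_numeric(toy[col].str.replace(",", ".", regex=False))

print(toy.dtypes)
```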

## Transformation

Data transformations reduce skewness, bring distributions closer to normal, and can improve the performance of downstream models. The most common transformation in lipidomics data processing is the log transformation. Applying a log transformation to a DataFrame can be as simple as:

```python
import numpy as np
df.apply(np.log)  # this raises an error on the lipidomics demo dataset
```

On DataFrames that contain non-numerical columns, such as ours, the code above raises an error, so we must apply the transformation to the numerical columns only. Instead, we can use:

```python
# Apply log transformation to numeric columns only
numeric_cols = df.select_dtypes(include=["number"]).columns

# Assign directly: df.update() would silently skip any NaN results
df[numeric_cols] = np.log(df[numeric_cols])
```
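Lipidomics tables can contain zeros, for example for lipids below the detection limit, and `np.log(0)` evaluates to `-inf`. The sketch below, on a small made-up DataFrame with hypothetical column names, compares a plain log with `np.log1p`, which computes log(1 + x). Whether a shifted log suits your data depends on its scale, so treat this as one possible workaround rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Toy data; the column names and values are made up for illustration
toy = pd.DataFrame({
    "Sample": ["S1", "S2", "S3"],
    "Lipid_A": [0.0, 9.0, 99.0],
    "Lipid_B": [1.0, 10.0, 100.0],
})

numeric_cols = toy.select_dtypes(include=["number"]).columns

# A plain log turns the zero into -inf (the warning is silenced here)
with np.errstate(divide="ignore"):
    plain_log = np.log(toy[numeric_cols])

# log1p computes log(1 + x), so a zero maps to 0 instead of -inf
toy_log1p = toy.copy()
toy_log1p[numeric_cols] = np.log1p(toy[numeric_cols])

print(plain_log)
print(toy_log1p)
```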

To apply a different kind of transformation, replace the call to `np.log` with a different NumPy function. For example, to apply a square root transformation to the untransformed data, use `np.sqrt`:

```python
# Apply square root transformation to numeric columns only
numeric_cols = df.select_dtypes(include=["number"]).columns
df[numeric_cols] = np.sqrt(df[numeric_cols])
```
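Since only the NumPy function changes between transformations, the pattern can be wrapped in a small helper. The function name `transform_numeric` and the toy data below are hypothetical and not part of the tutorial's code; this is a minimal sketch that returns a transformed copy and leaves the input untouched.

```python
import numpy as np
import pandas as pd

def transform_numeric(data, func):
    """Return a copy of `data` with `func` applied to every numeric column."""
    result = data.copy()
    numeric_cols = result.select_dtypes(include=["number"]).columns
    result[numeric_cols] = func(result[numeric_cols])
    return result

# Toy data; the column names and values are made up for illustration
toy = pd.DataFrame({
    "Sample": ["S1", "S2"],
    "Lipid_A": [4.0, 9.0],
    "Lipid_B": [16.0, 25.0],
})

toy_sqrt = transform_numeric(toy, np.sqrt)  # square root transformation
toy_log2 = transform_numeric(toy, np.log2)  # base-2 log transformation
print(toy_sqrt)
```

Returning a copy makes it easy to compare several transformations of the same loaded data without reloading the file.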

## Scaling

For scaling the data in a Pandas DataFrame, we have created the following utility function:

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
import numpy as np

def apply_scaling(data, method):
    # Select numeric columns only
    data_numeric = data.select_dtypes(include=["number"]).copy()

    if method == "Centering":
        # Centering (subtract the mean for each column)
        data_numeric = data_numeric.apply(lambda x: x - x.mean(), axis=0)

    elif method == "Autoscaling":
        # Standardization (subtract the mean and divide by the standard deviation).
        # Note: StandardScaler uses the population standard deviation (ddof=0).
        scaler = StandardScaler()
        data_numeric = pd.DataFrame(
            scaler.fit_transform(data_numeric),
            columns=data_numeric.columns,
            index=data_numeric.index
        )

    elif method == "Range Scaling":
        # Min-Max scaling (scale to the range [0, 1])
        scaler = MinMaxScaler()
        data_numeric = pd.DataFrame(
            scaler.fit_transform(data_numeric),
            columns=data_numeric.columns,
            index=data_numeric.index
        )

    elif method == "Pareto Scaling":
        # Pareto scaling: (x - mean) / sqrt(std)
        data_numeric = (data_numeric - data_numeric.mean()) / np.sqrt(data_numeric.std())

    elif method == "Vast Scaling":
        # Vast scaling: autoscale, then multiply by mean / std (the inverse coefficient of variation)
        mean, std = data_numeric.mean(), data_numeric.std()
        data_numeric = (data_numeric - mean) / std * (mean / std)

    elif method == "Level Scaling":
        # Level scaling (subtract the mean, then divide by the mean)
        data_numeric = (data_numeric - data_numeric.mean()) / data_numeric.mean()

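    else:
        # Fail loudly on a misspelled method instead of returning unscaled data
        raise ValueError(f"Unknown scaling method: {method!r}")

    # Write the scaled values into a copy so the input DataFrame is left unchanged
    data_return = data.copy()
    data_return[data_numeric.columns] = data_numeric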

    return data_return

```

To use this on our data we can simply call:

```python
df_centered = apply_scaling(df, "Centering")
df_autoscaled = apply_scaling(df, "Autoscaling")
df_range_scaled = apply_scaling(df, "Range Scaling")
df_pareto_scaled = apply_scaling(df, "Pareto Scaling")
df_vast_scaled = apply_scaling(df, "Vast Scaling")
df_level_scaled = apply_scaling(df, "Level Scaling")
```
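A quick way to check scaled output is to look at the column statistics: centered columns should have a mean of about 0, autoscaled columns a standard deviation of 1, and range-scaled columns a minimum of 0 and a maximum of 1. The sketch below reproduces three of the methods with plain pandas on made-up data (hypothetical column names) instead of calling `apply_scaling`. It also illustrates a subtlety: scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), while the pandas `.std()` default is the sample standard deviation (`ddof=1`), so the two conventions differ slightly on small datasets.

```python
import numpy as np
import pandas as pd

# Toy data; the column names and values are made up for illustration
toy = pd.DataFrame({
    "Lipid_A": [1.0, 2.0, 3.0, 4.0],
    "Lipid_B": [10.0, 20.0, 40.0, 80.0],
})

centered = toy - toy.mean()
auto_pop = centered / toy.std(ddof=0)     # population std, the StandardScaler convention
auto_sample = centered / toy.std(ddof=1)  # sample std, the pandas default
range_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(centered.mean())                    # column means after centering
print(auto_pop.std(ddof=0))               # population std after autoscaling
print(auto_pop.std(ddof=1))               # same data, measured with the pandas default
print(range_scaled.agg(["min", "max"]))   # range after min-max scaling
```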

