# Clustered heatmaps

## Required packages

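The main packages for this section are pandas and seaborn. The examples below also import scikit-learn (for the StandardScaler) and statsmodels (for the ANOVA), and pandas needs openpyxl to read the .xlsx file. These can be installed with the following command in the command window (Windows) / terminal (Mac).

```
pip install pandas seaborn scikit-learn statsmodels openpyxl
```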

## Loading the data

As in the other sections, we will use the lipidomics demo dataset:

{% file src="/files/O8GcvqGYOzpXcfU7NayN" %}

```python
import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)
```

## Clustered heatmaps

### Data normalisation

The first step is to normalise the data so that each feature (lipid) has zero mean and unit variance. This can easily be done with the StandardScaler from sklearn. By indexing the DataFrame with df.iloc\[:,1:] we select all the data in the DataFrame except for the first column, which contains the labels:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
df_normalised = pd.DataFrame(df_normalised, index=df.index, columns=df.columns[1:])
```
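As a quick sanity check of what the scaling does, the sketch below reproduces it by hand with pandas on a small made-up table. The sample and lipid names are hypothetical and not taken from the demo dataset. StandardScaler divides by the population standard deviation (ddof=0), so the sketch does the same.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 samples x 2 lipids (not from the demo dataset)
toy = pd.DataFrame(
    {"lipid_A": [1.0, 2.0, 3.0, 4.0], "lipid_B": [10.0, 20.0, 30.0, 40.0]},
    index=["s1", "s2", "s3", "s4"],
)

# Column-wise standardisation: subtract the mean, divide by the
# population standard deviation (ddof=0), as StandardScaler does
toy_scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_scaled.mean().round(6))       # every column mean is ~0
print(toy_scaled.std(ddof=0).round(6))  # every column std is ~1
```

Because lipid_B is simply ten times lipid_A, both columns become identical after scaling. This is the reason for normalising before clustering: lipids are compared by their pattern across samples, not by their absolute abundance.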

### Heatmap

To obtain a clustered heatmap, we can use the clustermap function from seaborn:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.clustermap(df_normalised, 
               figsize=(40,60),
               method="centroid",
               metric="euclidean",
               row_cluster=True,
               col_cluster=True);
               
plt.show()
```

We set the figsize to values that are large enough for the labels of every row and column to be shown. With method we configure how the distance between clusters is calculated, and with metric how the distance between two n-dimensional vectors is calculated. All possible options are described in the SciPy documentation for [method](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage) and [metric](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist). Row or column clustering can optionally be set to False.
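To make the roles of method and metric more concrete, the following sketch calls the SciPy functions from the linked documentation directly on a made-up two-dimensional dataset with two obvious groups. The data and variable names are illustrative assumptions; when you use clustermap, seaborn takes care of these steps for you.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: 6 observations in 2 well-separated groups
X = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
])

# metric: how the distance between two vectors is calculated.
# pdist returns the condensed distance vector with n*(n-1)/2 entries.
distances = pdist(X, metric="euclidean")

# method: how the distance between two clusters is calculated
# while the hierarchy is built. Z has one row per merge (n-1 rows).
Z = linkage(X, method="centroid", metric="euclidean")

# Cut the tree into 2 flat clusters to inspect the grouping
clusters = fcluster(Z, t=2, criterion="maxclust")
print(distances.shape, Z.shape, clusters)
```

According to the SciPy documentation, the centroid, median and ward methods are only correctly defined for the Euclidean metric. When using another metric, it is safer to combine it with a method such as average or complete.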

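We can save the heatmap to a file with the command below. Call it before plt.show(), because once the figure has been shown and closed, savefig would write an empty image: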

```python
import matplotlib.pyplot as plt
plt.savefig('python_clusterhmap.png', dpi=300, bbox_inches='tight')
```

<figure><img src="/files/yAKmXV0yHHDPmRuxvfHw" alt=""><figcaption></figcaption></figure>
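An alternative way to save the figure is to keep the object returned by clustermap and call its savefig method, which avoids depending on which figure is currently active. The sketch below uses random toy data, hypothetical sample and lipid names, and a temporary output path, all of which are assumptions for illustration only.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: 6 samples x 4 lipids of random values
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(6, 4)),
    index=[f"sample_{i}" for i in range(6)],
    columns=[f"lipid_{j}" for j in range(4)],
)

# clustermap returns a ClusterGrid object that owns the figure
g = sns.clustermap(toy, figsize=(4, 4), method="centroid", metric="euclidean")

# Save through the returned object (illustrative output location)
out_path = os.path.join(tempfile.gettempdir(), "toy_clusterhmap.png")
g.savefig(out_path, dpi=100)
```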

**Note:** depending on the size of the dataset, you may get a warning stating that "Installing *fastcluster* may give better performance". This warning can be ignored. If the clustering does turn out to be too slow, you can install the *fastcluster* package with **"pip install fastcluster"**.

We can also add an extra column of colors next to the rows to indicate the Label of each sample:

```python
lut = dict(zip(df.Label.unique(), "rbg"))
row_colors = df.Label.map(lut)

sns.clustermap(df_normalised, 
               figsize=(40,60),
               method="centroid",
               metric="euclidean",
               row_cluster=True,
               col_cluster=True,
               row_colors=row_colors
              );
plt.show()
```

<figure><img src="/files/7QzTT9GjS7AfuwkMzmwo" alt=""><figcaption></figcaption></figure>
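To see what the lookup table and the row colors contain, here is a minimal sketch on a made-up Label column. The label values and sample names are hypothetical and may differ from those in the demo dataset.

```python
import pandas as pd

# Hypothetical labels for 4 samples
labels = pd.Series(["N", "T", "N", "PAN"],
                   index=["s1", "s2", "s3", "s4"], name="Label")

# Pair each unique label (in order of first appearance) with one
# character of "rbg": red, blue, green
lut = dict(zip(labels.unique(), "rbg"))

# Replace every label by its color; the index (sample names) is kept,
# which is how clustermap matches colors to rows
row_colors = labels.map(lut)

print(lut)
print(row_colors)
```

Since zip stops at the shortest input, the string "rbg" covers at most three different labels. With a fourth label, that label would be missing from lut and its rows would map to NaN, so the color list needs one entry per label.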

### Selecting a subset of interesting species

A more compact and visually more interesting heatmap can be obtained by keeping only the more interesting lipid species, for example species that differ significantly between the groups. Here we will do an ANOVA test and only keep species with p < 0.0001.

We adapt our previous ANOVA code to run in a loop and perform the ANOVA for each species. First we replace the special characters in the column names with an underscore, since the statsmodels formula interface does not currently accept these. During the loop we store the p-values in a pandas Series object named p\_values. Next we select all p-values smaller than 0.0001 and use the resulting boolean mask to index our original DataFrame.

```python
from statsmodels.formula.api import ols
from statsmodels.api import stats

df2 = df.copy()
df2.columns = df2.columns.str.rstrip()
df2.columns = df2.columns.str.replace(' ', '_')
df2.columns = df2.columns.str.replace(':', '_')
df2.columns = df2.columns.str.replace(';', '_')
df2.columns = df2.columns.str.replace('/', '_')
df2.columns = df2.columns.str.replace('-', '_')

p_values = pd.Series(dtype=float)
for species in df2.columns[1:]:
    model = ols(f"{species} ~ Label", data=df2).fit()
    ow_anova_table = stats.anova_lm(model, typ=2)
    p_value = ow_anova_table.loc["Label", "PR(>F)"]
    p_values[species] = p_value
    
index = (p_values < 0.0001).values
df2 = df.iloc[:,1:].loc[:,index]
```
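As a possible alternative, the same per-species test can be sketched with scipy.stats.f_oneway, which takes the groups as arrays and therefore does not require renaming the columns. For a single factor such as Label, a one-way ANOVA of this kind tests the same hypothesis as the formula-based version. The toy data, label values and lipid names below are hypothetical, and this is a swapped-in technique rather than the code used elsewhere in this guide.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: 3 groups x 3 samples, 2 lipids.
# "PC 34:1" differs strongly between groups, "SM 41:1;O2" does not.
toy = pd.DataFrame({
    "Label": ["N"] * 3 + ["T"] * 3 + ["PAN"] * 3,
    "PC 34:1": [0.9, 1.0, 1.1, 9.9, 10.0, 10.1, 19.9, 20.0, 20.1],
    "SM 41:1;O2": [1.0, 2.0, 3.0] * 3,
})

# One-way ANOVA per species: split the column by Label and
# pass one array per group to f_oneway
p_values = pd.Series({
    species: f_oneway(
        *[group.to_numpy() for _, group in toy.groupby("Label")[species]]
    ).pvalue
    for species in toy.columns[1:]
})

# Keep only the species below the significance threshold
selected = toy[p_values[p_values < 0.0001].index]
print(p_values)
print(list(selected.columns))
```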

As before, we perform standard scaling on this data and use the clustermap function from seaborn to draw the heatmap:

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_normalised = scaler.fit_transform(df2)
df_normalised = pd.DataFrame(df_normalised, index=df.index, columns=df2.columns)

sns.clustermap(df_normalised, 
               figsize=(15,60),
               method="centroid",
               metric="euclidean",
               row_cluster=True,
               col_cluster=True);
plt.show()
```

<figure><img src="/files/K49w1ZXlLVH7voyWLPH9" alt=""><figcaption></figcaption></figure>


