Uniform Manifold Approximation and Projection

Required packages

This section requires pandas, seaborn, scikit-learn and umap-learn. Install them by running the following command in the command prompt (Windows) or terminal (Mac):

pip install pandas seaborn scikit-learn umap-learn

Loading the data

For this section the following dataset will be used:

GOUT_CTRL_QC_Ales_data_31012025.xlsx

Load the dataset into a Pandas DataFrame named df as described in the basic plotting section:

import pandas as pd
df = pd.read_excel("GOUT_CTRL_QC_Ales_data_31012025.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)

PCA

Before running UMAP, it is common to first apply PCA to the high-dimensional data to reduce it to a smaller number of dimensions (e.g. 30 or 50), and then to apply UMAP to the resulting principal components.

Data normalisation

The first step is to normalise the data so that each feature (lipid) has zero mean and unit variance. This can be done with the StandardScaler from sklearn. Indexing the dataframe with df.iloc[:,1:] selects all the data in the dataframe except for the first column, which contains the labels:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
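
As a quick sanity check, the sketch below applies the same scaling to a small synthetic table and confirms that each feature ends up with a mean of approximately zero and unit variance. The column names, label values and numbers are made up for illustration and are not taken from the dataset. It also shows that StandardScaler returns a NumPy array rather than a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lipid table: a "Label" column followed by features.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(loc=5.0, scale=3.0, size=(20, 4)),
                   columns=["lipid_a", "lipid_b", "lipid_c", "lipid_d"])
toy.insert(0, "Label", ["GOUT", "CTRL"] * 10)  # hypothetical label values

# Same call as above: skip the first (label) column and scale the rest.
toy_normalised = StandardScaler().fit_transform(toy.iloc[:, 1:])

print(type(toy_normalised))         # a NumPy array, not a DataFrame
print(toy_normalised.mean(axis=0))  # approximately 0 for every feature
print(toy_normalised.std(axis=0))   # approximately 1 for every feature
```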

PCA

Next, we use the PCA class from sklearn. We keep the first 50 principal components and apply the PCA algorithm to the normalised data:

from sklearn.decomposition import PCA

n_components = 50
pca = PCA(n_components=n_components)
pca_features = pca.fit_transform(df_normalised)
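
PCA cannot return more components than the smaller of the number of samples and the number of features, so it can be worth guarding the choice of 50 and checking how much variance the retained components keep. The following is a minimal sketch on synthetic data (the array sizes are arbitrary assumptions, not the dimensions of the dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 60 samples, 100 features.
rng = np.random.default_rng(0)
toy = rng.normal(size=(60, 100))
toy_normalised = StandardScaler().fit_transform(toy)

# Cap the number of components at min(n_samples, n_features).
# The cap only matters when there are fewer than 50 samples or features.
n_components = min(50, *toy_normalised.shape)
pca = PCA(n_components=n_components)
pca_features = pca.fit_transform(toy_normalised)

# Cumulative share of the total variance kept by the retained components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca_features.shape)
print(f"variance retained: {cumulative[-1]:.1%}")
```

If the retained variance is low for the real data, increasing the number of components before running UMAP may be worth considering.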

UMAP

Then, using UMAP from umap-learn, we apply the UMAP algorithm to the 50 principal components:

from umap import UMAP

n_components = 2
umap = UMAP(n_components=n_components)
umap_features = umap.fit_transform(pca_features)

The fit_transform method of UMAP returns the results as a NumPy ndarray. Let's put these results in a Pandas DataFrame:

umap_features = pd.DataFrame(data=umap_features,
                             columns=[f"UMAP{i+1}" for i in range(n_components)],
                             index=df.index)
umap_features["Label"] = df.Label
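
Assigning the Label column works because pandas matches rows by index rather than by position, which is why the new DataFrame is given the same index as df. A small sketch with hypothetical sample names and labels illustrates this:

```python
import numpy as np
import pandas as pd

# Hypothetical sample names and labels, for illustration only.
labels = pd.Series(["GOUT", "CTRL", "GOUT"],
                   index=["s1", "s2", "s3"], name="Label")
embedding = np.array([[0.1, 0.2], [1.0, 1.1], [2.0, 2.1]])

toy_umap = pd.DataFrame(embedding,
                        columns=["UMAP1", "UMAP2"],
                        index=labels.index)

# Assignment aligns on the index, so a reordered Series still lines up.
toy_umap["Label"] = labels.sort_values()
print(toy_umap)
```

Each sample keeps its own label even though the Series was reordered before assignment.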

We can now visualise the projection of the samples to the new feature space with a scatterplot:

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x="UMAP1",
                y="UMAP2",
                data=umap_features,
                hue="Label",
                palette=["red", "royalblue"])
plt.show()

