Uniform Manifold Approximation and Projection (UMAP)

Metabolites and lipids multivariate statistical analysis in R

Practical applications of UMAP (examples)

UMAP is another non-linear dimensionality reduction technique, designed to represent high-dimensional data (e.g., lipid or metabolite concentrations) in a lower-dimensional space, typically two or three dimensions. Like t-SNE, it has attracted growing interest across the -omics fields in recent years, particularly in genomics and transcriptomics, and an increasing number of lipidomics and metabolomics manuscripts now use UMAP for data analysis.

Selected examples of UMAP applications:

  • J. Wu et al. Lipidomic signatures align with inflammatory patterns and outcomes in critical illness. DOI: https://doi.org/10.1038/s41467-022-34420-4 - Fig. 1c, Fig. 2a & b, Fig. 3a & b (the authors of the study published in Nature Communications use UMAP multiple times to present sample clustering based on lipidomics data).

  • Y. Wang et al. Single-Cell Time-Resolved Metabolomics and Lipidomics Reveal Apoptotic and Ferroptotic Heterogeneity during Foam Cell Formation. DOI: https://doi.org/10.1038/s41467-022-34420-4 - Fig. 3A (the authors of the study published in Analytical Chemistry present a UMAP analysis of a single-cell lipidomics and metabolomics data set).

Required packages

The packages installed for this section are uwot, scales, and ggrepel. They can be installed by running the following commands in the R console. The code further below also calls readxl, ggimage, and the tidyverse, so make sure these are installed as well:

# Installation of all required packages:
install.packages("uwot")
install.packages("scales")
install.packages("ggrepel")

# Activate libraries:
library(uwot)
library(scales)
library(ggrepel)

# Additionally, activate tidyverse:
library(tidyverse)

Loading data into R

Here, we will use the lipidomics data set published by Kvasnička et al. in their manuscript "Alterations in lipidome profiles distinguish early-onset hyperuricemia, gout, and the effect of urate-lowering treatment", Arthritis Research & Therapy (2023).

Always make sure the appropriate working directory (wd) is set. If you haven't done that yet, set it in the first line of the code and then load the data into R. Read the data with the `read_excel()` function from the readxl package, which we saw earlier in the GitBook, and convert the result to a data.frame to make it easier to handle:

# Setting a working directory (wd)
setwd('D:/Data analysis')

# Loading data into R:
data <- readxl::read_excel("GOUT_CTRL_QC_Ales_data_31012025.xlsx")
data <- as.data.frame(data)
head(data)

Next, set the `Sample Name` column as row names:

# Set the row names the same as `Sample Name`
rownames(data) <- data$`Sample Name`
data$`Sample Name` <- NULL
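Row names in a data frame must be unique, so the assignment above fails if the same sample name occurs twice. A minimal sketch of a check that could be run before assigning row names is shown here. It uses a small made-up data frame (`toy`, with hypothetical sample names and values) rather than the gout data set:

```r
# Hypothetical toy data; in practice, use the `data` object loaded above.
toy <- data.frame(
  `Sample Name` = c("S1", "S2", "S2", "QC1"),
  Label = c("Gout", "Control", "Control", "QC"),
  `PC 34:1` = c(1.2, 0.8, 0.9, 1.0),
  check.names = FALSE
)

# Which sample names occur more than once?
dup_names <- unique(toy$`Sample Name`[duplicated(toy$`Sample Name`)])
dup_names

# make.unique() appends a numeric suffix to repeated names (S2, S2.1, ...),
# which is one simple way to obtain valid row names.
rownames(toy) <- make.unique(toy$`Sample Name`)
toy$`Sample Name` <- NULL
rownames(toy)
```

Whether renaming duplicates is appropriate depends on why they occur; duplicated names may also point to an error in the sample list that is better fixed at the source.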

Principal Component Analysis

Usually, before running UMAP, PCA is performed on the high-dimensional data to reduce the number of dimensions (e.g., to 30 or 50), and UMAP is then applied to the derived principal components.

The first step is to normalize the data so that the features (lipids) have zero mean and unit variance; this can quickly be done with the `scale()` function. By indexing the data frame with `data[, -1]`, we select all the data in the data frame except for the first column, which contains the labels. The result is stored as `data_normalized` to make it clear that we are working with normalized data:

# Data normalization before the PCA analysis (Auto-Scaling):
data_normalized <- data
data_normalized[, -1] <- scale(data[, -1])
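To confirm that auto-scaling worked, the column means and standard deviations can be inspected. The sketch below uses a small simulated data frame (`toy`, hypothetical values) in place of the gout data. It also removes constant columns first, because `scale()` returns `NaN` for a column whose standard deviation is zero:

```r
set.seed(1)
# Hypothetical toy data: a label column followed by three "lipid" columns.
toy <- data.frame(
  Label = rep(c("Gout", "Control"), each = 5),
  lipid_a = rnorm(10, mean = 50, sd = 10),
  lipid_b = rnorm(10, mean = 5, sd = 1),
  lipid_c = rep(3, 10)   # constant column, zero variance
)

# Identify constant columns before scaling (sd == 0 would give NaN).
constant_cols <- names(which(sapply(toy[, -1], sd) == 0))
toy <- toy[, !(names(toy) %in% constant_cols)]

# Auto-scaling, as in the code above.
toy_normalized <- toy
toy_normalized[, -1] <- scale(toy[, -1])

# Column means should be ~0 and standard deviations 1.
round(colMeans(toy_normalized[, -1]), 10)
apply(toy_normalized[, -1], 2, sd)
```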

Next, we run PCA with the `prcomp()` function on the normalized data and keep the first 50 principal components. Because the data were already centered by `scale()`, centering is switched off with `center = FALSE`:

# Principal Component Analysis:
n_components <- 50
pca_result <- prcomp(data_normalized[, -1], center = FALSE)
pca_features <- pca_result$x[, 1:n_components]
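`prcomp()` returns at most as many components as the smaller of the number of samples and the number of features, so selecting 50 only works when the data set is large enough. A minimal sketch, on simulated data (`toy_matrix`, hypothetical), of capping the number of components and checking how much variance they retain:

```r
set.seed(1)
# Hypothetical toy data: 40 samples x 20 features, auto-scaled.
toy_matrix <- scale(matrix(rnorm(40 * 20), nrow = 40, ncol = 20))

toy_pca <- prcomp(toy_matrix, center = FALSE)

# Never request more components than prcomp() produced.
n_requested <- 50
n_keep <- min(n_requested, ncol(toy_pca$x))
toy_features <- toy_pca$x[, 1:n_keep]

# Cumulative proportion of variance explained by the retained components.
explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cumsum(explained)[n_keep]
dim(toy_features)
```

Here only 20 components exist, so all of them are kept and the cumulative proportion is 1. On real data, the same calculation shows how much variance is discarded by truncating to 50 components.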

UMAP

Then, using the `umap()` function from the uwot package, we will apply the UMAP algorithm to the 50 principal components:

# Applying UMAP:
# Note: n_components is reused here and now holds the number of UMAP dimensions (2), not the 50 PCs.
n_components <- 2
umap_features <- umap(pca_features, n_components = n_components)
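UMAP's optimization is stochastic, so repeated runs can give somewhat different layouts. A minimal sketch of fixing the seed with `set.seed()`, which should make the layout repeatable with uwot's default settings, and of setting the two parameters that most affect the layout, `n_neighbors` and `min_dist`. The values shown are uwot's defaults and are not tuned for this data set; `toy_pcs` is simulated, hypothetical data standing in for `pca_features`:

```r
# uwot is assumed to be installed (see "Required packages").
if (requireNamespace("uwot", quietly = TRUE)) {
  set.seed(42)  # fix the random number generator for a repeatable layout
  # Hypothetical stand-in for `pca_features`: 60 samples x 10 components.
  toy_pcs <- matrix(rnorm(60 * 10), nrow = 60, ncol = 10)

  toy_umap <- uwot::umap(
    toy_pcs,
    n_components = 2,
    n_neighbors = 15,  # size of the local neighbourhood; should be smaller than the number of samples
    min_dist = 0.01    # smaller values allow tighter clusters in the embedding
  )
  dim(toy_umap)  # one row per sample, one column per UMAP dimension
}
```

Larger `n_neighbors` values emphasize global structure, while smaller values emphasize local structure, so it is worth comparing a few settings before interpreting clusters.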

The `umap()` function from the uwot package returns the results as a matrix, so let's convert them into a data frame for easier handling and add the sample labels:

# Transforming matrix of results into a data frame:
umap_df <- data.frame(X = umap_features[, 1], Y = umap_features[, 2])
umap_df$Label <- data_normalized$Label
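Before plotting, it can help to check which group labels are present and how many samples each has, since the colour mapping in the plot below names only `Gout` and `Control`. A minimal sketch on a made-up label vector (`toy_labels`, hypothetical; the actual label strings in the gout data set may differ):

```r
# Hypothetical labels; in practice, use umap_df$Label.
toy_labels <- c("Gout", "Gout", "Control", "QC", "Control", "QC", "Gout")

# Number of samples per group.
table(toy_labels)

# Colours as defined for the plot.
group_colors <- c("Gout" = "red", "Control" = "blue")

# Labels without an assigned colour; ggplot2 draws these in its default NA colour (gray).
unmapped <- setdiff(unique(toy_labels), names(group_colors))
unmapped
```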

We can now visualize the projection of the samples to the new feature space with a scatter plot (through ggplot2):

# Visualizing the UMAP score plot:
UMAP <- ggplot(umap_df, aes(x = X, y = Y, color = Label)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "UMAP Visualization",
    x = "UMAP1",
    y = "UMAP2"
  ) +
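  # Only Gout and Control are given a colour; any other label (here, the QC samples)
  # falls back to ggplot2's default NA colour, which is gray:
  scale_color_manual(values = c("Gout" = "red", "Control" = "blue")) +
  theme(legend.title = element_blank())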
  
# Exporting a high-quality, publication-ready plot:
# Call library:
library(ggimage)

# Exporting high-quality score plot.
## Generate a preview and optimize the plot presentation (UMAP scores plot):
ggpreview(plot = UMAP,               # The object that you want to preview.
          width = 300,               # Width in px.
          height = 300,              # Height in px.
          units = "px",              # Unit - of size - px.
          dpi = 300,                 # Sharpness.
          scale = 6)            # You may need to use a different scale.


## Save the plot in the working directory using ggsave (ggplot2 package - tidyverse):
ggsave(plot = UMAP,              # The R object to be saved.
       device = "svg",           # Format (SVG export requires the svglite package).
       filename = "UMAP_gout_FINAL.svg",
       width = 300,
       height = 300,
       units = "px",
       dpi = 300,
       scale = 6)
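The final size of the exported figure follows from `width`, `height`, `dpi`, and `scale` together: `ggsave()` converts the requested size to inches and multiplies it by `scale`. A short worked calculation with the values used above:

```r
width_px <- 300
dpi <- 300
scale_factor <- 6

# Pixels are first converted to inches, then multiplied by the scale factor.
width_in <- width_px / dpi * scale_factor   # 6 inches
# Equivalent size in pixels at the chosen resolution.
final_px <- width_in * dpi                  # 1800 px

c(inches = width_in, pixels = final_px)
```

So the 300 px setting with `scale = 6` corresponds to a 6 by 6 inch figure; changing `scale` is a quick way to adjust the relative size of text and points without editing the plot itself.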

The obtained UMAP visualization:

The UMAP visualization of the lipidomics data set. Gray points correspond to QC samples.
