t-Distributed Stochastic Neighbor Embedding (t-SNE)
Metabolites and lipids multivariate statistical analysis in R
t-SNE is a non-linear dimensionality reduction technique designed to represent high-dimensional data (e.g., lipid or metabolite concentrations) in a lower-dimensional space (typically two or three dimensions) while preserving the similarities between neighboring data points, i.e., the local structure. Interest in applying it to -omics data has grown in recent years, especially in genomics and transcriptomics. t-SNE is also increasingly featured in lipidomics and metabolomics studies, particularly those focused on single-cell -omics.
Examples of practical applications of this technique:
D. Hornburg et al. Dynamic lipidome alterations associated with human health, disease and ageing. DOI: - Fig. 2b (the authors utilized t-SNE to examine clustering based on the 100 most personalized lipids from 11 participants who provided at least 12 healthy samples).
S. E. Hancock et al. FACS-assisted single-cell lipidome analysis of phosphatidylcholines and sphingomyelins in cells of different lineages. DOI: - Fig. 4A (the authors used t-SNE to examine clustering of a single-cell lipidomics dataset consisting of C2C12 & HepG2 cells grown in both control (CON) and docosahexaenoic acid (DHA)-supplemented media).
Z. Wang et al. Data-Driven Deciphering of Latent Lesions in Heterogeneous Tissue Using Function-Directed t-SNE of Mass Spectrometry Imaging Data. DOI: (t-SNE application for clustering of mass spectrometry imaging data).
H. Tian et al. Multimodal mass spectrometry imaging identifies cell-type-specific metabolic and lipidomic variation in the mammalian liver. DOI: (t-SNE in mass spectrometry imaging application - cell clustering based on lipid & metabolite profiles).
The required packages for this section are uwot, Rtsne, scales, and ggrepel. These can be installed with the following command in the command window (Windows) / terminal (Mac):
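A minimal sketch of the installation step, assuming the packages are installed from CRAN from within an R session (the same call can also be passed to `Rscript -e` from a terminal):

```r
# Install the four packages named above from CRAN (only needed once)
install.packages(c("uwot", "Rtsne", "scales", "ggrepel"))
```

The later steps also rely on readxl and, in the plotting sketches, ggplot2; these are assumed to be installed already.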
Here, we will use the data set presented in the manuscript:
Always make sure the appropriate working directory (wd) is set. If you haven't done that yet, set it in the first line of the code, then load the data into R. Read the data with the `read_excel()` function from the readxl package, which we saw earlier in the GitBook, and convert the result to a data.frame to make the data easier to handle:
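The loading step could look roughly like the sketch below. The folder path is a placeholder, and because the actual data file is not named here, the example reads the demo workbook that ships with readxl (via `readxl_example()`) as a stand-in so that the code runs as written; replace `file_path` with the path to your own file.

```r
library(readxl)

# Placeholder path: uncomment and point it at your own project folder
# setwd("path/to/your/project")

# Stand-in file: a demo workbook bundled with readxl, NOT the lipidomics data
file_path <- readxl_example("datasets.xlsx")

# read_excel() returns a tibble; convert it to a plain data.frame
data <- as.data.frame(read_excel(file_path))
str(data)
```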
Next, set the `Sample Name` column as row names:
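One way to do this is sketched below on a tiny made-up table; the sample names, labels, and lipid column are hypothetical and only mimic the layout assumed in this section (a `Sample Name` column followed by a label column and the lipid columns).

```r
# Hypothetical miniature of the imported table
data <- data.frame(
  `Sample Name` = c("QC_1", "Control_1", "Disease_1"),
  Label         = c("QC", "Control", "Disease"),
  `Cer 34:1`    = c(1.2, 0.8, 1.9),
  check.names   = FALSE   # keep column names with spaces and colons as they are
)

rownames(data) <- data$`Sample Name`   # sample names become row names
data$`Sample Name` <- NULL             # drop the now redundant column
data
```

After this step the first column holds the labels, which is the layout the normalization step relies on.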
Usually, before running t-SNE, PCA is performed on the high-dimensional data to reduce the number of dimensions (e.g., to 30 or 50), and t-SNE is then applied to the derived principal components:
The first step is to normalize the data so that each feature (lipid) has zero mean and unit variance; this is easily done with the `scale()` function. Indexing the data frame with `data[,-1]` selects every column except the first, which contains the labels. The result is stored as `data_normalized` to make it clear that we are working with normalized data:
Next, we run PCA with the `prcomp()` function on the normalized data and keep the first 50 principal components:
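A small sketch of the normalization, run on simulated values because the real table is not reproduced here (the labels and lipid names are hypothetical). The last two lines check that each column ends up with mean 0 and standard deviation 1.

```r
set.seed(1)
# Hypothetical stand-in: 12 samples, a label column, and 5 lipid columns
data <- data.frame(
  Label = rep(c("Control", "Disease", "QC"), each = 4),
  matrix(rlnorm(12 * 5), nrow = 12,
         dimnames = list(NULL, paste0("Lipid", 1:5)))
)

# Drop the label column, then center and scale every lipid column
data_normalized <- scale(data[, -1])

round(colMeans(data_normalized), 10)   # column means after scaling
apply(data_normalized, 2, sd)          # column standard deviations after scaling
```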
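The PCA step might be sketched as follows. The data are simulated (60 samples, 80 lipids) as a stand-in for the real table, and the object name `pca_scores` is an assumption made for this illustration. Note that 50 components can only be extracted when the data set has more than 50 samples and at least 50 features.

```r
set.seed(1)
# Hypothetical stand-in: 60 samples x 80 lipids plus a label column
data <- data.frame(
  Label = rep(c("Control", "Disease", "QC"), each = 20),
  matrix(rlnorm(60 * 80), nrow = 60,
         dimnames = list(NULL, paste0("Lipid", 1:80))),
  row.names = paste0("S", 1:60)
)
data_normalized <- scale(data[, -1])

# Data are already centered and scaled, so prcomp() needs no extra scaling
pca <- prcomp(data_normalized, center = FALSE, scale. = FALSE)

# Scores of the samples on the first 50 principal components
pca_scores <- pca$x[, 1:50]
dim(pca_scores)
```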
Using the `Rtsne()` function from the Rtsne package, we'll apply the t-SNE algorithm on the 50 Principal Components:
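A sketch of this call on simulated data is given below. The perplexity value is an assumption chosen for the small simulated set (Rtsne requires that three times the perplexity stays below the number of samples minus one), and `pca = FALSE` is set because the PCA has already been done by hand. A seed is fixed because t-SNE starts from a random initialization.

```r
library(Rtsne)

set.seed(1)
# Hypothetical stand-in: 60 samples x 80 lipids, scaled and reduced to 50 PCs
lipids <- matrix(rlnorm(60 * 80), nrow = 60)
data_normalized <- scale(lipids)
pca_scores <- prcomp(data_normalized, center = FALSE)$x[, 1:50]

# Run t-SNE on the 50 principal components
tsne_out <- Rtsne(
  pca_scores,
  dims = 2,          # two-dimensional embedding
  perplexity = 10,   # assumed value; tune it to the size of your data set
  pca = FALSE        # skip Rtsne's internal PCA, it was done above
)

dim(tsne_out$Y)      # one row per sample, one column per t-SNE dimension
```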
The `Rtsne()` function returns a list whose `Y` element is a matrix of the embedded coordinates, so let's convert this matrix into a data frame for easier handling:
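The conversion can be sketched as below. To keep the example short, `tsne_out` is a hand-made list with the same shape as the relevant part of the `Rtsne()` output, and the column names `tSNE1`, `tSNE2`, and `Label` are names chosen for this illustration.

```r
set.seed(1)
# Hypothetical stand-in for Rtsne() output: coordinates are stored in $Y
tsne_out <- list(Y = matrix(rnorm(12), ncol = 2))
labels <- c("Control", "Control", "Disease", "Disease", "QC", "QC")

# Collect the two t-SNE dimensions and the sample labels in one data frame
tsne_df <- data.frame(
  tSNE1 = tsne_out$Y[, 1],
  tSNE2 = tsne_out$Y[, 2],
  Label = labels
)
tsne_df
```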
We can now visualize the projection of the samples to the new feature space with a scatter plot:
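A possible plotting sketch with ggplot2 is shown below. The scores are random numbers standing in for real t-SNE output, and the group names and colors are assumptions, apart from drawing the QC samples in gray.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for the t-SNE scores of 9 samples
tsne_df <- data.frame(
  tSNE1 = rnorm(9),
  tSNE2 = rnorm(9),
  Label = rep(c("Control", "Disease", "QC"), each = 3)
)

# Scatter plot of the two t-SNE dimensions, colored by sample group
p <- ggplot(tsne_df, aes(x = tSNE1, y = tSNE2, color = Label)) +
  geom_point(size = 3) +
  scale_color_manual(values = c(Control = "royalblue",
                                Disease = "red2",
                                QC      = "gray")) +
  labs(x = "t-SNE 1", y = "t-SNE 2", title = "t-SNE score plot") +
  theme_bw()
print(p)
```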
We obtain the following t-SNE score plot (in gray, QC samples):
If one would like to obtain a score plot without QC samples, this can be done through the following block of code:
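One simple option, sketched here on stand-in scores, is to filter the QC rows out of the score table before plotting. This is an assumption about how the plot could be produced; an alternative is to remove the QC samples from the data and rerun PCA and t-SNE, which gives a different embedding because t-SNE coordinates depend on all the points included.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for the t-SNE scores, including two QC samples
tsne_df <- data.frame(
  tSNE1 = rnorm(6),
  tSNE2 = rnorm(6),
  Label = c("Control", "Control", "Disease", "Disease", "QC", "QC")
)

# Keep only the rows that are not QC samples
tsne_no_qc <- tsne_df[tsne_df$Label != "QC", ]

p_no_qc <- ggplot(tsne_no_qc, aes(x = tSNE1, y = tSNE2, color = Label)) +
  geom_point(size = 3) +
  labs(x = "t-SNE 1", y = "t-SNE 2") +
  theme_bw()
print(p_no_qc)
```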
The updated version of the t-SNE score plot: