# t-Distributed Stochastic Neighbor Embedding

## Required packages

The required packages for this section are pandas, seaborn and scikit-learn. These can be installed with the following command in the command prompt (Windows) or terminal (macOS):

```bash
pip install pandas seaborn scikit-learn
```

## Loading the data

For this section the following dataset will be used:

{% file src="https://1939159422-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG1KwOmfbAgmKed8MHinY%2Fuploads%2F9Ayu3LALEGdrShIGwlcL%2FGOUT_CTRL_QC_Ales_data_31012025.xlsx?alt=media&token=8674db36-6c2f-44fb-9894-2e6c05cf5aa6" %}

Load the dataset into a Pandas DataFrame named df as described in the basic plotting section:

```python
import pandas as pd
df = pd.read_excel("GOUT_CTRL_QC_Ales_data_31012025.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)
```

## PCA

Before running t-SNE, PCA is usually performed on the high-dimensional data to reduce the number of dimensions (e.g. to 30 or 50), and t-SNE is then applied to the resulting principal components.

### Data normalisation

The first step is to normalise the data so that the features (lipids) have zero mean and unit variance. This can easily be done with the StandardScaler from *sklearn*. By indexing the DataFrame with df.iloc\[:,1:] we select all the data except the first column, which contains the labels:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
```
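As a quick sanity check, each scaled feature should end up with a mean of (approximately) zero and a standard deviation of one. The sketch below uses synthetic data, since the exact values depend on your dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lipid measurements (100 samples, 20 features)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 20))

X_scaled = StandardScaler().fit_transform(X)

# Each column now has (approximately) zero mean and unit variance
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```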

### PCA

Next, we use the PCA class from *sklearn*: we'll keep the first 50 principal components and apply the PCA algorithm to the normalised data (note that n\_components cannot exceed the number of samples or the number of features):

```python
from sklearn.decomposition import PCA

n_components = 50
pca = PCA(n_components=n_components)
pca_features = pca.fit_transform(df_normalised)
```
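To judge whether 50 components are enough, you can inspect the `explained_variance_ratio_` attribute of the fitted PCA object, which gives the fraction of total variance captured by each component. The example below runs on synthetic data; the numbers for the lipid dataset will differ:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 120 samples, 60 features
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 60))

pca = PCA(n_components=50)
pca.fit(X)

# Cumulative fraction of total variance captured by the components
cumulative = pca.explained_variance_ratio_.cumsum()
print(f"Variance retained by 50 components: {cumulative[-1]:.1%}")
```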

## t-SNE

Then, using TSNE from *sklearn*, we'll apply the t-SNE algorithm to the 50 principal components.

```python
from sklearn.manifold import TSNE

n_components = 2
tsne = TSNE(n_components=n_components)
tsne_features = tsne.fit_transform(pca_features)
```
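The t-SNE result depends strongly on its hyperparameters, in particular `perplexity` (roughly the effective number of neighbours considered per point, which must be smaller than the number of samples) and `random_state`, which fixes the random initialisation so the embedding is reproducible. A minimal sketch on synthetic data, standing in for the PCA features:

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the 50 PCA features (100 samples)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# perplexity must be < n_samples; random_state makes the run reproducible
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # (100, 2)
```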

The TSNE estimator from *sklearn* returns the results as a *numpy* ndarray; let's put these results in a Pandas DataFrame:

```python
tsne_features = pd.DataFrame(data=tsne_features,
                      columns=[f"t-SNE{i+1}" for i in range(n_components)],
                      index=df.index)
tsne_features["Label"] = df.Label
```

We can now visualise the projection of the samples to the new feature space with a scatterplot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x="t-SNE1",
                y="t-SNE2",
                data=tsne_features,
                hue="Label",
                palette=["red", "royalblue"])
plt.show()
```

<figure><img src="https://1939159422-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG1KwOmfbAgmKed8MHinY%2Fuploads%2FmZ6A1zDs1ieuyN2SqbLT%2Fgitbook_tsne.svg?alt=media&#x26;token=ff06ce95-c496-4406-9b15-b8b4bc208f44" alt=""><figcaption></figcaption></figure>
