# Principal Component Analysis

## Required packages

The required packages for this section are pandas, seaborn and scikit-learn. These can be installed with the following command in the command window (Windows) / terminal (Mac).

```
pip install pandas seaborn scikit-learn
```

## Loading the data

We will again use the demo lipidomics dataset:

{% file src="/files/O8GcvqGYOzpXcfU7NayN" %}

Load the dataset into a pandas DataFrame named `df`, as described in the basic plotting section:

```python
import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)
```

## PCA

### Data normalisation

The first step is to normalise the data so that each feature (lipid) has zero mean and unit variance. This can easily be done with the `StandardScaler` from sklearn. By indexing the DataFrame with `df.iloc[:,1:]` we select all the data in the DataFrame except for the first column, which contains the labels:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
```
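To make the scaling step concrete, the following minimal sketch reproduces the result of `StandardScaler` by hand. The `toy` DataFrame and its values are hypothetical stand-ins for the lipid columns, not the demo dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for two lipid columns
toy = pd.DataFrame({"Lipid A": [1.0, 2.0, 3.0, 4.0],
                    "Lipid B": [10.0, 20.0, 40.0, 50.0]})

scaled = StandardScaler().fit_transform(toy)

# Same result by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has a mean of 0 and a standard deviation of 1, so lipids with large absolute concentrations do not dominate the PCA.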

### PCA

Next, we use the `PCA` class from *sklearn*. We'll select the first 10 principal components and apply the PCA algorithm to the normalised data:

```python
from sklearn.decomposition import PCA

n_components = 10
pca = PCA(n_components=n_components)
pca_features = pca.fit_transform(df_normalised)
```

The `fit_transform` method of the *sklearn* `PCA` class returns the results as a *numpy* ndarray. Let's put these results in a pandas DataFrame:

```python
pca_features = pd.DataFrame(data= pca_features,
                      columns=[f"PC{i+1}" for i in range(n_components)],
                      index=df.index)
pca_features["Label"] = df.Label
```
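As a rough illustration of what `fit_transform` computes, the sketch below checks on random toy data (an assumption, not the demo dataset) that the PCA scores are the centred data projected onto the principal directions stored in `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 20 samples, 5 features
rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 5))

toy_pca = PCA(n_components=2)
scores = toy_pca.fit_transform(toy)

# Centre the data with the fitted mean, then project onto the components
manual_scores = (toy - toy_pca.mean_) @ toy_pca.components_.T

print(scores.shape)
print(np.allclose(scores, manual_scores))
```

Each row of `scores` is one sample and each column is one principal component, which is why the sample names can be reused as the index of the DataFrame.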

We can now visualise the projection of the samples to the new feature space with a scatterplot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='PC1',
       y='PC2',
       data=pca_features,
       hue="Label",
       palette=["royalblue","orange", "red"]);
plt.show()
```

<figure><img src="/files/gXzNQ6arydVBkEkIJJza" alt=""><figcaption></figcaption></figure>

### Explained variance

We can visualise the explained variance of the first n (we chose 10) principal components with:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlabel("Components")
ax.set_ylabel("Explained Variance")
explained_var = pd.DataFrame(data=pca.explained_variance_ratio_, index=range(1,n_components+1))
explained_var.plot.bar(ax=ax)
ax.tick_params(axis='x', labelrotation=0)
plt.show();
```
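The explained variance ratios can also be summed cumulatively to judge how many components are worth keeping. The sketch below uses random toy data and a hypothetical 80% threshold; both are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 30 samples, 8 features
rng = np.random.default_rng(0)
toy = rng.normal(size=(30, 8))

# Fit all components so that the cumulative ratio reaches 1
toy_pca = PCA().fit(toy)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.80  # hypothetical cut-off
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3))
print(n_needed)
```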

<figure><img src="/files/3X0xdkl2w1vq4pbRqbdR" alt=""><figcaption></figcaption></figure>

### PCA loadings

We'll extract the loadings from the PCA results and put them in a DataFrame:

```python
loadings = pd.DataFrame(data= pca.components_.T,
                      columns=[f"PC{i+1}" for i in range(n_components)],
                      index=df.columns[1:])
```
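A loadings table like this can be used to rank the features that contribute most to a component. The following sketch does this for PC1 on random toy data; the column names and the choice of the top three features are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data: 25 samples, 6 named features
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(25, 6)),
                   columns=[f"Lipid {c}" for c in "ABCDEF"])

toy_pca = PCA(n_components=3).fit(toy)
toy_loadings = pd.DataFrame(data=toy_pca.components_.T,
                            columns=["PC1", "PC2", "PC3"],
                            index=toy.columns)

# Rank features by the absolute size of their PC1 loading
top_pc1 = toy_loadings["PC1"].abs().sort_values(ascending=False).head(3)
print(top_pc1)

# Each column of components_.T is a unit-length direction
print((toy_loadings ** 2).sum().round(6))
```

Note that `components_` holds unit-length directions. Some texts reserve the word "loadings" for these directions scaled by the square root of the explained variance.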

Next, we can visualise the loadings in a scatterplot, labelling each data point with the name of its feature:

```python
fig, ax = plt.subplots(figsize=(15,15))

p1 = sns.scatterplot(x='PC1',
       y='PC2',
       data=loadings,
       size = 15,
       ax=ax,
       legend=False)  

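# Label each point with its feature name. loadings.index holds the lipid names;
# df.columns would be off by one because its first entry is "Label".
for line in range(0,loadings.shape[0]):
     p1.text(loadings.PC1.iloc[line]+0.002, loadings.PC2.iloc[line],
     loadings.index[line], horizontalalignment='left',
     size='small', color='black')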
plt.show();
```

<figure><img src="/files/e1jRP6fEHjxmvgITM22S" alt=""><figcaption></figcaption></figure>

## PCA on data with missing values

The standard PCA algorithm cannot handle datasets with missing values. As an alternative to imputing missing values before PCA, some PCA variants can deal with missing values directly. Avoiding a separate imputation step in this way can reduce bias. A popular variant that can handle missing values is Probabilistic Principal Component Analysis (PPCA), first described by Michael E. Tipping and Christopher M. Bishop (*Neural Computation* (1999) 11 (2): 443–482).
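The limitation of standard PCA can be seen with a small sketch: the *sklearn* `PCA` class raises a `ValueError` as soon as the input contains a NaN. The toy array below is a hypothetical example, not the demo dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data with a single missing value
rng = np.random.default_rng(0)
toy = rng.normal(size=(10, 4))
toy[2, 1] = np.nan

print(int(np.isnan(toy).sum()), "missing value(s)")

try:
    PCA(n_components=2).fit(toy)
    pca_failed = False
except ValueError as err:
    # scikit-learn validates its input and rejects NaN values
    pca_failed = True
    print("PCA refused the input:", err)
```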

### Required packages

This section uses the same packages as above: pandas, seaborn and scikit-learn. In addition, we need to install the pyppca package from GitHub:

```
pip install git+https://github.com/el-hult/pyppca.git
```

### Loading the data

We will use the demo lipidomics dataset with missing values:

{% file src="/files/WATiUqJSqFrormVPJd7r" %}

### Data normalisation

The first step is again to normalise the data so that each feature (lipid) has zero mean and unit variance, using the `StandardScaler` from sklearn. The scaler ignores missing values when it computes the mean and variance and leaves them as NaN in the output. By indexing the DataFrame with `df.iloc[:,1:]` we select all the data in the DataFrame except for the first column, which contains the labels:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
```
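The behaviour of `StandardScaler` on incomplete data can be checked with a short sketch. The toy array is hypothetical; the point is only that the NaN is kept in place while the remaining values are scaled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value in the second column
toy = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [3.0, 30.0],
                [4.0, 50.0]])

scaled = StandardScaler().fit_transform(toy)

# The NaN is still there, and the observed values have mean 0 and std 1
print(scaled)
print(np.nanmean(scaled, axis=0).round(6), np.nanstd(scaled, axis=0).round(6))
```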

### PPCA

Next, we use the `ppca` function from *pyppca*. We'll select the first 10 principal components and apply the PPCA algorithm to the normalised data. The first argument of the `ppca` function is a numpy array that contains the data, the second argument is the number of principal components, and the final argument is a boolean that indicates whether details of the algorithm's run should be printed:

```python
from pyppca import ppca

n_components = 10
C, ss, M, X, Ye = ppca(df_normalised, n_components, False)
```

The returned variables are:

```
C:  (D by d) C*C' + I*ss is the covariance model; C has scaled principal directions as columns
ss: (float)  isotropic variance outside subspace
M:  (D by 1) data mean
X:  (N by d) expected states
Ye: (N by D) expected complete observations (differs from Y if data is missing)
```
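To see how these shapes fit together, the sketch below builds random arrays with the documented dimensions and combines them following the PPCA generative model of Tipping and Bishop, in which an observation is approximately `C @ x + M`. It does not call pyppca: the arrays are hypothetical, reading N as the number of samples, D as the number of features and d as the number of components is an assumption, and so is the idea that pyppca's outputs combine in exactly this way.

```python
import numpy as np

# Assumed meaning of the dimensions: N samples, D features, d components
N, D, d = 6, 4, 2
rng = np.random.default_rng(0)

# Hypothetical stand-ins with the documented shapes (not real pyppca output)
C = rng.normal(size=(D, d))   # scaled principal directions as columns
X = rng.normal(size=(N, d))   # expected states (scores)
M = rng.normal(size=(D, 1))   # data mean
ss = 0.1                      # isotropic noise variance

# Model-based reconstruction of the observations: states mapped back to
# feature space plus the mean (PPCA generative model without the noise term)
reconstruction = X @ C.T + M.ravel()

# Covariance model C*C' + I*ss
cov_model = C @ C.T + ss * np.eye(D)

print(reconstruction.shape, cov_model.shape)
```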

The `ppca` function from *pyppca* returns the scores as a *numpy* ndarray in the `X` variable. Let's put these results in a pandas DataFrame:

```python
pca_features = pd.DataFrame(data= X,
                      columns=[f"PC{i+1}" for i in range(n_components)],
                      index=df.index)
pca_features["Label"] = df.Label
```

We can now visualise the projection of the samples to the new feature space with a scatterplot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='PC1',
       y='PC2',
       data=pca_features,
       hue="Label",
       palette=["royalblue","orange", "red"]);
plt.show()
```

<figure><img src="/files/Q5ghHoqcwuvW9NmjN2ds" alt=""><figcaption></figcaption></figure>

### Explained variance

We can visualise the explained variance of the first n (we chose 10) principal components with:

```python
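import numpy as np
import matplotlib.pyplot as plt

def explained_variance(C, Ye):
    # Total variance of the completed data (sum of the per-feature variances)
    total_variance = np.trace(np.cov(Ye.T))
    # Covariance of the completed data projected onto the directions in C
    covM = np.cov(np.dot(Ye, C).T)
    eigenvalues, _ = np.linalg.eig(covM)
    # Sort the eigenvalues from largest to smallest
    eigenvalues = np.sort(eigenvalues)[::-1]
    explained_variance_ratio = eigenvalues / total_variance
    return explained_variance_ratio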

fig, ax = plt.subplots()
ax.set_xlabel("Components")
ax.set_ylabel("Explained Variance")
explained_var = pd.DataFrame(data=explained_variance(C, Ye), index=range(1,n_components+1))
explained_var.plot.bar(ax=ax)
ax.tick_params(axis='x', labelrotation=0)
plt.show();
```

<figure><img src="/files/rgRcki1GAi7GRiln4kUD" alt=""><figcaption></figcaption></figure>

### PCA loadings

We'll take the loadings from the PPCA results (the matrix `C`) and put them in a DataFrame:

```python
# C already holds the scaled principal directions, one column per component,
# so it can be used directly as the loadings matrix.
loadings = pd.DataFrame(data=C,
                      columns=[f"PC{i+1}" for i in range(n_components)],
                      index=df.columns[1:])
```

Next, we can visualise the loadings in a scatterplot, labelling each data point with the name of its feature:

```python
fig, ax = plt.subplots(figsize=(15,15))

p1 = sns.scatterplot(x='PC1',
       y='PC2',
       data=loadings,
       size = 15,
       ax=ax,
       legend=False)  

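# Label each point with its feature name. loadings.index holds the lipid names;
# df.columns would be off by one because its first entry is "Label".
for line in range(0,loadings.shape[0]):
     p1.text(loadings.PC1.iloc[line]+0.002, loadings.PC2.iloc[line],
     loadings.index[line], horizontalalignment='left',
     size='small', color='black')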
plt.show();
```

<figure><img src="/files/HYrmxmKFea8prLJ2b2GN" alt=""><figcaption></figcaption></figure>


