
Principal Component Analysis


Required packages

The required packages for this section are pandas, seaborn, and scikit-learn. These can be installed with the following command in the command window (Windows) or terminal (Mac):

pip install pandas seaborn scikit-learn
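To confirm the installation succeeded, the packages can be imported and their versions printed (a quick optional check):

import pandas, seaborn, sklearn
# each import should succeed and print a version number
print(pandas.__version__, seaborn.__version__, sklearn.__version__)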

Loading the data

We will again use the demo lipidomics dataset (Lipidomics_dataset.xlsx). Load it into a Pandas DataFrame named df as described in the basic plotting section:

import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)
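A quick look at the dimensions and first rows can confirm the data loaded as expected:

# optional sanity check: number of samples/features and a data preview
print(df.shape)
print(df.head())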

PCA

Data normalisation

The first step is to normalise the data so that the features (lipids) have zero mean and unit variance. This can easily be done with the StandardScaler from sklearn. By indexing the dataframe with df.iloc[:,1:], we select all the data in the dataframe except for the first column, which contains the labels:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
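As a quick check, each scaled feature should now have (approximately) zero mean and unit variance:

import numpy as np

# means should be ~0 and standard deviations ~1 for every column (lipid)
print(np.round(df_normalised.mean(axis=0)[:5], 6))
print(np.round(df_normalised.std(axis=0)[:5], 6))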

PCA

Next, we use the PCA class from sklearn. We'll select the first 10 principal components and apply the PCA algorithm to the normalised data:

from sklearn.decomposition import PCA

n_components = 10
pca = PCA(n_components=n_components)
pca_features = pca.fit_transform(df_normalised)

The PCA class from sklearn returns the results as a numpy ndarray; let's put these results in a Pandas DataFrame:

pca_features = pd.DataFrame(data=pca_features,
                            columns=[f"PC{i+1}" for i in range(n_components)],
                            index=df.index)
pca_features["Label"] = df.Label

We can now visualise the projection of the samples onto the new feature space with a scatter plot:

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='PC1',
                y='PC2',
                data=pca_features,
                hue="Label",
                palette=["royalblue", "orange", "red"])
plt.show()
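Optionally, the axis labels can include the fraction of variance explained by each component (a small variant sketch; the labels must be set before calling plt.show()):

ax = sns.scatterplot(x='PC1', y='PC2', data=pca_features, hue="Label")
# annotate the axes with the explained variance ratio of each component
ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%})")
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%})")
plt.show()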

Explained variance

We can visualise the explained variance of the first n (we chose 10) principal components with:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlabel("Components")
ax.set_ylabel("Explained Variance")
explained_var = pd.DataFrame(data=pca.explained_variance_ratio_, index=range(1,n_components+1))
explained_var.plot.bar(ax=ax)
ax.tick_params(axis='x', labelrotation=0)
plt.show()
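To help decide how many components to keep, the cumulative explained variance can also be inspected; a minimal sketch:

import numpy as np

# cumulative fraction of the variance captured by the first k components
cumulative_var = np.cumsum(pca.explained_variance_ratio_)
for k, v in enumerate(cumulative_var, start=1):
    print(f"PC1-PC{k}: {v:.1%}")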

PCA loadings

We'll extract the loadings from the PCA results and put them in a DataFrame:

loadings = pd.DataFrame(data=pca.components_.T,
                        columns=[f"PC{i+1}" for i in range(n_components)],
                        index=df.columns[1:])
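Before plotting, it can be informative to list the lipids that contribute most strongly to a component, for example the ten largest absolute loadings on PC1:

# ten features with the largest absolute loading on the first component
print(loadings["PC1"].abs().sort_values(ascending=False).head(10))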

Next, we can visualise the loading scores in a scatter plot, including the labels of the features for each data point:

fig, ax = plt.subplots(figsize=(15,15))

p1 = sns.scatterplot(x='PC1',
                     y='PC2',
                     data=loadings,
                     s=15,
                     ax=ax,
                     legend=False)

# annotate each point with the name of the corresponding feature (lipid)
for line in range(loadings.shape[0]):
    p1.text(loadings.PC1.iloc[line]+0.002, loadings.PC2.iloc[line],
            loadings.index[line], horizontalalignment='left',
            size='small', color='black')
plt.show()

PCA on data with missing values

The standard PCA algorithm does not handle datasets with missing values. As an alternative to imputing missing values prior to PCA, there are PCA variants that deal with missing values directly; avoiding imputation in this way can reduce bias. A popular PCA variant that can handle missing values is Probabilistic Principal Component Analysis (PPCA), first described by Michael E. Tipping and Christopher M. Bishop (Neural Computation (1999) 11 (2): 443–482).

Required packages

The required packages for this section are pandas, seaborn, and scikit-learn. These can be installed with the following command in the command window (Windows) or terminal (Mac):

pip install pandas seaborn scikit-learn

In addition, we need to install the pyppca package from GitHub:

pip install git+https://github.com/el-hult/pyppca.git

Loading the data

We will use the demo lipidomics dataset with missing values (Matrix_missing_values_EXAMPLE.xlsx), loaded as shown below.
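A minimal loading sketch, assuming the file has the same layout as the complete dataset (a Sample Name column to use as index, followed by a Label column):

import pandas as pd

# load the dataset with missing values, using the same layout assumptions
# as for Lipidomics_dataset.xlsx
df = pd.read_excel("Matrix_missing_values_EXAMPLE.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)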

Data normalisation

The first step is to normalise the data so that the features (lipids) have zero mean and unit variance. This can easily be done with the StandardScaler from sklearn, which ignores NaNs when computing the mean and variance and passes them through unchanged, so the missing values are preserved for PPCA. By indexing the dataframe with df.iloc[:,1:], we select all the data in the dataframe except for the first column, which contains the labels:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])

PPCA

Next, we use the ppca function from pyppca. We'll again select the first 10 principal components and apply the PPCA algorithm to the normalised data. The first argument of the ppca function is a numpy array containing the data, the second argument is the number of principal components, and the final argument is a boolean indicating whether details of the algorithmic run should be printed:

from pyppca import ppca

n_components = 10
C, ss, M, X, Ye = ppca(df_normalised, n_components, False)

The returned variables are:

ss: ( float ) isotropic variance outside subspace
C:  (D by d ) C*C' + I*ss is covariance model, C has scaled principal directions as cols
M:  (D by 1 ) data mean
X:  (N by d ) expected states
Ye: (N by D ) expected complete observations (differs from Y if data is missing)
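As a side benefit, Ye contains the expected complete observations, so it can also serve as a model-based imputation of the missing values (note that these values are on the normalised scale):

# wrap the expected complete observations in a DataFrame to obtain a
# PPCA-based imputation of the missing values (on the normalised scale)
df_imputed = pd.DataFrame(Ye, columns=df.columns[1:], index=df.index)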

The ppca function from pyppca returns the principal component scores as a numpy ndarray in the X variable; let's put these results in a Pandas DataFrame:

pca_features = pd.DataFrame(data=X,
                            columns=[f"PC{i+1}" for i in range(n_components)],
                            index=df.index)
pca_features["Label"] = df.Label

We can now visualise the projection of the samples onto the new feature space with a scatter plot:

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='PC1',
                y='PC2',
                data=pca_features,
                hue="Label",
                palette=["royalblue", "orange", "red"])
plt.show()

Explained variance

We can visualise the explained variance of the first n (we chose 10) principal components with:

import numpy as np
import matplotlib.pyplot as plt

def explained_variance(C, Ye):
    # total variance of the expected complete observations
    total_variance = np.trace(np.cov(Ye.T))
    # covariance of the data projected onto the principal directions
    covM = np.cov(np.dot(Ye, C).T)
    eigenvalues, _ = np.linalg.eig(covM)
    eigenvalues = np.sort(eigenvalues)[::-1]
    explained_variance_ratio = eigenvalues / total_variance
    return explained_variance_ratio

fig, ax = plt.subplots()
ax.set_xlabel("Components")
ax.set_ylabel("Explained Variance")
explained_var = pd.DataFrame(data=explained_variance(C, Ye),
                             index=range(1, n_components+1))
explained_var.plot.bar(ax=ax)
ax.tick_params(axis='x', labelrotation=0)
plt.show()

PCA loadings

The loadings correspond directly to C, the scaled principal directions returned by ppca. We'll put them in a DataFrame:

loadings = pd.DataFrame(data=C,
                        columns=[f"PC{i+1}" for i in range(n_components)],
                        index=df.columns[1:])

Next, we can visualise the loading scores in a scatter plot, including the labels of the features for each data point:

fig, ax = plt.subplots(figsize=(15,15))

p1 = sns.scatterplot(x='PC1',
                     y='PC2',
                     data=loadings,
                     s=15,
                     ax=ax,
                     legend=False)

# annotate each point with the name of the corresponding feature (lipid)
for line in range(loadings.shape[0]):
    p1.text(loadings.PC1.iloc[line]+0.002, loadings.PC2.iloc[line],
            loadings.index[line], horizontalalignment='left',
            size='small', color='black')
plt.show()