Omics data visualization in R and Python
Metabolites and lipids multivariate statistical analysis in Python

Clustered heatmaps


Last updated 7 months ago

Required packages

The required packages for this section are pandas, seaborn, scikit-learn, and statsmodels. These can be installed with the following command in the command prompt (Windows) or terminal (macOS/Linux):

pip install pandas seaborn scikit-learn statsmodels

Loading the data

As in the other sections, we will use the lipidomics demo dataset:

import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)
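The decimal="," option converts comma-decimal numbers to floats while reading. The same option exists for read_csv, so its effect can be illustrated with a small synthetic table (the column names below are made up and not taken from the real dataset):

```python
import io
import pandas as pd

# Synthetic data mimicking the demo file's layout: a "Sample Name"
# column, a "Label" column, and comma-decimal intensity values.
csv = "Sample Name;Label;PC_34_1\nS1;A;1,5\nS2;B;2,25\n"
df = pd.read_csv(io.StringIO(csv), sep=";", decimal=",")
df.set_index("Sample Name", inplace=True)

print(df["PC_34_1"].tolist())  # comma decimals parsed as floats: [1.5, 2.25]
```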

Clustered heatmaps

Data normalisation

The first step is to normalise the data so that the features (lipids) have zero mean and unit variance. This can easily be done with the StandardScaler from sklearn. By indexing the data frame with df.iloc[:,1:] we select all columns except the first, which contains the labels:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
df_normalised = pd.DataFrame(df_normalised, index=df.index, columns=df.columns[1:])
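As a quick check of what StandardScaler does: after fitting, every column of the output has (approximately) zero mean and unit variance. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = StandardScaler().fit_transform(X)  # per-column standardisation

print(Z.mean(axis=0))  # ~[0. 0.]
print(Z.std(axis=0))   # ~[1. 1.]
```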

Heatmap

To obtain a clustered heatmap, we can use the clustermap function from seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

sns.clustermap(df_normalised, 
               figsize=(40,60),
               method="centroid",
               metric="euclidean",
               row_cluster=True,
               col_cluster=True);
               
plt.show()

We can save the heatmap to a file. Note that plt.savefig must be called before plt.show(), since show() empties the current figure in non-interactive sessions:

plt.savefig('python_clusterhmap.png', dpi=300, bbox_inches='tight')
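Alternatively, clustermap returns a ClusterGrid object with its own savefig method, which does not depend on the current matplotlib figure. A minimal sketch on random data (the file name is just an example):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import seaborn as sns

data = np.random.default_rng(1).normal(size=(6, 4))
g = sns.clustermap(data)  # g is a seaborn ClusterGrid
g.savefig("clustermap_demo.png", dpi=150, bbox_inches="tight")
```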

Note: depending on the size of the dataset, you may see a warning stating that "Installing fastcluster may give better performance". This warning can be ignored; alternatively, if you find the clustering too slow, you can install the fastcluster package with "pip install fastcluster".

We can also add an extra color annotation to indicate the Label of each sample:

lut = dict(zip(df.Label.unique(), "rbg"))
row_colors = df.Label.map(lut)

sns.clustermap(df_normalised, 
               figsize=(40,60),
               method="centroid",
               metric="euclidean",
               row_cluster=True,
               col_cluster=True,
               row_colors=row_colors
              );
plt.show()
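The lut/map idiom above is simply a label-to-color dictionary applied element-wise; a small illustration with hypothetical labels (not the dataset's real group names):

```python
import pandas as pd

# Hypothetical sample labels (three groups)
labels = pd.Series(["Control", "Case", "Control", "QC"], name="Label")
lut = dict(zip(labels.unique(), "rbg"))  # unique labels -> 'r', 'b', 'g'
row_colors = labels.map(lut)             # one color code per sample

print(row_colors.tolist())  # ['r', 'b', 'r', 'g']
```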

Selecting a subset of interesting species

A more compact and visually more interesting heatmap can be obtained by filtering for the most interesting lipid species, for example species that are significantly different between the groups. Here we will perform an ANOVA test and keep only species with p < 0.0001.

We adapt our previous ANOVA code to run a loop and perform the ANOVA for each species. First we replace any special characters in the column names with underscores, since statsmodels formulas do not currently accept them. During the loop we store the p-values in a pandas Series object named p_values. Next we select all p-values smaller than 0.0001 and use this selection to index our original DataFrame.

from statsmodels.formula.api import ols
from statsmodels.api import stats

df2 = df.copy()
df2.columns = df2.columns.str.rstrip()
df2.columns = df2.columns.str.replace(' ', '_')
df2.columns = df2.columns.str.replace(':', '_')
df2.columns = df2.columns.str.replace(';', '_')
df2.columns = df2.columns.str.replace('/', '_')
df2.columns = df2.columns.str.replace('-', '_')

p_values = pd.Series(dtype=float)
for species in df2.columns[1:]:
    model = ols(f"{species} ~ Label", data=df2).fit()
    ow_anova_table = stats.anova_lm(model, typ=2)
    p_value = ow_anova_table.loc["Label", "PR(>F)"]
    p_values[species] = p_value
    
index = (p_values < 0.0001).values
df2 = df.iloc[:,1:].loc[:,index]

As before, we perform standard scaling on this data and use the clustermap from seaborn to draw the heatmap:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_normalised = scaler.fit_transform(df2)
df_normalised = pd.DataFrame(df_normalised, index=df.index, columns=df2.columns)

sns.clustermap(df_normalised, 
               figsize=(15,60),
               method="centroid",
               metric="euclidean",
               row_cluster=True,
               col_cluster=True);
plt.show()

We set the figsize to values that result in the labels being shown for each row and column. With method we configure how the distance between clusters is calculated, and with metric how the distance between two n-dimensional vectors is calculated. All possible options are described in the SciPy documentation for linkage (method) and pdist (metric). Row or column clustering can optionally be set to False.
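Under the hood, seaborn delegates the clustering to SciPy: metric is passed to scipy.spatial.distance.pdist and method to scipy.cluster.hierarchy.linkage. A minimal sketch of that pipeline on three points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0]])
d = pdist(X, metric="euclidean")   # condensed pairwise distance vector
Z = linkage(d, method="centroid")  # hierarchical merge table, shape (n-1, 4)

print(Z.shape)  # (2, 4): two merges for three observations
```

The first merge joins the two closest points (distance 1.0); the third point is merged last, which is exactly the structure the dendrograms in the clustermap display.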
