Omics data visualization in R and Python

Basic plotting



Required packages

The required packages for this section are pandas, matplotlib and seaborn. Reading .xlsx files with pandas additionally requires the openpyxl package. All of them can be installed with the following command in the Command Prompt (Windows) or the terminal (macOS/Linux):

pip install pandas matplotlib seaborn openpyxl

Loading the data

The most straightforward way to load and manipulate tabular data in Python is through the pandas library. Pandas can load data from a range of formats; the most commonly used are Excel and .csv files. In this tutorial we will use the lipidomics demo dataset (in Excel format), which you can download with the link below.

Place the downloaded Lipidomics_dataset.xlsx file in the same folder as your JupyterLab notebook. Then run the following code in Jupyter:

import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
df.head()

The first line imports the pandas library and makes it available under the alias "pd". On the second line we call the read_excel function of pandas and pass it, in quotes, the name of the file we want to load (or the full path to the file if it is not stored in the same folder as the notebook). The decimal="," argument tells pandas to treat a comma as the decimal separator for values that are stored as text in the Excel file. The loaded data is stored as a DataFrame in the df variable.

On the third line we call the head() method of the DataFrame object that holds our data, which displays the first 5 rows of the table. It should look like this:

The rows of the table correspond to the samples and the columns to lipid species, except for the first two columns, which contain the unique sample IDs and the sample labels (the group each sample belongs to). A handy improvement that lets us access the data of individual samples by their unique ID is to use the column "Sample Name" as the index column:

df.set_index("Sample Name", inplace=True)

Now, if we want to access the data of, for example, sample 1a2, this can be done with:

df.loc["1a2"]
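Label-based lookups with .loc can be sketched on a small stand-in table. The sample IDs other than 1a2, the assignment of labels to samples, and all concentrations below are made up for illustration and are not taken from the demo dataset:

```python
import pandas as pd

# Hypothetical stand-in for the lipidomics table (all values are invented).
df = pd.DataFrame(
    {
        "Sample Name": ["1a1", "1a2", "1a3"],
        "Label": ["N", "PAN", "T"],
        "CE 16:1": [120.5, 98.2, 143.7],
        "TG 50:2": [310.0, 275.4, 298.1],
    }
).set_index("Sample Name")

# All columns of one sample, returned as a Series.
row = df.loc["1a2"]

# A single lipid of a single sample, returned as a scalar.
value = df.loc["1a2", "CE 16:1"]

# Several samples and several columns at once, returned as a DataFrame.
subset = df.loc[["1a1", "1a3"], ["Label", "CE 16:1"]]

print(row)
print(value)
print(subset)
```

Passing a list of IDs to .loc returns the rows in the order of the list, which is convenient for pulling out a handful of samples for inspection.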

The complete code for loading the data:

import pandas as pd

df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)
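Since .csv files are mentioned as the other common format, here is a minimal sketch of the same loading step with read_csv. The file content is an invented, semicolon-separated export with decimal commas, read from an in-memory string so that the example runs without any file on disk:

```python
import io

import pandas as pd

# Hypothetical export: semicolon-separated, decimal commas (invented values).
csv_text = (
    "Sample Name;Label;CE 16:1\n"
    "1a1;N;120,5\n"
    "1a2;PAN;98,2\n"
)

# sep=";" sets the column separator, decimal="," the decimal mark,
# and index_col makes "Sample Name" the index right at load time.
df_csv = pd.read_csv(
    io.StringIO(csv_text), sep=";", decimal=",", index_col="Sample Name"
)

print(df_csv.dtypes)
print(df_csv.loc["1a2", "CE 16:1"])
```

For a real file, the io.StringIO(...) argument would simply be replaced by the file name or path.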

Barplots

The pandas package has a number of built-in methods for plotting basic graphs. Under the hood, pandas relies on the matplotlib package for plotting, and to have more control over how the data is plotted, we will also work with matplotlib directly. Now that the data is loaded and ready, let's import matplotlib:

import matplotlib.pyplot as plt

Next, we define which lipid we want to plot, group our DataFrame by the Label column (which defines the group each sample belongs to) and calculate the mean and standard deviation per group. If the standard error of the mean is desired, replace std() with sem().

lipid = "CE 16:1"
df_group = df.groupby("Label")
# Select the lipid first, so statistics are computed for this column only
means = df_group[lipid].mean()
errors = df_group[lipid].std()
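The difference between std() and sem() can be checked on a small table: the standard error of the mean is the sample standard deviation divided by the square root of the group size. The concentrations below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical concentrations for three groups of three samples each.
df = pd.DataFrame(
    {
        "Label": ["N", "N", "N", "PAN", "PAN", "PAN", "T", "T", "T"],
        "CE 16:1": [100.0, 110.0, 120.0, 80.0, 95.0, 110.0, 140.0, 150.0, 175.0],
    }
)

# Compute count, mean, standard deviation and standard error per group.
stats = df.groupby("Label")["CE 16:1"].agg(["count", "mean", "std", "sem"])

# Recompute the standard error by hand: std / sqrt(n).
stats["sem_by_hand"] = stats["std"] / np.sqrt(stats["count"])

print(stats)
```

Both std() and sem() in pandas use the sample (n - 1) definition by default, so the two standard-error columns agree.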

Next, we create the fig and ax objects with matplotlib, which allow us to customize the titles (among many other parameters such as colors and line thickness, for which we refer to the matplotlib documentation). Finally, we call the plot.bar() method of the means Series and pass in the errors (yerr), the customized axis with its titles (ax), the size of the error-bar caps (capsize=4), the rotation of the tick labels (rot=0) and a list of color names, one per bar.

fig, ax = plt.subplots()
ax.set_ylabel("Concentration (nmol / mL)")
ax.set_title(lipid)
means.plot.bar(yerr=errors, ax=ax, capsize=4, rot=0, 
                color=["royalblue","crimson", "orange"]);
plt.show()

We should get something that looks like this:

The complete code to make a barplot from a loaded pandas DataFrame:

import matplotlib.pyplot as plt

lipid = "CE 16:1"
df_group = df.groupby("Label")
# Select the lipid first, so statistics are computed for this column only
means = df_group[lipid].mean()
errors = df_group[lipid].std()

fig, ax = plt.subplots()
ax.set_ylabel("Concentration (nmol / mL)")
ax.set_title(lipid)
means.plot.bar(yerr=errors, ax=ax, capsize=4, rot=0);
plt.show()
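The same pattern extends to several lipids at once: if means and errors are DataFrames with matching columns, plot.bar() draws one group of bars per label with an error bar on each. This is a sketch on invented data, not output from the demo dataset:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical concentrations (invented values) for two lipids.
df = pd.DataFrame(
    {
        "Label": ["N", "N", "N", "PAN", "PAN", "PAN", "T", "T", "T"],
        "CE 16:1": [100.0, 110.0, 120.0, 80.0, 95.0, 110.0, 140.0, 150.0, 175.0],
        "TG 50:2": [300.0, 320.0, 310.0, 270.0, 280.0, 265.0, 295.0, 305.0, 330.0],
    }
)

lipids = ["CE 16:1", "TG 50:2"]
grouped = df.groupby("Label")[lipids]
means = grouped.mean()    # rows: labels, columns: lipids
errors = grouped.std()    # same shape as means

fig, ax = plt.subplots()
# One cluster of bars per label, one bar per lipid, error bars from `errors`.
means.plot.bar(yerr=errors, ax=ax, capsize=4, rot=0)
ax.set_ylabel("Concentration (nmol / mL)")
# In a script, call plt.show() here to display the figure.
```

Because the columns of errors match those of means, pandas pairs each bar with its own standard deviation automatically.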

Boxplots

For boxplots, the built-in plotting capabilities of the pandas package are quite powerful. We can simply call the plot.box() function on our DataFrame. In this function we pass a list of the lipid IDs that we want to have plotted, and the y-axis title:

df.plot.box(
    column=['CE 16:1','TG 50:2', 'SM 42:2;O2'], 
    ylabel="Concentration (nmol / mL)");

The results should look like this:

By passing "Label" to the "by" argument, we can plot the species separately for the different Labels of the samples:

plot = df.plot.box(
    column=['CE 16:1','TG 50:2'], 
    ylabel="Concentration (nmol / mL)", by="Label");

And to get more control over the appearance of the box plot, we can switch to seaborn and define our preferred colors:

import seaborn as sns
my_pal = {"N": "royalblue", "PAN": "orange", "T":"red"}
sns.boxplot(data=df, x="Label", y='CE 16:1', palette=my_pal);
plt.show()
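To show several lipids split by label in a single seaborn box plot, the table first has to be reshaped from wide to long format. The following is a minimal sketch using melt() on invented data; the column names "Lipid" and "Concentration" are chosen here for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical concentrations (invented values) for two lipids.
df = pd.DataFrame(
    {
        "Label": ["N", "N", "N", "PAN", "PAN", "PAN", "T", "T", "T"],
        "CE 16:1": [100.0, 110.0, 120.0, 80.0, 95.0, 110.0, 140.0, 150.0, 175.0],
        "TG 50:2": [300.0, 320.0, 310.0, 270.0, 280.0, 265.0, 295.0, 305.0, 330.0],
    }
)

# Wide -> long: one row per (sample, lipid) pair.
long_df = df.melt(
    id_vars="Label",
    value_vars=["CE 16:1", "TG 50:2"],
    var_name="Lipid",
    value_name="Concentration",
)

my_pal = {"N": "royalblue", "PAN": "orange", "T": "red"}
fig, ax = plt.subplots()
# x places the lipids side by side, hue splits each lipid by sample label.
sns.boxplot(data=long_df, x="Lipid", y="Concentration",
            hue="Label", palette=my_pal, ax=ax)
ax.set_ylabel("Concentration (nmol / mL)")
# In a script, call plt.show() here to display the figure.
```

Since the palette is keyed by the values of the hue variable, each label keeps the same color across all lipids.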

Histograms

Creating a histogram with pandas is straightforward as well. We only need to create the ax object with matplotlib again to customize the axis titles:

fig, ax = plt.subplots()
ax.set_xlabel("Concentration (nmol / mL)")
ax.set_title("CE 16:1")
df["CE 16:1"].plot.hist(ax=ax);
plt.show()

And to show multiple groups with custom colors we can use seaborn:

import seaborn as sns
pal = {"N": "royalblue", "PAN": "orange", "T":"red"}
sns.displot(df, x="CE 16:1", hue="Label", binwidth=50, palette=pal);
plt.show()
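What a fixed bin width means for the histogram can be made explicit by building the bin edges by hand and counting the values per bin. This sketch uses invented concentrations and aligns the edges to multiples of the bin width, which is an assumption made here for readability and may differ from where seaborn places its first edge:

```python
import numpy as np
import pandas as pd

# Hypothetical concentrations (invented values).
values = pd.Series(
    [62.0, 88.0, 101.0, 119.0, 131.0, 147.0, 152.0, 178.0, 203.0, 240.0],
    name="CE 16:1",
)

binwidth = 50
# Round the data range outwards to multiples of the bin width.
start = np.floor(values.min() / binwidth) * binwidth
stop = np.ceil(values.max() / binwidth) * binwidth
edges = np.arange(start, stop + binwidth, binwidth)

# Count how many values fall into each bin.
counts, edges = np.histogram(values, bins=edges)
print(edges)
print(counts)

# The same edges can be passed to the pandas histogram.
ax = values.plot.hist(bins=edges)
ax.set_xlabel("Concentration (nmol / mL)")
```

Passing explicit edges instead of a number of bins keeps the binning identical across plots, which makes histograms of different lipids or groups easier to compare.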

Density plots

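Creating density plots is very similar to creating histograms. Note that the plot.kde() method of pandas relies on the scipy package, which may need to be installed separately (pip install scipy):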

fig, ax = plt.subplots()
ax.set_xlabel("Concentration (nmol / mL)")
ax.set_title("CE 16:1")
df["CE 16:1"].plot.kde(ax=ax);
plt.show()

Or with seaborn:

import seaborn as sns
pal = {"N": "royalblue", "PAN": "orange", "T":"red"}
sns.displot(df, x="CE 16:1", hue="Label", palette=pal, kind="kde");
plt.show()
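To see what a density plot actually computes, the kernel density estimate can be written out with numpy: every data point contributes a small Gaussian bump, and the bumps are averaged. The sketch below uses invented concentrations and Scott's rule for the bandwidth, which to my knowledge is the default of scipy's gaussian_kde; the defaults used by pandas and seaborn may differ in detail:

```python
import numpy as np

# Hypothetical concentrations (invented values).
values = np.array([62.0, 88.0, 101.0, 119.0, 131.0,
                   147.0, 152.0, 178.0, 203.0, 240.0])
n = values.size

# Scott's rule for one-dimensional data: h = std * n^(-1/5).
bandwidth = values.std(ddof=1) * n ** (-1 / 5)

# Evaluate the density on a grid that extends beyond the data range.
grid = np.linspace(values.min() - 4 * bandwidth,
                   values.max() + 4 * bandwidth, 1000)

# Standardized distance of every grid point to every data point.
z = (grid[:, None] - values[None, :]) / bandwidth

# Average of Gaussian kernels centered on the data points.
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to approximately 1.
area = density.sum() * (grid[1] - grid[0])
print(bandwidth, area)
```

A smaller bandwidth gives a more wiggly curve that follows individual samples, a larger one a smoother curve, which is worth keeping in mind when comparing groups with few samples.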

Saving plots to image file

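Plots made with pandas, seaborn or matplotlib can all be saved by running the following command after the plot has been created. In Jupyter, run it in the same cell as the plotting code; in a script, call it before plt.show(), because the figure may no longer be available once it has been shown: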

plt.savefig('figure.png', dpi=300, bbox_inches='tight')

This requires that pyplot from matplotlib is imported:

import matplotlib.pyplot as plt

Instead of .png, vector formats such as .svg or .pdf can also be used, so that the figures can be edited in programs like Inkscape or Illustrator. For vector formats, the dpi setting only affects elements that are rasterized.

plt.savefig('figure.svg', dpi=300, bbox_inches='tight')
plt.savefig('figure.pdf', dpi=300, bbox_inches='tight')
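When a figure is needed in several formats, the savefig() method of the figure object can be called in a loop. This is a minimal sketch with invented group means; the temporary output folder is used here only so that the example does not write into the working directory:

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Hypothetical group means (invented values).
ax.bar(["N", "PAN", "T"], [110.0, 95.0, 155.0])
ax.set_ylabel("Concentration (nmol / mL)")

out_dir = Path(tempfile.mkdtemp())  # replace with your own output folder
saved = []
for ext in ("png", "svg", "pdf"):
    path = out_dir / f"figure.{ext}"
    # The output format is inferred from the file extension.
    fig.savefig(path, dpi=300, bbox_inches="tight")
    saved.append(path)

plt.close(fig)  # release the figure once all files are written
print(saved)
```

Calling savefig() on the fig object rather than on plt makes it explicit which figure is written, which helps when several figures are open at the same time.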
Lipidomics_dataset.xlsx (348 KB)