Omics data visualization in R and Python

Scatter plots and linear regression


Last updated 7 months ago

Required packages

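The required packages for this section are pandas, seaborn and statsmodels; matplotlib is installed automatically as a dependency of seaborn. Reading the .xlsx file with pandas additionally requires openpyxl. All of these can be installed with the following command in the Command Prompt (Windows) or the terminal (macOS/Linux):

pip install pandas seaborn statsmodels openpyxl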

Loading the data

We will again use the demo lipidomics dataset (Lipidomics_dataset.xlsx). Load it into a Pandas DataFrame named df, as described in the basic plotting section:

import pandas as pd
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)

Basic scatterplots

Creating a basic scatterplot with Pandas is as simple as:

df.plot.scatter(x="TG 50:3", y="TG 50:4");

Related lipid species tend to show high correlation! Let's check whether this behaviour is consistent within our sample subgroups by passing the Label column to the c parameter of the scatter function. Before we can do this, we must make sure that the Label column is converted to the pandas categorical data type:

df["Label"] = df["Label"].astype("category")
df.plot.scatter(x="TG 50:3", y="TG 50:4", c="Label", colormap='viridis');

There does seem to be a slight difference in the way these two lipid species correlate in the different sample groups. We'll investigate correlations in more detail in the chapter on correlation analysis.
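To put a number on the visual impression, one option is to compute the Pearson correlation separately for each group. The following is only a minimal sketch: it runs on synthetic data, and the group labels "N" and "T" as well as all values are made-up assumptions, not taken from the demo dataset. With the real data, df would take the place of demo.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lipidomics table (all values are made up).
rng = np.random.default_rng(0)
n = 30
labels = np.repeat(["N", "T"], n)  # hypothetical group labels
x = rng.normal(10, 2, size=2 * n)
y = 0.8 * x + rng.normal(0, 0.5, size=2 * n)
demo = pd.DataFrame({"Label": labels, "TG 50:3": x, "TG 50:4": y})
demo["Label"] = demo["Label"].astype("category")

# Pearson correlation between the two lipid species, computed per group.
group_r = pd.Series(
    {
        label: sub["TG 50:3"].corr(sub["TG 50:4"])
        for label, sub in demo.groupby("Label", observed=True)
    },
    name="pearson_r",
)
print(group_r)
```

Clearly different values between the groups would support what the coloured scatterplot suggests.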

Linear regression

Pandas currently offers no option for adding a linear regression line to a scatterplot. Fortunately, the Seaborn package makes this super easy:

import seaborn as sns
import matplotlib.pyplot as plt
sns.lmplot(data=df, x="TG 50:3", y="TG 50:4", ci=None);
plt.show()

If we want to see the confidence intervals, we just leave out "ci=None". Seaborn then draws its default 95% confidence band around the regression line.
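The call without ci=None could look like the sketch below. It uses a small synthetic DataFrame (named demo, with made-up values) so that it runs on its own; with the real data, df would be passed instead. The Agg backend line is an assumption made only so the example runs without a display and is not needed in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for two correlated lipid species (values are made up).
rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=60)
demo = pd.DataFrame({"TG 50:3": x, "TG 50:4": 0.8 * x + rng.normal(0, 0.5, size=60)})

# Omitting ci=None keeps seaborn's default ci=95: a shaded, bootstrapped
# 95% confidence band is drawn around the regression line.
g = sns.lmplot(data=demo, x="TG 50:3", y="TG 50:4")
# In an interactive session, follow this with plt.show() as in the examples above.
```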

And to get separate regressions for the different sample groups, we can just pass "Label" to the hue parameter:

sns.lmplot(data=df, x="TG 50:3", y="TG 50:4", hue="Label");
plt.show()
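lmplot draws the regression lines but does not return the fitted coefficients. If the per-group slopes and intercepts are needed as numbers, one possible approach is to fit a straight line per group with NumPy. This is a sketch on synthetic data: the group labels "N" and "T" and the slopes built into the data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data with a different slope in each hypothetical group.
rng = np.random.default_rng(2)
n = 40
labels = np.repeat(["N", "T"], n)
x = rng.normal(10, 2, size=2 * n)
true_slope = np.where(labels == "N", 0.5, 1.0)
y = true_slope * x + rng.normal(0, 0.3, size=2 * n)
demo = pd.DataFrame({"Label": labels, "TG 50:3": x, "TG 50:4": y})

# Fit a degree-1 polynomial (straight line) within each group.
# np.polyfit returns the coefficients from highest to lowest degree.
rows = {}
for label, sub in demo.groupby("Label"):
    slope, intercept = np.polyfit(sub["TG 50:3"], sub["TG 50:4"], deg=1)
    rows[label] = {"slope": slope, "intercept": intercept}

fits = pd.DataFrame.from_dict(rows, orient="index")
print(fits)
```

For standard errors and p-values, the statsmodels approach described next is the better fit.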

For a more formal regression analysis, with estimation of R-squared, the alpha and beta parameters, and calculation of statistical significance, the statsmodels package offers everything we need.

import statsmodels.formula.api as sm

df2 = df.copy()
df2.columns = df2.columns.str.rstrip()
df2.columns = df2.columns.str.replace(' ', '_')
df2.columns = df2.columns.str.replace(':', '_')

mod = sm.ols(formula="CE_16_1 ~ CE_16_0", data=df2)
res = mod.fit()
print(res.summary())

Let's go through the above code step by step. The formula interface of statsmodels does not play nicely with white space or special characters in the column (lipid) names, so we make a copy of our DataFrame named df2 (code line 3), remove any trailing whitespace (line 4), and replace spaces and colons with underscores (lines 5 and 6). Next we use the ordinary least squares regression model from statsmodels (sm.ols), passing in a formula and the DataFrame we just created. The formula syntax is similar to R: the dependent variable comes first, separated by a tilde "~" from the independent variable(s). Finally we call the fit() method on the model and print the summary of the results.

From the summary we can read that alpha (the intercept) is -1.8748 and beta (the slope) is 0.2208. Both parameters have a p-value smaller than 0.001 (the exact p-values can be obtained with res.pvalues), and the R-squared is 0.874.
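Instead of reading the numbers off the printed summary, they can also be pulled from the results object. The sketch below runs on synthetic data, so the printed values will not match those quoted above; the column names mimic the dataset, but all values are made up. It also shows one possible way to do the column clean-up in a single chained call.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic stand-in; note the trailing space in the first column name.
rng = np.random.default_rng(3)
x = rng.normal(20, 4, size=80)
demo = pd.DataFrame({"CE 16:0 ": x, "CE 16:1": 0.2 * x - 2 + rng.normal(0, 0.3, size=80)})

# Same clean-up as above in one step: strip trailing whitespace,
# then replace every space or colon with an underscore.
demo.columns = demo.columns.str.rstrip().str.replace(r"[ :]", "_", regex=True)

res = sm.ols(formula="CE_16_1 ~ CE_16_0", data=demo).fit()

alpha = res.params["Intercept"]  # intercept
beta = res.params["CE_16_0"]     # slope
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
print(f"R-squared = {res.rsquared:.3f}")
print(res.pvalues)     # exact p-values for intercept and slope
print(res.conf_int())  # 95% confidence intervals of both parameters
```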
