Omics data visualization in R and Python
Example data sets

Data sets used for computations and visualizations in this Gitbook


Last updated 8 months ago

Data set no. 1

In this Gitbook, all computations and visualizations are based on an example serum lipidomics data set containing concentrations of 127 lipid species measured in 97 healthy volunteers (N), 21 patients with pancreatitis (PAN), and 109 patients with pancreatic cancer (PDAC, or T). To simplify further steps, we assume that the data frame used for all computations and visualizations stores each lipid in a column (1 column = concentrations of 1 lipid for all patients) and each patient in a row (1 row = concentrations of all lipids for 1 patient). This table, stored as an Excel file, must be loaded into R/Python for data analysis and visualization. Instructions on how to load your data into R/Python are given in the next subchapters.
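The layout described above (one row per patient, one column per lipid) can be illustrated with a minimal sketch. It uses only the Python standard library, and everything in it is hypothetical: the column names `sample_id` and `group`, the lipid names, and all values are invented for illustration and are not taken from the real data set.

```python
from collections import Counter

# Toy table in the layout described above.
# All sample IDs, lipid names, and concentrations are invented.
rows = [
    {"sample_id": "N_001", "group": "N", "lipid_A": 1.2, "lipid_B": 30.5},
    {"sample_id": "PAN_001", "group": "PAN", "lipid_A": 0.9, "lipid_B": 28.1},
    {"sample_id": "T_001", "group": "T", "lipid_A": 0.7, "lipid_B": 22.4},
]

# Hypothetical metadata columns; every other column is treated as a lipid.
META_COLUMNS = ("sample_id", "group")


def lipid_columns(table):
    """Return the names of the lipid (non-metadata) columns."""
    return [name for name in table[0] if name not in META_COLUMNS]


def column_values(table, lipid):
    """One column = concentrations of one lipid for all patients."""
    return [row[lipid] for row in table]


def row_values(table, sample_id):
    """One row = concentrations of all lipids for one patient."""
    row = next(r for r in table if r["sample_id"] == sample_id)
    return {name: row[name] for name in lipid_columns(table)}


def group_sizes(table):
    """Count the patients in each group (N, PAN, T)."""
    return Counter(row["group"] for row in table)
```

With this layout, selecting a column gives everything measured for one lipid, and selecting a row gives the full lipid profile of one patient. Most of the later steps assume this orientation.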

All human serum samples were obtained from the Bank of Biological Material at the Masaryk Memorial Cancer Institute in Brno (PDAC, healthy volunteers, pancreatitis) and from the First and Third Faculty of Medicine at Charles University in Prague (pancreatitis). All involved institutes provided ethical approval, and all blood donors gave signed informed consent for blood collection. The sample selection was based on the availability of stored serum samples. The only exclusion criterion for healthy controls was a lifetime history of malignant disease; no other diseases were grounds for exclusion. For all PDAC patients, the disease was confirmed by abdominal computed tomography and/or endoscopic ultrasound followed by needle biopsy or surgical resection. Twenty-one patients with chronic pancreatitis treated at two outpatient departments were included. The pancreatitis was either ethanol-induced or recurrent acute pancreatitis, and it was confirmed by imaging methods (endoscopic ultrasound or endoscopic retrograde cholangiopancreatography). All PDAC patients, pancreatitis patients, and healthy controls were of Caucasian ethnicity.

Obtained serum samples were stored at −80 °C for further processing.

The lipid concentrations in the serum of PDAC patients and healthy controls were published in the following manuscript after normalization to the NIST plasma standard (NIST SRM 1950):

https://doi.org/10.1007/s00216-022-04490-w

First, please download the following data set and have a look at it:

Lipidomics_dataset.xlsx (348KB)
Exemplary data set no. 1, which will be used in this Gitbook for data analysis and visualization.

Data set no. 2

Data set no. 2 was created from Data set no. 1: missing values were introduced into the PDAC lipidomics data set using the R programming language.

As in Data set no. 1, lipid concentrations are in columns, and each row contains the concentrations for one patient.

Lipidomics_missing_values_EXAMPLE.xlsx (248KB)
Exemplary data set no. 2, which will be used in this Gitbook for missing values imputation.
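Data set no. 2 was produced in R, and the exact procedure is not described here. To illustrate the general idea of introducing missing values into a complete table, here is a minimal Python sketch that assumes the values are removed completely at random. The function name `introduce_missing`, the `fraction` and `seed` parameters, and the toy table are all hypothetical, and this is not the authors' procedure.

```python
import random


def introduce_missing(table, columns, fraction, seed=0):
    """Return a copy of `table` in which a given fraction of the cells
    in `columns` is replaced by None, chosen completely at random."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    cells = [(i, col) for i in range(len(table)) for col in columns]
    n_missing = round(fraction * len(cells))
    chosen = set(rng.sample(cells, n_missing))
    # Build new row dictionaries so the original table is left untouched.
    return [
        {key: (None if (i, key) in chosen else value) for key, value in row.items()}
        for i, row in enumerate(table)
    ]


# Toy table with invented values: one row per patient, one column per lipid.
complete = [
    {"sample_id": "S1", "lipid_A": 1.2, "lipid_B": 30.5},
    {"sample_id": "S2", "lipid_A": 0.9, "lipid_B": 28.1},
    {"sample_id": "S3", "lipid_A": 0.7, "lipid_B": 22.4},
    {"sample_id": "S4", "lipid_A": 1.1, "lipid_B": 25.0},
]

with_gaps = introduce_missing(complete, ["lipid_A", "lipid_B"], fraction=0.25)
```

Only the listed lipid columns can receive gaps, so identifiers such as `sample_id` stay intact. Real missingness in lipidomics data is often not random (for example, values below the detection limit), so a sketch like this is suitable only for practicing imputation.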