💪
Omics data visualization in R and Python
  • Introduction
    • From Authors
    • Virtual environments - let's begin
    • Getting started with Python
    • Getting started with R
    • Example data sets
  • PERFORMING FUNDAMENTAL OPERATIONS ON OMICs DATA USING R
    • Fundamental data structures
    • Loading data into R
    • Preferred formats in metabolomics and lipidomics analysis
    • Preprocess data type using Tidyverse package
    • Useful R tricks and features in OMICs mining
      • Application of pipe (%>%) functions
      • Changing data frames format with pivot_longer()
      • Data wrangling syntaxes useful in OMICs mining
      • Writing functions in R
      • The 'for' loop in R (advanced)
  • PERFORMING FUNDAMENTAL OPERATIONS ON OMICs DATA USING PYTHON
    • Fundamental data structures
    • Loading data into Python
  • Missing values handling in R
    • Missing values – Introduction
    • Detecting missing values (DataExplorer R package)
    • Filtering out columns containing mostly NAs
    • Data imputation by different available R libraries
      • Basic data imputation in R with dplyr and tidyr (tidyverse)
      • Data imputation using recipes library (tidymodels)
      • Replacing NAs via k-nearest neighbor (kNN) model (VIM library)
      • Replacing NAs via random forest (RF) model (randomForest library)
  • Missing values handling in Python
    • Detecting missing values
    • Filtering out columns containing mostly NAs
    • Data imputation
  • Data transformation, scaling, and normalization in R
    • Data normalization in R - fundamentals
    • Data normalization to the internal standards (advanced)
    • Batch effect corrections in R (advanced)
    • Data transformation and scaling - introduction
    • Data transformation and scaling using different available R packages
      • Data transformation and scaling using mutate()
      • Data transformation and scaling using recipes R package
      • Data Normalization – bestNormalize R package
  • Data transformation, scaling, and normalization in Python
    • Data Transformation and scaling in Python
  • Metabolites and lipids descriptive statistical analysis in R
    • Computing descriptive statistics in R
    • Using gtsummary to create publication-ready tables
    • Basic plotting in R
      • Bar charts
      • Box plots
      • Histograms
      • Density plots
      • Scatter plots
      • Dot plots with ggplot2 and tidyplots
      • Correlation heat maps
    • Customizing ggpubr and ggplot2 charts in R
    • Creating interactive plots with ggplotly
    • GGally for quick overviews
  • Metabolites and lipids descriptive statistics analysis in Python
    • Basic plotting
    • Scatter plots and linear regression
    • Correlation analysis
  • Metabolites and lipids univariate statistics in R
    • Two sample comparisons in R
    • Multi sample comparisons in R
    • Adjustments of p-values for multiple comparisons
    • Effect size computation and interpretation
    • Graphical representation of univariate statistics
      • Results of tests as annotations in the charts
      • Volcano plots
      • Lipid maps and acyl-chain plots
  • Metabolites and lipids univariate statistical analysis in Python
    • Two sample comparisons in Python
    • Multi-sample comparisons in Python
    • Statistical annotations on plots
  • Metabolites and lipids multivariate statistical analysis in R
    • Principal Component Analysis (PCA)
    • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    • Uniform Manifold Approximation and Projection (UMAP)
    • Partial Least Squares (PLS)
    • Orthogonal Partial Least Squares (OPLS)
    • Hierarchical Clustering (HC)
      • Dendrograms
      • Heat maps with clustering
      • Interactive heat maps
  • Metabolites and lipids multivariate statistical analysis in Python
    • Principal Component Analysis
    • t-Distributed Stochastic Neighbor Embedding
    • Uniform Manifold Approximation and Projection
    • PLS Discriminant Analysis
    • Clustered heatmaps
  • OMICS IN MACHINE LEARNING APPROACHES IN R AND PYTHON
    • Application of selected models to OMICs data
    • OMICs machine learning – Examples
  • References
    • Library versions
Powered by GitBook
On this page
  • Practical applications of PCA (examples)
  • Considerations - Datasets with Missing Values (NAs)
  • Package Installation and Loading
  • Data Transformation and Formatting
  • Calculating PCA and sPCA in mixOmics
  • Visualising PCA and sPCA Data from mixOmics
  • Calculating and Visualising PCA Data with FactoMineR and factoextra
  • The reconstruction of the PCA analysis presented in the manuscript (advanced version only)
  1. Metabolites and lipids multivariate statistical analysis in R

Principal Component Analysis (PCA)

Metabolites and lipids multivariate statistical analysis in R

PreviousStatistical annotations on plotsNextt-Distributed Stochastic Neighbor Embedding (t-SNE)

Last updated 8 months ago

A principal component analysis (PCA) is an excellent dimensionality reduction technique that can be used to extract and visualise variance in a lipidomics and metabolomics dataset without knowledge of class labels.

Practical applications of PCA (examples)

PCA is among the most often utilized multivariate statistical methods in metabolomics and lipidomics, primarily for the determination of sources of variance but also for quality control. Check out the following examples below:

  • A. Kvasnička et al. Alterations in lipidome profiles distinguish early-onset hyperuricemia, gout, and the effect of urate-lowering treatment. DOI: - Fig. 1A Fig. 5 (the authors utilize PCA for comparing healthy controls and patients' lipid profiles, e.g., they observe the separation of controls from patients along PC1 in all 3 PCA score plots based on plasma lipid composition; morover, they present measurements integrity - clustering of QC samples).

  • D. Olešová et al. Changes in lipid metabolism track with the progression of neurofibrillary pathology in tauopathies. DOI: - Fig. 3A, Fig. 4A, Fig. 5H (the authors use PCA analysis for presenting distinct metabolic patterns between experimental groups).

  • J. Idkowiak et al. Robust and high-throughput lipidomic quantitation of human blood samples using flow injection analysis with tandem mass spectrometry for clinical use. DOI: - Fig. 7B - C, Fig. 8A (the authors employ PCA analysis for comparing results obtained by different analytical methods after batch effect removal; in the case of Fig. 8A - authors present that second source of variance separates PDAC patients from healthy controls along PC2, as well as analytical method integrity by clustering of QC samples).

  • R. C. Prior, A. Silva & T. Vangansewinkel et al. PMP22 duplication dysregulates lipid homeostasis and plasma membrane organization in developing human Schwann cells. DOI: - e.g., Fig. 2A (the authors utilize PCA for comparing lipid profiles of sciatic nerves from C3 mice versus their wild-type (WT) littermate controls at 5 weeks of age).

  • Y. Yu et al. Unravelling lipidomic disruptions across multiple tissues in Chd8-mutant ASD mice through integration of lipidomics and single-cell transcriptomics. DOI: - Fig. 1b, Fig. 2h, (the authors of the study published in Gut utilize PCA analysis for analysis of similarities and differences between experimental groups).

  • D. Wolrab et al. Lipidomic profiling of human serum enables detection of pancreatic cancer. DOI: - Fig. 5a, Fig. 6a (the authors utilize PCA for studying similarities and differences in lipid profiles between PDAC patients and healthy controls).

  • D. Hornburg et al. Dynamic lipidome alterations associated with human health, disease and ageing. DOI: - Fig. 3a (the authors apply PCA for comparing two experimental groups (IR and IS), based on lipidome).

  • N. De Silva Mohotti et al. Lipidomic Analysis Reveals Differences in the Extent of Remyelination in the Brain and Spinal Cord. DOI: - Fig. 4, Fig. 5, Fig. 8 & Fig. 9 (the authors used PCA for comparing lipidomes among different experimental groups in a study dedicated to remyelination in the brain and spinal cord, i.e., overlapping or separation of biological groups in the PCA score plot).

There are several packages available in R that include functions capable of computing and often visualising PCA data such as; prcomp() from the base R stats package, PCA() from the FactoMineR package and pca() from the mixOmics package. Here we will focus on the implementation of the FactoMineR and the mixOmics package for the following reasons. First, mixOmics provides an easily customisable plotting function with plotIndiv() that can be used to plot individuals from PCA data in addition to several other informative plots. Similarly, the factoextra package provides plotting functions that can be used to extract PCA data directly from the PCA() output of FactoMineR. The mixOmics package also offers the Sparse PCA function spca(), which is an alternative to regular PCA and is suitable for large omics datasets. More information on Sparce PCA can be found on the mixOmics webpage ().

Considerations - Datasets with Missing Values (NAs)

As described earlier in this GitBook, missing values are common in lipidomics and metabolomics datasets generated by mass spectrometry based methods. For many statistical tests and models, missing values must be dealt with prior to the analysis. As described in the Missing Values Handling in R section, features with a large percentage of NA values can be filtered based on some pre-selected criteria. Alternatively, imputing missing values based on strategies such as k-nearest neighbours (kNN), or some percentage of the lowest concentration for each feature, is a valid approach that ensures each feature in the dataset is retained.

PCA analysis is one such example that is sensitive to missing values. The prcomp() and PCA() from the stats and FactoMineR packages reviewed below require datasets without missing values. If data containing missing values are given to the above functions in R, the following messages will be printed in the console.

Executing the prcomp() function from the stats package:

Error in svd(x, nu = 0, nv = k) : infinite or missing values in 'x'

While the PCA() function from FactoMineR package will only produce a warning message, however imputation will be performed for you based on the mean of each variable.

Warning message:
In PCA(X = pdac_NAs[, 3:ncol(pdac_NAs)], scale.unit = T, ncp = 5,  :
  Missing values are imputed by the mean of the variable: you should use the imputePCA function of the missMDA package

Package Installation and Loading

Ensure the mixOmics, factoextra and FactoMineR packages are installed and loaded. For data transformation and formatting, functions from dplyr will also be used. It should be noted that loading the mixOmics package will override the use of the select() function from dplyr. As such, calling the function using the dplyr::select() syntax will be required.

# First, we set our working directory (wd) containing our PDAC data set:
# setwd("dir")
# For example, if we have a folder named "Data analysis" on the D: drive, then:
setwd("D:/Data analysis")

# Then, we install mixOmics through Bioconductor: 
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("mixOmics")

# Install factoextra and FactoMineR from CRAN:
install.packages("factoextra") # For extracting and visualising PCA data
install.packages("FactoMineR") # For calculating PCA

# Load tidyverse and mixOmics
library(tidyverse)
library(mixOmics)
library(factoextra)
library(FactoMineR)

Data Transformation and Formatting

Here, we will again use the example PDAC lipidomics dataset to showcase both the mixOmics and factoextra packages. Additionally, to display the spca() and pca() functions capability of handling NA values, we will create an additional dataset with 15% of the data randomly replaced with NA.

# Load in data
pdac_raw <- readxl::read_xlsx(file.choose())

# Confirm presence of any NA values
which(is.na(pdac_raw))

# Randomly replace 15% of values with NA and assign them to a new data frame
pdac_NAs <- pdac_raw %>% 
  mutate(across(-c(1,2),
                ~ ifelse(test = runif(n()) < 0.15, NA, .))) 
                
# Assign non-NA data to a new dataframe
pdac_full <- pdac_raw

# We will also perform PCA for PDAC patients vs. healthy controls without PAN patients:
pdac_no_PAN <- pdac_raw %>%
  filter(Label != "PAN")
  
# Simirarly for the data set with 15% missing values (NA):
pdac_NAs_no_PAN <- pdac_raw %>% 
  mutate(across(-c(1,2),
                ~ ifelse(test = runif(n()) < 0.15, NA, .))) %>%
  filter(Label != "PAN")

Calculating PCA and sPCA in mixOmics

Now, let's use the data frames we have read into R to perform PCA and sPCA analysis on both the full dataset and the dataset that contains 15% missing values. Before we start, we can use ?pca() and ?spca() to see what parameters these functions take.

# Check parameters for pca function
?pca()

# Check parameters for spca function
?spca()

As we can see, there is some overlap in the arguments used in the pca() and spca() functions. Please refer to the R Documentation for more information, as these will not be discussed here. An important difference to note is the default values for the scale argument. In pca(), the scale argument is set to FALSE by default, while it is set to TRUE by default in spca(). The ncomp argument refers to the number of components (or reduced dimensions) the data will be reduced to. While the amount of variance in the dataset is maximised across PC1 and PC2 (and then decreasing across all further PCs), it is worth expanding the number of components to determine if important variance is captured by later PCs.

Now, we are ready to perform a PCA and sPCA analysis using mixOmics. Under the pca() function, we log10-transform our data frame (data matrix) to minimize the influence of outliers, we keep the center argument as TRUE to ensure PC1 describes the direction of maximum variance, and we change the scale argument to TRUE to prevent dominating the PCA by high-abundance lipids:

# Perform PCA on the full dataset
pca_full <- pca(X = log10(pdac_full[, 3:ncol(pdac_full)]), # the input will exclude the samples column and the label column; we also log10() transform our data frame (data matrix);
                ncomp = 5, # Number of components is set to 5
                scale = TRUE)
                
# Perform PCA on PDAC and healthy:
pca_full_no_PAN <- pca(X = log10(pdac_no_PAN[, 3:ncol(pdac_no_PAN)]),
                ncomp = 5,
                scale = TRUE) 
                
# Perform PCA on the dataset with 15% missing values                
pca_NAs <- pca(X = log10(pdac_NAs[, 3:ncol(pdac_NAs)]), # the input will exclude the samples column and the label column
                ncomp = 5,
                scale = TRUE)
                
# Perform PCA on the dataset with 15% missing values (PDAC vs. N only):
pca_NAs_no_PAN <- pca(X = log10(pdac_NAs_no_PAN[, 3:ncol(pdac_NAs_no_PAN)]), # the input will exclude the samples column and the label column
                      ncomp = 5,
                      scale = TRUE)

# Perform sPCA on the full dataset
spca_full <- spca(X = log10(pdac_full[, 3:ncol(pdac_full)]), # the input will exclude the samples column and the label column
                ncomp = 5,
                scale = TRUE)
                
# Perform sPCA on PDAC vs. healthy dataset
spca_no_PAN <- spca(X = log10(pdac_no_PAN[, 3:ncol(pdac_no_PAN)]), # the input will exclude the samples column and the label column
                  ncomp = 5,
                  scale = TRUE)

# Perform sPCA on the dataset with 15% missing values  
spca_NAs <- spca(X = log10(pdac_NAs[, 3:ncol(pdac_NAs)]), # the input will exclude the samples column and the label column
               ncomp = 5,
               scale = TRUE)

Executing these functions will perform the PCA or sPCA analysis. The output from these functions creates a list with all the associated data that can be assigned to a variable in the global environment. For a full description of each component of the output list, see the R Documentation using ?pca(). The important outputs that should be noted are; X - the input data matrix with observations scaled and centred (if scale = TRUE and center = TRUE); in the code block above, we additionally used log10() on our data frame (data matrix), loadings - a matrix of eigenvalues representing the influence of each variable on the principal components, variates - a matrix containing the PC values for projecting samples into the principal component space.

Let's have a look at some of these outputs, which can be called using the $ operator. We will use just the regular PCA on the full dataset for this.

# View the scaled and centered input matrix
View(pca_full$X)

# View the loadings matrix
View(pca_full$loadings[["X"]])

# View the variates matrix
View(pca_full$variates[["X"]])

Next, we will look at how we can visualise the PCA and sPCA data using the plotIndiv() function from the mixOmics package. It should be noted that if you want to create a custom ggplot, this can be done using the variates matrix.

Visualising PCA and sPCA Data from mixOmics

Plotting Samples with the plotIndiv().

Lets first look at plotting the samples and their position in the new principal component space. This is the standard PCA plot that is most commonly displayed. While PCA is an 'unsupervised' analysis, in that no prior knowledge of class labels is used to perform the analysis, the samples can be coloured according to a grouping variable. If samples appear to be clustering according to some class label, it can be inferred that the variance in the principal component is attributed to that class.

Check out the parameters included in the plotIndiv() function.

# Check R Documentation for plotIndiv()
?plotIndiv()

There is an extensive list of parameters that can be included in the plotIndiv() function, most of them are used to customise the output plot. Let's plot the samples from just the sPCA variations calculated above. Furthermore, let's colour samples by the grouping variable in the data frame that was used as input in the spca() function.

# Plot sPCA data from the full dataset and dataset with 15% missing values

# Full dataset 
spca_plt_full <- plotIndiv(spca_full, # First argument is the PCA list object
          group = pdac_full$Label, # The grouping variable in the pdac_full dataframe
          ind.names = F, # Labels for each datapoint will not be included
          rep.space = "XY-variate",
          comp = c(2, 3), # Plotting components 2 and 3 on the X and Y axis
          style = "ggplot2", # Produces a ggplot-like plot
          title = "sPCA on Full PDAC Data",
          col.per.group = c("royalblue", "orange", "red2"), # Colour assignment for the grouping variable
          ellipse = TRUE, # Draws 95% confidence ellipse around datapoints according to group
          centroid = FALSE, # Includes an additional datapoint at the centroid of each ellipse
          legend = TRUE, # Figure legend
          legend.title = "Group",
          abline = T, 
          alpha = 0)
          
# Dataset with 15% missing values
plotIndiv(spca_NAs,
          group = pdac_NAs$Label,
          ind.names = F,
          rep.space = "XY-variate",
          comp = c(2, 3),
          style = "ggplot2",
          title = "sPCA on PDAC Data (15% NAs)",
          col.per.group = c("red", "navyblue", "darkgreen"),
          ellipse = TRUE,
          centroid = FALSE,
          legend = TRUE,
          legend.title = "Group",
          abline = T,
          alpha = 0)
          
# PDAC vs. healthy controls:
spca_plt_no_PAN <- plotIndiv(spca_no_PAN, # First argument is the PCA list object
                          group = pdac_no_PAN$Label, # The grouping variable in the pdac_full dataframe
                          ind.names = F, # Labels for each datapoint will not be included
                          rep.space = "XY-variate",
                          comp = c(2, 3), # Plotting components 2 and 3 on the X and Y axis
                          style = "ggplot2", # Produces a ggplot-like plot
                          title = "sPCA on PDAC vs Healthy Controls Data",
                          col.per.group = c("royalblue", "red2"), # Colour assignment for the grouping variable
                          ellipse = TRUE, # Draws 95% confidence ellipse around datapoints according to group
                          centroid = FALSE, # Includes an additional datapoint at the centroid of each ellipse
                          legend = TRUE, # Figure legend
                          legend.title = "Group",
                          abline = T, 
                          alpha = 0)

As we can see above, the clustering of samples and relative percentages of explained variance for PC2 and PC3 from both the full PDAC dataset and the dataset with 15% missing values randomly included show little difference. Interestingly, the PAN patients are situated in the sPCA plots between healthy individuals and PDAC patients. Based on the sPCA analysis for PDAC patients vs. healthy controls, we see that the separation of both groups is captured along PC2 and PC3 with a 17% and 8% variance explained, respectively.

Plotting Loadings with plotLoadings().

Next, we will look at plotting the loadings of each variable as a bar plot. The mixOmics package comes with a plotting function, plotLoadings(), which takes a PCA list object as input, as well as several other arguments for plotting loadings. See ?plotLoadings for more information.

# Plot loadings from the sPCA analysis performed on the complete PDAC dataset. 
# We are mostly interested in components 2 and 3.
plotLoadings(spca_no_PAN, comp = 2)
plotLoadings(spca_no_PAN, comp = 3)

Executing this function provides the following plot.

While this plot is accurate and easy to produce, a more dynamic and customisable plot can be created by extracting the loadings data from the PCA object, assigning that to an object in the global environment, and creating a bar graph using ggplot2. See the following code an output as an example of this.

# Extract sPCA loadings data from sPCA object
spca_no_PAN_loadings2 <- spca_no_PAN$loadings$X %>% # extracts the loadings information
  as.data.frame() %>% # Converts to dataframe
  mutate(features = rownames(.)) %>% # Creates a feature column referring to the lipid species
dplyr::select(PC2, features) %>% 
  arrange(-PC2) %>% # Arranging the dataframe by loading values (descending order)
  slice_head(n = 20) # Selecting the top 20 features - this reduces over plotting but can be changed 

# Plot loadings using ggplot2
ggplot(data = spca_no_PAN_loadings2,
       mapping = aes(x = PC2,
                     y = features)) +
  geom_col(alpha = 0.5,
           fill = "firebrick") +
  labs(x = "",
       y = "Lipid Species",
       title = "Loadings on Component 2 (Healthy controls vs PDAC)") +
  xlim(-0.25, 0.25) +
  theme_bw() +
  theme(text = element_text(face = "bold"),
        axis.title.x = element_text(margin = margin(t = 10)),
        axis.title.y = element_text(margin = margin(r = 10)))

Calculating and Visualising PCA Data with FactoMineR and factoextra

In this example, we will be using the PCA() function from FactoMineR in combination with the plotting functions supplied by the factoextra package to calculate and plot PCA data. First, let's look at calculating a PCA using PCA(). This will have to be performed using the full PDAC dataset, as this PCA function can't handle NAs in the input data. Running the command ?PCA() will show all the input parameters for the PCA function.

# Check PCA function
?PCA()

Calculating PCA with FactoMineR

Now, let's calculate a PCA using the PCA() function from the FactoMineR package. First, we will log10-transformed the PDAC data set, and Autoscale it:

# Data log10-transformation & Autoscaling:
Autoscaling <- function(x) {(x-mean(x))/sd(x)}

pdac_full_log_scale <- 
  pdac_full %>%
  mutate_if(is.numeric, log10) %>%
  mutate_if(is.numeric, ~Autoscaling(.))

Similarly to the pca() and spca() functions, the output will produce a list that includes all of the PCA information. Assigning the PCA output to an object and calling that object in the R console will produce the following output.

# PCA with FactoMineR 
facto_PCA <- PCA(X = pdac_full_log_scale[, 3:ncol(pdac_full_log_scale)], # excluding the label and sample name information
                 scale.unit = F, # Scale data -- we already scaled our data
                 ncp = 5, # Number of components set to 5
                 graph = F) # Don't graph the results

# Print the facto_PCA output
facto_PCA

As we can see, this output list shares many similarities with the output list from the spca() and pca() functions from mixOmics. While all of the relevant information could be extracted from the PCA list and assigned to objects in the global environment using the $ and <- operators for plotting with ggplot2, the package factoextra provides a number of useful plotting functions that extract and plot this information automatically.

Visualising PCA data with factoextra

Now that we have calculated our PCA, let's visualise some of the associated plots using functions from the factoextra package. We will be focusing on creating an individuals plot using fviz_pca_ind() which is coloured according to the grouping variable, a loadings plot to show the influence of each variable on the principal components using the fviz_contrib() function, and a scree plot to show the percentage of variance explained across the principal components using the fviz_screeplot() function. First, let's check the input parameters for each of those functions using the ?function() syntax.

# Check function parameters
?fviz_pca_ind()
?fviz_contrib()
?fviz_screeplot()

Let's produce a standard individuals PCA plot with the samples coloured according to the group label in the PDAC data frame using the fviz_pca_ind() function. It should be noted that in order to group the data points and colour them by the grouping variable, it must be present in the data frame as the factor class. This can be easily done by overriding the class, as seen below

# Change grouping variable to factor
pdac_full_log_scale$Label <- factor(pdac_full_log_scale$Label)

# Plot PCA individuals data
fviz_pca_ind(facto_PCA, # the PCA list object produced by PCA() 
             label = "none", # Don't label individuals
             habillage = pdac_full_log_scale$Label,
             palette = c("royalblue", "orange", "red2"),
             addEllipses = T,
             title = "PCA on Full PDAC Data")

Executing the above code will produce the following plot.

Now let's look at producing a loadings plot, which shows the contributions of each variable towards the principal components. We can produce this plot using the fviz_contrib() function.

# Plot the variable contributions to the second principal component
fviz_contrib(facto_PCA, # PCA output list
             choice = "var",  # plot variable contributions
             axes = 2, # second principal component
             top = 20) # top 20 features

The above plot can be customised to an extent to show different bar colours and fills or even contribution orders. However, factoextra also provides functions for extracting variable and individual information from PCA objects. Executing the code below will store the variable contributions for each component into a data frame, which can be assigned to the global environment. This data frame can then be used similarly to the example above to create a custom bar graph of loadings using ggplot().

# Extract PCA loadings information
facto_PCA_loadings <- get_pca_var(facto_PCA)$contrib

# View output
View(facto_PCA_loadings)

Finally, we will look at creating a scree plot using the fviz_screeplot() function. A scree plot shows the percentage of variance explained by all principal components and is commonly visualised as a bar graph.

# Create a scree plot 
fviz_screeplot(facto_PCA, # PCA object
               addlabels = T, # Add percentage values to the individual bars
               ylim = c(0, 35), # set the limits of the y-axis
               title = "Scree Plot of PDAC PCA Data") # Custom title for graph

The reconstruction of the PCA analysis presented in the manuscript (advanced version only)

Below, you can download the data set used to perform the PCA analysis from the article:

We will share the code used to prepare all the visualizations. Begin with reading the data set into R:

# Principal component analysis script
# Step one - reading the data into R:
## Activate the libraries
library(readxl)
library(tidyverse)
### Adjusting labels for loading plot (score plot optionally):
install.packages("ggrepel")
library(ggrepel)

## Set the working directory:
setwd("D:/Data analysis")

## Download the data set from above into your working directory.
## Load it into R:
data <- as_tibble(read_xlsx(file.choose()))

In the following steps, we will adjust the column types, transform and scale the data, and prepare a matrix for the PCA computation:

# Adjust variable types
data$`Sample Name` <- as.character(data$`Sample Name`)
data$Label <- as.factor(data$Label)

# Drop sample names column
data.no.ID <- data %>% select(-`Sample Name`)

# Auto- and Pareto-scaling of the data (functions):
Autoscaling <- function(x) {(x-mean(x))/sd(x)} # Optionally.
Paretoscaling <- function(x) {(x-mean(x))/sqrt(sd(x))} # To be used.

# Data transformation
data.PCA <- 
  data.no.ID %>%
  mutate_if(is.numeric, log) %>%
  mutate_if(is.numeric, ~Paretoscaling(.))
  
# Create a matrix with data for PCA analysis
X <- as.matrix(data.PCA %>% select(-Label))

The PCA is computed using prcomp()function from the base R:

# PCA computation
PCA <- prcomp(X, center = FALSE) # Our data set was Pareto-scaled above.

Now, we will extract the data needed to obtain the score plot of the PCA and prepare the visualization using the ggplot library:

# Extract PCA scores 
scores <- as.data.frame(PCA$x)

# Preparing data for plotting:
## PC1 and PC2
PC1 <- scores$PC1
PC2 <- scores$PC2
Label <- data.PCA$Label

## Build a data frame with all values (score plot):
PCA_plot <- data.frame(PC1, PC2, Label)

# Computations of variance explained

  ## Simple method 
  summary(PCA)
  ## Under 'Proportion of variance' PC1 = 42.0%, PC2 = 14.4%

  ## Compute it from the information in the PCA object
  var <- as.data.frame(PCA$sdev^2)
  
    ### Variance explained by the PC1
    var1 <- var[1,1]/sum(var)*100 # in [%]
    ### Variance explained by the PC2
    var2 <- var[2,1]/sum(var)*100 # in [%]
    ### Variance explained by the PC3
    var3 <- var[3,1]/sum(var)*100 # in [%]
    ### Variance explained by the PC4
    var4 <- var[4,1]/sum(var)*100 # in [%]
    ### Variance explained by the PC5
    var5 <- var[5,1]/sum(var)*100 # in [%]
    ## ...
    
# But it can be automated
  # Rows number
  rows <- nrow(loadings)    
  # Data frame for scree plot
  scree_plot <- tibble(
  PC = character(length =  36),
  `Variance explained` = numeric(length =  36),
  `Sum` = numeric(length = 36))

  
  # Loop performing the above calculation and putting every value in the scree plot df
  for (i in 1:36) {
  scree_plot$PC <- 1:36
  scree_plot[i, 2] <- var[i,1]/sum(var)*100
  }
  
  # Loop calculating cumulative var explained
  for (i in 1:36) {
      scree_plot[1,3] <- var[1,1]/sum(var)*100
      scree_plot[i+1,3] <- (scree_plot[i,3])+(var[i+1,1]/sum(var))*100
    }
  
  # Selecting data for plotting the scree plot:
  scree_plot_plotting <- as_tibble(scree_plot[1:6,1:3]) 
  glimpse(scree_plot_plotting)
  
  scree.plot.longer <- scree_plot_plotting %>% pivot_longer(cols = `Variance explained`:`Sum`,
                                                                     names_to = 'PCs', 
                                                                     values_to = 'val')
  # Ensuring that PCs are considered a factor variable:
scree.plot.longer$PCs <- as.factor(scree.plot.longer$PCs)

# Scores plot via ggplot2
scores_plot <- 
  ggplot(PCA_plot,aes(x=PC1, y = PC2, fill = Label))+
  scale_fill_manual(values = c('royalblue', 'red2', 'yellow')) +
  geom_point(size = 4, shape = 21)+
  theme_bw()+
  theme(axis.text.x = element_text(size = 16, color = 'black'),
        axis.text.y = element_text(size = 16, color = 'black'),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 16),
        title = element_text(size = 16),
        legend.position = c("bottom"))+
  labs(x = 'PC1 var. explained = 42.0%',
       y = 'PC2 var. explained = 14.1%',
       title = 'Score plot',
       fill = "Health status")

# To see the scores plot:
scores_plot

We obtain the following score plot:

The code below will allow obtaining the scree plot (based on computations from above):

# Scree plot   
# Variance explained by 5 components
   scree_plot <- 
     ggplot(scree.plot.longer, aes(x = PC, y = val, group = PCs, fill = PCs)) +
    geom_line() +
    geom_point(shape = 21, size = 4) +
    scale_fill_manual(values = c('darkgreen', 'red2')) +
    guides(fill = guide_legend(override.aes = list(shape = 21))) +
    geom_text(label = round(scree.plot.longer$val,2), hjust = 0.5, vjust = -0.5, size = 6) +
    theme_bw() +
    theme(axis.text.x = element_text(size = 16, color = 'black'),
          axis.text.y = element_text(size = 16, color = 'black'),
          legend.text = element_text(size = 16),
          legend.title = element_blank(),
          title = element_text(size = 16),
          legend.position = 'bottom') +
    labs(y = 'Variance explained [%]',
         x = 'Principal component',
         title = 'Scree plot')
    
    
   scree_plot

We obtain the following output:

Finally, we will separate the necessary information for plotting loadings:

# Loadings plot
# Extract loadings from the PCA:
loadings <- as.data.frame(PCA$rotation)

# PC1 and PC2 loadings:
L1 <- loadings$PC1
L2 <- loadings$PC2
Lipids <- rownames(loadings)

# Build a data frame with all values for plotting loadings plot:
Loadings <- data.frame(L1, L2, Lipids)

# Loadings plot via ggplot2:
loadings_plot <- ggplot(Loadings,aes(x=L1, y = L2))+
  geom_point(size = 4)+
 geom_text_repel(aes(label = Lipids), 
                           size = 4, 
                           min.segment.length = unit(0.3, "lines"), 
                           max.overlaps = getOption("ggrepel.max.overlaps", 
                                                    default = 20)) +
  theme_bw()+
  theme(axis.text.x = element_text(size = 16, color = 'black'),
        axis.text.y = element_text(size = 16, color = 'black'),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        legend.text = element_text(size = 16),
        legend.title = element_text(size = 16),
        title = element_text(size = 16))+
  labs(x = 'Loadings 1',
       y = 'Loadings 2',
       title = 'Loading plot') 

# To be see loadings plot run:
loadings_plot

The loadings plot obtained using the code above (after gentle modification using InkScape software):

# Installation of the library:
install.packages("ggimage")

# Call library:
library(ggimage)

# Score plot
## Generate a preview and optimize the plot presentation (scores plot):
ggpreview(plot = scores_plot,               # The object that you want to preview.
          width = 300,               # Width in px.
          height = 300,              # Height in px.
          units = "px",              # Unit - of size - px.
          dpi = 300,                 # Sharpness.
          scale = 7)            # You may need to use a different scale.


## Save the plot in the working directory using ggsave (ggplot2 package - tidyverse):
ggsave(plot = scores_plot,    # The R object to be saved.        
       device = "svg",           # Format.
       filename = "Scores_plot_gout.svg",
       width = 300,
       height = 300,
       units = "px",
       dpi = 300,
       scale = 7)

# Scree plot
## Generate a preview and optimize the scree plot:
ggpreview(plot = scree_plot,               # The object that you want to preview.
          width = 180,               # Width in px.
          height = 300,              # Height in px.
          units = "px",              # Unit - of size - px.
          dpi = 300,                 # Sharpness.
          scale = 7)            # You may need to use a different scale.

## Save the plot in the working directory using ggsave (ggplot2 package - tidyverse):
ggsave(plot = scree_plot,    # The R object to be saved.        
       device = "svg",           # Format.
       filename = "Scree_plot_gout.svg",
       width = 180,
       height = 300,
       units = "px",
       dpi = 300,
       scale = 7)

# Loading plot
## Generate a preview and optimize the plot presentation (loading plot):
ggpreview(plot = loadings_plot,               # The object that you want to preview.
          width = 300,               # Width in px.
          height = 300,              # Height in px.
          units = "px",              # Unit - of size - px.
          dpi = 300,                 # Sharpness.
          scale = 7)            # You may need to use a different scale.


## Save the plot in the working directory using ggsave (ggplot2 package - tidyverse):
ggsave(plot = loadings_plot,    # The R object to be saved.        
       device = "svg",           # Format.
       filename = "Loadings_plot_gout.svg",
       width = 300,
       height = 300,
       units = "px",
       dpi = 300,
       scale = 7)

All the obtained plots can be further processed, e.g., enriched in BioRender.com with additional elements or modified using a selected graphic design software.

Alternatively, the pca() and spca() from mixOmics can natively handle NA values in the input data through the implementation of the Non-linear Iterative Partial Least Squares (NIPALS) algorithm. For more information, refer to the mixOmics webpage (). For this reason, the mixOmics pacakge is a great solution for instances where data imputation wants to be avoided while retaining all features in the dataset.

To ensure high-quality graphics are exported for the article, we will use ggimage library ():

https://doi.org/10.1186/s13075-023-03204-6
https://doi.org/10.1186/s12974-024-03060-4
https://doi.org/10.1007/s00216-022-04490-w
https://doi.org/10.1093/brain/awae158
https://doi.org/10.1136/gutjnl-2024-332972
https://doi.org/10.1038/s41467-021-27765-9
https://doi.org/10.1038/s42255-023-00880-1
https://doi.org/10.1021/acs.jproteome.3c00443
http://mixomics.org/methods/spca/
http://mixomics.org/methods/missing-values/
https://cran.r-project.org/web/packages/ggimage/index.html
3MB
GOUT_CTRL_QC_Ales_data_31012025.xlsx
The lipidomics data set published by Kvasnička et al. in their manuscript Alterations in lipidome profiles distinguish early-onset hyperuricemia, gout, and the effect of urate-lowering treatment; Arthritis Research & Therapy (2023).
pca() arguments
spca() arguments
Log10-transformed, scaled, and centered input matrix.
Loadings matrix.
PCA variates matrix.
sPCA individuals plot on the full PDAC dataset using the plotIndv() function.
sPCA individuals plot on the PDAC dataset containing 15% missing values using the plotIndv() function.
sPCA plot - PDAC vs. healthy controls created using plotIndv() function.
Loadings of sPCA.
Loadings plot from PDAC subset data generated using ggplot().
PCA() Function from FactoMineR
PCA individuals plot using fviz_pca_ind()
Loadings plot of PC1 using the fviz_contrib() function.
PCA variable loadings extracted using get_pca_var()
Scree plot created using fviz_screeplot()
Score plot obtained for the LC-MS data set (gout patients vs healthy controls).
Scree plot obtained for the LC-MS data set (gout patients vs healthy controls).
Loading plot obtained for the LC-MS data set (gout patients vs healthy controls).