Principal Component Analysis (PCA)
Metabolites and lipids multivariate statistical analysis in R
Last updated
Metabolites and lipids multivariate statistical analysis in R
Last updated
A principal component analysis (PCA) is an excellent dimensionality reduction technique that can be used to extract and visualise variance in a lipidomics and metabolomics dataset without knowledge of class labels.
PCA is among the most often utilized multivariate statistical methods in metabolomics and lipidomics, primarily for the determination of sources of variance but also for quality control. Check out the following examples below:
A. Kvasnička et al. Alterations in lipidome profiles distinguish early-onset hyperuricemia, gout, and the effect of urate-lowering treatment. DOI: - Fig. 1A Fig. 5 (the authors utilize PCA for comparing healthy controls and patients' lipid profiles, e.g., they observe the separation of controls from patients along PC1 in all 3 PCA score plots based on plasma lipid composition; morover, they present measurements integrity - clustering of QC samples).
D. Olešová et al. Changes in lipid metabolism track with the progression of neurofibrillary pathology in tauopathies. DOI: - Fig. 3A, Fig. 4A, Fig. 5H (the authors use PCA analysis for presenting distinct metabolic patterns between experimental groups).
J. Idkowiak et al. Robust and high-throughput lipidomic quantitation of human blood samples using flow injection analysis with tandem mass spectrometry for clinical use. DOI: - Fig. 7B - C, Fig. 8A (the authors employ PCA analysis for comparing results obtained by different analytical methods after batch effect removal; in the case of Fig. 8A - authors present that second source of variance separates PDAC patients from healthy controls along PC2, as well as analytical method integrity by clustering of QC samples).
R. C. Prior, A. Silva & T. Vangansewinkel et al. PMP22 duplication dysregulates lipid homeostasis and plasma membrane organization in developing human Schwann cells. DOI: - e.g., Fig. 2A (the authors utilize PCA for comparing lipid profiles of sciatic nerves from C3 mice versus their wild-type (WT) littermate controls at 5 weeks of age).
Y. Yu et al. Unravelling lipidomic disruptions across multiple tissues in Chd8-mutant ASD mice through integration of lipidomics and single-cell transcriptomics. DOI: - Fig. 1b, Fig. 2h, (the authors of the study published in Gut utilize PCA analysis for analysis of similarities and differences between experimental groups).
D. Wolrab et al. Lipidomic profiling of human serum enables detection of pancreatic cancer. DOI: - Fig. 5a, Fig. 6a (the authors utilize PCA for studying similarities and differences in lipid profiles between PDAC patients and healthy controls).
D. Hornburg et al. Dynamic lipidome alterations associated with human health, disease and ageing. DOI: - Fig. 3a (the authors apply PCA for comparing two experimental groups (IR and IS), based on lipidome).
N. De Silva Mohotti et al. Lipidomic Analysis Reveals Differences in the Extent of Remyelination in the Brain and Spinal Cord. DOI: - Fig. 4, Fig. 5, Fig. 8 & Fig. 9 (the authors used PCA for comparing lipidomes among different experimental groups in a study dedicated to remyelination in the brain and spinal cord, i.e., overlapping or separation of biological groups in the PCA score plot).
There are several packages available in R that include functions capable of computing and often visualising PCA data such as; prcomp() from the base R stats package, PCA() from the FactoMineR package and pca() from the mixOmics package. Here we will focus on the implementation of the FactoMineR and the mixOmics package for the following reasons. First, mixOmics provides an easily customisable plotting function with plotIndiv() that can be used to plot individuals from PCA data in addition to several other informative plots. Similarly, the factoextra package provides plotting functions that can be used to extract PCA data directly from the PCA() output of FactoMineR. The mixOmics package also offers the Sparse PCA function spca(), which is an alternative to regular PCA and is suitable for large omics datasets. More information on Sparce PCA can be found on the mixOmics webpage ().
As described earlier in this GitBook, missing values are common in lipidomics and metabolomics datasets generated by mass spectrometry based methods. For many statistical tests and models, missing values must be dealt with prior to the analysis. As described in the Missing Values Handling in R section, features with a large percentage of NA values can be filtered based on some pre-selected criteria. Alternatively, imputing missing values based on strategies such as k-nearest neighbours (kNN), or some percentage of the lowest concentration for each feature, is a valid approach that ensures each feature in the dataset is retained.
PCA analysis is one such example that is sensitive to missing values. The prcomp() and PCA() from the stats and FactoMineR packages reviewed below require datasets without missing values. If data containing missing values are given to the above functions in R, the following messages will be printed in the console.
Executing the prcomp() function from the stats package:
While the PCA() function from FactoMineR package will only produce a warning message, however imputation will be performed for you based on the mean of each variable.
Ensure the mixOmics, factoextra and FactoMineR packages are installed and loaded. For data transformation and formatting, functions from dplyr will also be used. It should be noted that loading the mixOmics package will override the use of the select() function from dplyr. As such, calling the function using the dplyr::select() syntax will be required.
Here, we will again use the example PDAC lipidomics dataset to showcase both the mixOmics and factoextra packages. Additionally, to display the spca() and pca() functions capability of handling NA values, we will create an additional dataset with 15% of the data randomly replaced with NA.
Now, let's use the data frames we have read into R to perform PCA and sPCA analysis on both the full dataset and the dataset that contains 15% missing values. Before we start, we can use ?pca() and ?spca() to see what parameters these functions take.
As we can see, there is some overlap in the arguments used in the pca() and spca() functions. Please refer to the R Documentation for more information, as these will not be discussed here. An important difference to note is the default values for the scale argument. In pca(), the scale argument is set to FALSE by default, while it is set to TRUE by default in spca(). The ncomp argument refers to the number of components (or reduced dimensions) the data will be reduced to. While the amount of variance in the dataset is maximised across PC1 and PC2 (and then decreasing across all further PCs), it is worth expanding the number of components to determine if important variance is captured by later PCs.
Now, we are ready to perform a PCA and sPCA analysis using mixOmics. Under the pca() function, we log10-transform our data frame (data matrix) to minimize the influence of outliers, we keep the center argument as TRUE to ensure PC1 describes the direction of maximum variance, and we change the scale argument to TRUE to prevent dominating the PCA by high-abundance lipids:
Executing these functions will perform the PCA or sPCA analysis. The output from these functions creates a list with all the associated data that can be assigned to a variable in the global environment. For a full description of each component of the output list, see the R Documentation using ?pca(). The important outputs that should be noted are; X - the input data matrix with observations scaled and centred (if scale = TRUE and center = TRUE); in the code block above, we additionally used log10() on our data frame (data matrix), loadings - a matrix of eigenvalues representing the influence of each variable on the principal components, variates - a matrix containing the PC values for projecting samples into the principal component space.
Let's have a look at some of these outputs, which can be called using the $ operator. We will use just the regular PCA on the full dataset for this.
Next, we will look at how we can visualise the PCA and sPCA data using the plotIndiv() function from the mixOmics package. It should be noted that if you want to create a custom ggplot, this can be done using the variates matrix.
Lets first look at plotting the samples and their position in the new principal component space. This is the standard PCA plot that is most commonly displayed. While PCA is an 'unsupervised' analysis, in that no prior knowledge of class labels is used to perform the analysis, the samples can be coloured according to a grouping variable. If samples appear to be clustering according to some class label, it can be inferred that the variance in the principal component is attributed to that class.
Check out the parameters included in the plotIndiv() function.
There is an extensive list of parameters that can be included in the plotIndiv() function, most of them are used to customise the output plot. Let's plot the samples from just the sPCA variations calculated above. Furthermore, let's colour samples by the grouping variable in the data frame that was used as input in the spca() function.
As we can see above, the clustering of samples and relative percentages of explained variance for PC2 and PC3 from both the full PDAC dataset and the dataset with 15% missing values randomly included show little difference. Interestingly, the PAN patients are situated in the sPCA plots between healthy individuals and PDAC patients. Based on the sPCA analysis for PDAC patients vs. healthy controls, we see that the separation of both groups is captured along PC2 and PC3 with a 17% and 8% variance explained, respectively.
Next, we will look at plotting the loadings of each variable as a bar plot. The mixOmics package comes with a plotting function, plotLoadings(), which takes a PCA list object as input, as well as several other arguments for plotting loadings. See ?plotLoadings for more information.
Executing this function provides the following plot.
While this plot is accurate and easy to produce, a more dynamic and customisable plot can be created by extracting the loadings data from the PCA object, assigning that to an object in the global environment, and creating a bar graph using ggplot2. See the following code an output as an example of this.
In this example, we will be using the PCA() function from FactoMineR in combination with the plotting functions supplied by the factoextra package to calculate and plot PCA data. First, let's look at calculating a PCA using PCA(). This will have to be performed using the full PDAC dataset, as this PCA function can't handle NAs in the input data. Running the command ?PCA() will show all the input parameters for the PCA function.
Now, let's calculate a PCA using the PCA() function from the FactoMineR package. First, we will log10-transformed the PDAC data set, and Autoscale it:
Similarly to the pca() and spca() functions, the output will produce a list that includes all of the PCA information. Assigning the PCA output to an object and calling that object in the R console will produce the following output.
As we can see, this output list shares many similarities with the output list from the spca() and pca() functions from mixOmics. While all of the relevant information could be extracted from the PCA list and assigned to objects in the global environment using the $ and <- operators for plotting with ggplot2, the package factoextra provides a number of useful plotting functions that extract and plot this information automatically.
Now that we have calculated our PCA, let's visualise some of the associated plots using functions from the factoextra package. We will be focusing on creating an individuals plot using fviz_pca_ind() which is coloured according to the grouping variable, a loadings plot to show the influence of each variable on the principal components using the fviz_contrib() function, and a scree plot to show the percentage of variance explained across the principal components using the fviz_screeplot() function. First, let's check the input parameters for each of those functions using the ?function() syntax.
Let's produce a standard individuals PCA plot with the samples coloured according to the group label in the PDAC data frame using the fviz_pca_ind() function. It should be noted that in order to group the data points and colour them by the grouping variable, it must be present in the data frame as the factor class. This can be easily done by overriding the class, as seen below
Executing the above code will produce the following plot.
Now let's look at producing a loadings plot, which shows the contributions of each variable towards the principal components. We can produce this plot using the fviz_contrib() function.
The above plot can be customised to an extent to show different bar colours and fills or even contribution orders. However, factoextra also provides functions for extracting variable and individual information from PCA objects. Executing the code below will store the variable contributions for each component into a data frame, which can be assigned to the global environment. This data frame can then be used similarly to the example above to create a custom bar graph of loadings using ggplot().
Finally, we will look at creating a scree plot using the fviz_screeplot() function. A scree plot shows the percentage of variance explained by all principal components and is commonly visualised as a bar graph.
Below, you can download the data set used to perform the PCA analysis from the article:
We will share the code used to prepare all the visualizations. Begin with reading the data set into R:
In the following steps, we will adjust the column types, transform and scale the data, and prepare a matrix for the PCA computation:
The PCA is computed using prcomp()
function from the base R:
Now, we will extract the data needed to obtain the score plot of the PCA and prepare the visualization using the ggplot library:
We obtain the following score plot:
The code below will allow obtaining the scree plot (based on computations from above):
We obtain the following output:
Finally, we will separate the necessary information for plotting loadings:
The loadings plot obtained using the code above (after gentle modification using InkScape software):
All the obtained plots can be further processed, e.g., enriched in BioRender.com with additional elements or modified using a selected graphic design software.
Alternatively, the pca() and spca() from mixOmics can natively handle NA values in the input data through the implementation of the Non-linear Iterative Partial Least Squares (NIPALS) algorithm. For more information, refer to the mixOmics webpage (). For this reason, the mixOmics pacakge is a great solution for instances where data imputation wants to be avoided while retaining all features in the dataset.
To ensure high-quality graphics are exported for the article, we will use ggimage library ():