The required packages for this section are pandas, seaborn and scikit-learn. These can be installed with the following command in the command window (Windows) / terminal (Mac).
pip install pandas seaborn scikit-learn
Loading the data
We will again use the demo lipidomics dataset:
Load the dataset into a Pandas DataFrame named df as described in the basic plotting section:
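For example (the file name here is an assumption; substitute the path used in the basic plotting section):

import pandas as pd

# Hypothetical file name for the demo lipidomics dataset
df = pd.read_csv("lipidomics_demo.csv")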
The first step is to normalise the data such that the features (lipids) have zero mean and unit variance. This can easily be done with the StandardScaler from sklearn. By indexing the dataframe with df.iloc[:,1:] we select all the data in the dataframe except for the first column, which contains the labels:
from sklearn.preprocessing import StandardScaler

# Scale each feature (lipid) to zero mean and unit variance
scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
PCA
Next, we use the PCA class from sklearn. We'll select the first 10 principal components and apply the PCA algorithm to the normalised data:
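A minimal sketch of this step, using the pca and n_components names that appear in the code below:

from sklearn.decomposition import PCA

# Fit a PCA model with 10 components on the normalised data
n_components = 10
pca = PCA(n_components=n_components)
pca.fit(df_normalised)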
We'll extract the loadings from the PCA results and put them in a DataFrame:
loadings = pd.DataFrame(data=pca.components_.T,
                        columns=[f"PC{i+1}" for i in range(n_components)],
                        index=df.columns[1:])
Next, we can visualise the loadings in a scatterplot, labelling each data point with the name of its feature:
fig, ax = plt.subplots(figsize=(15,15))
p1 = sns.scatterplot(x='PC1',
                     y='PC2',
                     data=loadings,
                     size=15,
                     ax=ax,
                     legend=False)
# Annotate each point with its feature (lipid) name; loadings.index holds
# the feature names (df.columns would be off by one, as it includes Label)
for line in range(loadings.shape[0]):
    p1.text(loadings.PC1.iloc[line]+0.002, loadings.PC2.iloc[line],
            loadings.index[line], horizontalalignment='left',
            size='small', color='black')
plt.show()
PCA on data with missing values
The standard PCA algorithm does not handle datasets with missing values. As an alternative to imputing missing values prior to PCA, there exist PCA variants that deal with missing values directly; avoiding imputation in this way can reduce bias. A popular PCA variant that can handle missing values is Probabilistic Principal Component Analysis (PPCA), first described by Michael E. Tipping and Christopher M. Bishop (Neural Computation 11(2): 443–482, 1999).
Required packages
The required packages for this section are pandas, seaborn and scikit-learn. These can be installed with the following command in the command window (Windows) / terminal (Mac).
pip install pandas seaborn scikit-learn
In addition, we need to install the pyppca package from GitHub:
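For example, directly from the repository (assuming the el-hult/pyppca repository, which provides the ppca function used below):

pip install git+https://github.com/el-hult/pyppca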
We will use the demo lipidomics dataset with missing values:
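As before, a sketch of loading the data into a DataFrame named df (the file name is an assumption):

import pandas as pd

# Hypothetical file name for the demo dataset with missing values
df = pd.read_csv("lipidomics_demo_missing.csv")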
Data normalisation
The first step is to normalise the data such that the features (lipids) have zero mean and unit variance. This can easily be done with the StandardScaler from sklearn. By indexing the dataframe with df.iloc[:,1:] we select all the data in the dataframe except for the first column, which contains the labels:
from sklearn.preprocessing import StandardScaler

# StandardScaler ignores NaNs when computing each feature's mean and
# variance, and keeps them in the transformed output
scaler = StandardScaler()
df_normalised = scaler.fit_transform(df.iloc[:,1:])
PPCA
Next, we use the ppca function from pyppca. We'll select the first 10 principal components and apply the PPCA algorithm to the normalised data. The first argument of the ppca function is a numpy array that contains the data, the second argument is the number of principal components, and the final argument is a boolean that indicates whether algorithmic run details should be printed:
from pyppca import ppca

n_components = 10
C, ss, M, X, Ye = ppca(df_normalised, n_components, False)
The returned variables are:
C: (D by d) the covariance model is C*C' + I*ss; C has the scaled principal directions as columns
ss: (float) isotropic variance outside the subspace
M: (D by 1) data mean
X: (N by d) expected states
Ye: (N by D) expected complete observations (differs from Y if data is missing)
The ppca function from pyppca returns the projected samples as a numpy ndarray in the X variable. Let's put these results in a Pandas DataFrame:
pca_features = pd.DataFrame(data=X,
                            columns=[f"PC{i+1}" for i in range(n_components)],
                            index=df.index)
pca_features["Label"] = df.Label
We can now visualise the projection of the samples onto the new feature space with a scatterplot:
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='PC1',
                y='PC2',
                data=pca_features,
                hue="Label",
                palette=["royalblue", "orange", "red"])
plt.show()
Explained variance
We can visualise the explained variance of the first n (we chose 10) principal components with:
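pyppca does not return an explained-variance ratio directly, so the sketch below estimates it from the variance of the projected samples, normalised over the 10 retained components (an assumption, since the variance outside the subspace is not included):

# Per-component variance of the projected samples
explained_variance = pca_features[[f"PC{i+1}" for i in range(n_components)]].var()
# Fraction of the variance captured by the retained components
explained_variance_ratio = explained_variance / explained_variance.sum()

sns.barplot(x=explained_variance_ratio.index, y=explained_variance_ratio.values)
plt.ylabel("Explained variance ratio")
plt.xlabel("Principal component")
plt.show()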