PLS Discriminant Analysis
The required packages for this section are pandas, seaborn and scikit-learn. These can be installed with the following command in the command window (Windows) / terminal (Mac).
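A possible install command (assuming pip is available on your system; on some setups the command is pip3):

```
pip install pandas seaborn scikit-learn
```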
We will again use the demo lipidomics dataset:
Load the dataset into a Pandas DataFrame named df as described in the basic plotting section:
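A minimal sketch of this step; the file name lipidomics.csv is an assumption, so substitute the name (and format) of your downloaded demo file:

```python
import pandas as pd

# Read the demo lipidomics dataset into a DataFrame
# (file name is an assumption; adjust to your local copy)
df = pd.read_csv("lipidomics.csv")
```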
The first step is to normalise the data such that the features (lipids) have zero mean and unit variance. This can easily be done with the StandardScaler from sklearn. By indexing the dataframe with df.iloc[:,1:] we select all the data in the dataframe except for the first column, which contains the labels:
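For example, building on the df loaded above:

```python
from sklearn.preprocessing import StandardScaler

# Scale every lipid (all columns except the first, which holds the labels)
# to zero mean and unit variance
X = StandardScaler().fit_transform(df.iloc[:, 1:])
```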
The PLS algorithm was developed for continuous regression problems, but in PLS-DA we use the PLS algorithm with the response vector y being an integer representation of the group labels. To encode the labels as integers, we can use the LabelEncoder from sklearn:
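Along these lines (assuming the group labels are in the first column of df, as described above):

```python
from sklearn.preprocessing import LabelEncoder

# Encode the group labels (e.g. N, PAN, T) as integers (0, 1, 2)
y = LabelEncoder().fit_transform(df.iloc[:, 0])
```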
Next, we'll use the PLSRegression algorithm from sklearn, selecting 2 components:
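For example:

```python
from sklearn.cross_decomposition import PLSRegression

# Fit a PLS model with 2 components; the x-scores (T) are stored in x_scores_
pls = PLSRegression(n_components=2)
pls.fit(X, y)
scores = pls.x_scores_
```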
We put the results in a DataFrame, together with the group labels:
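Continuing from the fitted model above (the column names are illustrative):

```python
# Combine the two components with the original group labels
scores_df = pd.DataFrame(scores, columns=["Component 1", "Component 2"])
scores_df["Label"] = df.iloc[:, 0].values
```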
Now we can plot the results:
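A basic seaborn scatter plot of the two components, coloured by group:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One point per sample, coloured by its group label
sns.scatterplot(data=scores_df, x="Component 1", y="Component 2", hue="Label")
plt.show()
```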
The loadings of PLS-DA can be calculated as the Pearson correlation between the original centred and scaled data and the components T (x-scores):
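One way to compute these correlations, using numpy's corrcoef on the scaled data X and the x-scores from above:

```python
import numpy as np

# Pearson correlation of every scaled lipid with each of the two x-score columns
loadings = np.array([
    [np.corrcoef(X[:, i], scores[:, j])[0, 1] for j in range(scores.shape[1])]
    for i in range(X.shape[1])
])
```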
We'll put the calculated loadings in a DataFrame:
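For example:

```python
# One row per lipid, one column per component; lipid names taken from df
loadings_df = pd.DataFrame(loadings, columns=["Component 1", "Component 2"])
loadings_df["Lipid"] = df.columns[1:]
```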
And to visualise the loadings:
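A simple scatter plot of the loadings with the lipid names as text labels (a sketch; the styling is up to you):

```python
# Lipids far from the origin contribute most to the components
ax = sns.scatterplot(data=loadings_df, x="Component 1", y="Component 2")
for _, row in loadings_df.iterrows():
    ax.text(row["Component 1"], row["Component 2"], row["Lipid"], fontsize=7)
plt.show()
```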
Once a PLS model is constructed, it can be used to predict the labels of new samples (the labels are N, PAN or T in our demo dataset, but for this example we'll restrict the model to N and T samples). Since we don't have an independent dataset, we'll split off some random samples from our dataset before training the model: the training set will be used to construct the PLS-DA model, and the test set to evaluate its performance. Since we are now interested in getting a model that makes accurate predictions, rather than simply in visualising the data as in the previous section, we don't have to limit ourselves to just 2 components. We'll perform a cross-validation experiment to find the number of components that gives the best predictive performance. These concepts are explained in more detail in the R section on PLS-DA.
First, we'll limit our dataset to N and T samples, and we'll store the labels of those samples in a separate variable:
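A possible way to do this (scaling again after subsetting; the label column is assumed to be the first column of df):

```python
# Keep only the N and T samples, and store their labels separately
mask = df.iloc[:, 0].isin(["N", "T"])
labels_nt = df.loc[mask, df.columns[0]]

# Scale the corresponding lipid measurements
X_nt = StandardScaler().fit_transform(df.loc[mask, df.columns[1:]])
```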
For the PLS to work, we'll have to convert the categorical outcome variable (the labels T and N) to zeros and ones:
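For example:

```python
# Encode the two remaining labels (N, T) as 0 and 1
y_nt = LabelEncoder().fit_transform(labels_nt)
```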
Next, we'll split the dataset and labels into a training set (70% of the samples) and test set (30% of the samples). We use the stratify option to ensure that the training and test sets have the same proportion of T and N samples:
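Along these lines (no random_state is fixed here, so the split will vary between runs):

```python
from sklearn.model_selection import train_test_split

# 70% training / 30% test, keeping the N/T proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_nt, y_nt, test_size=0.3, stratify=y_nt
)
```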
Next, we'll perform a 5-fold cross-validation (cv=5) to find the optimal number of components (we'll test a range of 1 to 6 components):
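A sketch of such a loop using cross_val_score (the default scoring for a regressor is R², which is one possible choice of score here):

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training data for 1 to 6 components
for n_comp in range(1, 7):
    pls = PLSRegression(n_components=n_comp)
    cv_scores = cross_val_score(pls, X_train, y_train, cv=5)
    print(n_comp, cv_scores.mean())
```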
We observe the following output (the values may vary slightly due to the random nature of splitting the samples into training and test sets):
We observe that the scores increase up until 3 components, after which they start decreasing, making 3 components the best choice for this model. We now train a PLS model with 3 components on the training dataset, and we'll use this model to get predictions on the test dataset.
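For example:

```python
# Fit the final 3-component model on the training data
pls_final = PLSRegression(n_components=3)
pls_final.fit(X_train, y_train)

# Continuous predictions for the test samples
y_pred_continuous = pls_final.predict(X_test)
```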
Remember that PLS-DA is really just a PLS where we converted the categorical labels to zeros and ones, and PLS is a regression model which predicts a continuous output. We'll map all predicted values greater than 0.5 to 1 and all other values to 0:
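For example:

```python
# Threshold the continuous predictions at 0.5 to obtain 0/1 class labels
y_pred = (y_pred_continuous.ravel() > 0.5).astype(int)
```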
Finally, by comparing the predicted labels with the actual labels, we can calculate the accuracy of the model and we can calculate a confusion matrix:
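A sketch using sklearn's metrics module:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```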
We observe the following output (may again be slightly different):