Omics data visualization in R and Python

Correlation analysis

Required packages

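The required packages for this section are pandas and seaborn (NumPy and Matplotlib, which are also used below, are installed automatically as dependencies). Reading .xlsx files with pandas additionally needs openpyxl, if it is not already installed from the earlier sections. The packages can be installed with the following command in the command window (Windows) / terminal (Mac).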

pip install pandas seaborn

Loading the data

As in the other sections, we will use the lipidomics demo dataset:

import pandas as pd
import numpy as np
df = pd.read_excel("Lipidomics_dataset.xlsx", decimal=",")
df.set_index("Sample Name", inplace=True)

Correlation plots

Creating a basic correlation plot is again very simple with Pandas and Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.corr(numeric_only=True));
plt.show()

We call the corr function on the DataFrame (with numeric_only=True, so that the non-numeric Label column in our DataFrame is ignored) and pass the resulting correlation matrix as an argument to the heatmap function from seaborn. The resulting plot is a heatmap with one row and one column per lipid.
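To see on a small scale what corr does with numeric_only=True, here is a minimal sketch. The toy DataFrame and its column names are made up for illustration and are not part of the lipidomics dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for the lipidomics table
toy = pd.DataFrame(
    {
        "Label": ["A", "A", "B", "B"],       # non-numeric column
        "Lipid 1": [1.0, 2.0, 3.0, 4.0],
        "Lipid 2": [2.0, 4.0, 6.0, 8.0],     # rises with Lipid 1
        "Lipid 3": [4.0, 3.0, 2.0, 1.0],     # falls as Lipid 1 rises
    }
)

# numeric_only=True drops the Label column before computing Pearson correlations
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

The Label column does not appear in the result, and each lipid has a correlation of 1 with itself on the diagonal.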

There are several problems with this plot that we'll address one by one. For starters, there are many more lipids in this plot than there is room for axis labels. We can fix this by setting the canvas size when we create the figure with plt.subplots, and the label size on the ax object that it returns (you may need to adjust the figsize and labelsize parameters further):

fig, ax = plt.subplots(figsize=(20,20))
ax.tick_params(axis='both', which='major', labelsize=6)
sns.heatmap(df.corr(numeric_only=True), ax=ax);
plt.show()

Next, correlation values lie in the interval [-1, 1], so it makes more sense to use a diverging color scale that is white at zero, shifts to a different color for positive and for negative values, and is fixed at -1 for the minimum and +1 for the maximum:

# Create the figure and axes first, then draw the heatmap onto them
fig, ax = plt.subplots(figsize=(20,20))
ax.tick_params(axis='both', which='major', labelsize=6)
sns.heatmap(df.corr(numeric_only=True), ax=ax, cmap='vlag', vmin=-1, vmax=1);
plt.show()

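Finally, we'll remove the redundant symmetry: a correlation matrix is symmetric, so the upper triangle repeats the lower one. We'll create a mask by first creating a matrix of ones with the same dimensions as the correlation matrix (using the numpy ones_like function, with the data type set to bool), and then setting the values below the diagonal to False with the numpy triu function. Seaborn hides every cell where the mask is True, so the upper triangle and the diagonal are hidden and only the lower triangle is drawn: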

mask = np.triu(np.ones_like(df.corr(numeric_only=True), dtype=bool))
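To make the shape of the mask concrete, here is a minimal sketch on a 3x3 matrix. The correlation values are made up for illustration:

```python
import numpy as np

# Hypothetical 3x3 correlation matrix (values are illustrative only)
corr = np.array(
    [
        [1.0, 0.8, -0.2],
        [0.8, 1.0, 0.1],
        [-0.2, 0.1, 1.0],
    ]
)

# True on and above the diagonal, False below it
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# With k=1 the diagonal is False as well, so it would stay visible in the heatmap
mask_keep_diagonal = np.triu(np.ones_like(corr, dtype=bool), k=1)
print(mask_keep_diagonal)
```

Cells where the mask is True are the ones seaborn leaves blank. The k=1 variant is optional; it is useful only if you want the diagonal of ones to remain in the plot.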

We then pass this mask to the mask parameter of the heatmap function:

# Create the figure and axes first, then draw the masked heatmap onto them
fig, ax = plt.subplots(figsize=(20,20))
ax.tick_params(axis='both', which='major', labelsize=6)
sns.heatmap(df.corr(numeric_only=True), mask=mask, ax=ax, cmap='vlag', vmin=-1, vmax=1);
plt.show()

Lastly, in correlation maps with fewer variables, it can be useful to add annot=True to the heatmap parameters, as this prints the actual correlation values in the cells of the map.
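As an illustration of annot=True, the sketch below draws an annotated heatmap for three made-up variables. The toy values and column names are hypothetical, and the fmt=".2f" argument is an optional addition that rounds the printed values to two decimals:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy values, not taken from the lipidomics dataset
toy = pd.DataFrame(
    {
        "Lipid 1": [1.0, 2.0, 3.0, 4.0],
        "Lipid 2": [2.1, 3.9, 6.2, 7.8],
        "Lipid 3": [4.0, 3.5, 1.0, 2.0],
    }
)

fig, ax = plt.subplots(figsize=(4, 4))
# annot=True writes each correlation value into its cell
sns.heatmap(toy.corr(), ax=ax, cmap="vlag", vmin=-1, vmax=1, annot=True, fmt=".2f")
```

Call plt.show() afterwards to display the figure. With many variables the annotations overlap and become unreadable, which is why this option suits small correlation maps best.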