Detecting missing values

Required packages

This section requires pandas, matplotlib and seaborn, plus openpyxl, which pandas uses to read .xlsx files. Install them with the following command in the command window (Windows) or terminal (Mac):

pip install pandas matplotlib seaborn openpyxl

Loading the data

Lipidomics_missing_values_EXAMPLE.xlsx

Place the downloaded Lipidomics_missing_values_EXAMPLE.xlsx file in the same folder as your JupyterLab notebook. Then run the following code in Jupyter:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

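# Read the example table; decimal="," handles numbers stored as text with a comma decimal separator
df = pd.read_excel("Lipidomics_missing_values_EXAMPLE.xlsx", decimal=",")
# Use the sample names as row labels so that only measurement columns remain
df.set_index("Sample Name", inplace=True)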
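
Before plotting, it can help to confirm numerically how much of the table is missing. The following is a minimal sketch that uses a small made-up table in place of the Excel file; the sample names, lipid names and values are hypothetical and only illustrate the calls:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the Excel data (not the real dataset)
toy = pd.DataFrame(
    {
        "Sample Name": ["S1", "S2", "S3", "S4"],
        "PC 34:1": [1.2, np.nan, 0.9, 1.1],
        "TG 52:2": [np.nan, np.nan, 3.4, 2.8],
        "CE 18:2": [0.5, 0.6, 0.4, 0.7],
    }
).set_index("Sample Name")

# isnull() gives a True/False table; summing twice counts every missing cell
total_missing = int(toy.isnull().sum().sum())

# The mean of the boolean array is the fraction of missing cells
overall_pct = toy.isnull().to_numpy().mean() * 100

print(f"{total_missing} missing values ({overall_pct:.1f}% of all cells)")
# prints: 3 missing values (25.0% of all cells)
```

The same two lines should work on df once the real file is loaded.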

We can generate a heatmap of the missing values across the table. With the default seaborn colour map, light cells indicate a missing value and dark cells indicate a value that is present:

plt.figure(figsize=(24, 30))  # Modify the width and height as needed
sns.heatmap(df.isnull(), cbar=False)
plt.savefig("missing_values_heatmap.png", dpi=200, bbox_inches='tight') 
plt.show()

We can plot the percentage of missing values in each sample:

# Calculate percentage of missing values per row (sample)
missing_percentage_per_sample = df.isnull().mean(axis=1) * 100

# Bar plot for missing values per sample
plt.figure(figsize=(30, 6))
missing_percentage_per_sample.sort_values(ascending=False).plot(kind='bar')
plt.title("Percentage of Missing Values per sample")
plt.xlabel("Samples")
plt.ylabel("Percentage Missing")
plt.tight_layout()
plt.show()
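
When there are many samples, the bar plot can be hard to read, so listing the samples above a cut-off may be a useful complement. This is a minimal sketch on a made-up table; the 50% threshold and all names are assumptions chosen for illustration, not a recommendation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the Excel data (not the real dataset)
toy = pd.DataFrame(
    {
        "Sample Name": ["S1", "S2", "S3", "S4"],
        "PC 34:1": [1.2, np.nan, 0.9, 1.1],
        "TG 52:2": [np.nan, np.nan, 3.4, 2.8],
        "CE 18:2": [0.5, 0.6, 0.4, 0.7],
    }
).set_index("Sample Name")

# Percentage of missing values per row (sample)
missing_pct = toy.isnull().mean(axis=1) * 100

# Assumed cut-off for illustration only; choose one that suits your study
threshold = 50

# Keep only the samples above the cut-off, worst first
flagged = missing_pct[missing_pct > threshold].sort_values(ascending=False)
print(flagged)
```

In this toy table only S2 is flagged, because two of its three values are missing.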

And for the species (columns):

# Calculate percentage of missing values per column (species)
missing_percentage_per_species = df.isnull().mean(axis=0) * 100

# Bar plot for missing values per species
plt.figure(figsize=(30, 6))
missing_percentage_per_species.sort_values(ascending=False).plot(kind='bar')
plt.title("Percentage of Missing Values per species")
plt.xlabel("Species")
plt.ylabel("Percentage Missing")
plt.tight_layout()
plt.show()
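
The per-species (per-column) figures can also be collected into a small summary table, which is easier to search than a wide bar plot. This is a minimal sketch on a made-up table; the column names n_missing and pct_missing, and all data, are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the Excel data (not the real dataset)
toy = pd.DataFrame(
    {
        "Sample Name": ["S1", "S2", "S3", "S4"],
        "PC 34:1": [1.2, np.nan, 0.9, 1.1],
        "TG 52:2": [np.nan, np.nan, 3.4, 2.8],
        "CE 18:2": [0.5, 0.6, 0.4, 0.7],
    }
).set_index("Sample Name")

# One row per species (column): missing count and percentage, worst first
summary = pd.DataFrame(
    {
        "n_missing": toy.isnull().sum(),
        "pct_missing": toy.isnull().mean() * 100,
    }
).sort_values("pct_missing", ascending=False)

print(summary)
```

Here TG 52:2 comes first with 2 of 4 values missing (50%), followed by PC 34:1 (25%) and CE 18:2 (0%).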
