Fundamental data structures

Python has several built-in primitive data types, the most important for data analysis being:

  • Integer (int): Whole numbers, e.g., x = 10

  • Floating Point (float): Decimal numbers, e.g., y = 3.14

  • String (str): Sequence of characters, e.g., text = "Metabolites ID"

  • Boolean (bool): Logical values, True or False (useful for indicating whether a sample or metabolite should be selected from a data table)

  • None (NoneType): None is used to define a null value, or no value at all.
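The built-in type() function reports the type of any value, which can help when checking what was actually read from a data file. The following is a minimal sketch; the variable names and values are made up for illustration.

```python
# Illustrative values only; the variable names are hypothetical
n_samples = 10              # int
mz_value = 3.14             # float
label = "Metabolites ID"    # str
is_selected = True          # bool
missing = None              # NoneType

# Print each value together with the name of its type
for value in (n_samples, mz_value, label, is_selected, missing):
    print(repr(value), type(value).__name__)

# None is usually tested with "is" rather than "=="
print(missing is None)
```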

Lists in Python

A list is the most basic data structure for storing a collection of values in Python. Unlike R, which starts counting at 1, Python uses zero-based indexing, meaning the first element in a list is accessed with index 0. Lists can hold heterogeneous data (mixed types in a single list).

Creating a List

Lists are defined using square brackets [] and elements are separated by commas:

# List containing strings
lipids_list = ["Cholesterol", "PC 34:1", "PC 34:2", "TG 54:2"]

# List containing integers
int_list = [1, 100, 5, 4]

# List containing floating-point numbers (doubles)
float_list = [1.2, 3.5, 5.78, float('inf'), float('-inf'), float('nan')]

# List containing boolean values
bool_list = [True, False, True, True, False]

print(lipids_list)
print(int_list)
print(float_list)
print(bool_list)

Lists are highly flexible and allow modifications, including adding, removing, and modifying elements.
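As a small sketch of these modifications, the example below uses the standard list methods append(), insert(), remove(), and pop(). The list name lipids and its contents are illustrative only.

```python
# Illustrative list; the name "lipids" is hypothetical
lipids = ["Cholesterol", "PC 34:1", "PC 34:2"]

lipids.append("TG 54:2")      # add one element at the end
lipids.insert(1, "PC 32:0")   # insert a new element before index 1
lipids.remove("PC 34:2")      # remove the first element equal to this value
removed = lipids.pop()        # remove and return the last element

print(lipids)       # the remaining elements
print(removed)      # the element taken off the end
print(len(lipids))  # number of elements left
```

All four methods change the list in place, so there is no need to assign the result back to the variable.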

Accessing Elements in a List

Elements in a list can be accessed using indexing:

first_element = lipids_list[0]  # Accessing first element
last_element = lipids_list[-1]  # Accessing last element
subset = lipids_list[1:4]  # Slicing from index 1 up to, but not including, index 4
print(first_element)
print(last_element)
print(subset)

An element at a given position can be overwritten with a new value:

lipids_list[0] = "Ergosterol"  # Changing the first element in lipids_list
print(lipids_list)

Tuples in Python

A tuple is similar to a list, but it is immutable (cannot be changed after creation). Tuples are defined using parentheses ().

my_tuple = ("a", "b", "c")
num_tuple = (1, 2, 3, 4, 5)

Accessing Elements in a Tuple

Similar to lists, elements in a tuple can be accessed using indexing:

first_element = my_tuple[0]
last_element = my_tuple[-1]
print(first_element)
print(last_element)
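To illustrate what immutability means in practice, the sketch below tries to overwrite a tuple element and catches the resulting TypeError. It also shows tuple unpacking, a common way to read tuple contents. The variable names are illustrative.

```python
my_tuple = ("a", "b", "c")

# Trying to overwrite an element fails because tuples are immutable
try:
    my_tuple[0] = "z"
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Tuples can be unpacked into separate variables
first, second, third = my_tuple
print(first, second, third)
```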

Sets in Python

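A set is an unordered collection of unique elements, defined using curly braces {}. Note that an empty pair of braces {} creates an empty dictionary; use set() to create an empty set.

my_set = {"a", "b", "c", "a"}
print(my_set)  # Example output: {'a', 'b', 'c'} (duplicates are removed; the printed order may vary)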
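Sets are handy for comparing which features were detected in different runs. The following sketch uses the standard set operators for intersection (&), union (|), and difference (-). The batch names and their contents are invented for illustration.

```python
# Hypothetical sets of lipids detected in two batches
detected_batch1 = {"PC 34:1", "PC 34:2", "TG 54:2"}
detected_batch2 = {"PC 34:2", "TG 54:2", "Cholesterol"}

in_both = detected_batch1 & detected_batch2       # intersection
in_either = detected_batch1 | detected_batch2     # union
only_batch1 = detected_batch1 - detected_batch2   # difference

# sorted() returns a list, giving a stable order for printing
print(sorted(in_both))
print(sorted(in_either))
print(sorted(only_batch1))

# Membership tests are fast on sets
print("PC 34:1" in detected_batch1)
```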

Dictionaries in Python

A dictionary is a collection of key-value pairs, similar to named lists in R. It is defined using curly braces {} with keys and values separated by colons :.

mass_dict = {
    "PC 34:0": 761.5935,
    "PC 34:1": 759.5778,
    "PC 34:2": 757.5622,
    "PC 34:3": 755.5465
}

Accessing Elements in a Dictionary

Elements in a dictionary can be accessed using keys:

mz_34_0 = mass_dict["PC 34:0"]  # Accessing the value associated with key 'PC 34:0'
mz_34_1 = mass_dict.get("PC 34:1")  # Another way to access a value safely
print(mz_34_0)
print(mz_34_1)

Note that accessing a value with square brackets raises a KeyError if the key is not present in the dictionary. The .get() method, by contrast, does not raise an error for a missing key; it returns None instead.

mz_36_6 = mass_dict["PC 36:6"]  # This raises a KeyError because "PC 36:6" is not a key in the dictionary
mz_36_1 = mass_dict.get("PC 36:1")  # This does not raise an error but assigns None to the variable
print(mz_36_1)  # Output: None
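A few other dictionary operations are often useful alongside key lookup. The sketch below, using a shortened copy of the dictionary above, shows .get() with a custom default value, adding a new entry, testing for a key with "in", and looping over key-value pairs. Using NaN as the default is an assumption made for this example, not a requirement.

```python
import math

mass_dict = {"PC 34:0": 761.5935, "PC 34:1": 759.5778}

# .get() accepts a second argument that is returned when the key is missing
mz_missing = mass_dict.get("PC 36:6", float("nan"))
print(math.isnan(mz_missing))

# Assigning to a new key adds an entry to the dictionary
mass_dict["PC 34:2"] = 757.5622

# "in" checks whether a key exists without raising an error
has_34_2 = "PC 34:2" in mass_dict
print(has_34_2)

# .items() yields (key, value) pairs
for lipid, mz in mass_dict.items():
    print(f"{lipid}: {mz:.4f}")
```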

Arrays in Python

Python has no built-in array type equivalent to R's vectors, but arrays can be created using the standard library's array module or, more commonly, NumPy.

import numpy as np
arr = np.array([1, 2, 3, 4, 5])

Arrays are useful for numerical operations and are more efficient than lists for large datasets. Like lists, they can be indexed with square brackets to access or overwrite the value at a specified position:

first_element = arr[0]  # Access the first element of the array
print(first_element)

arr[0] = 10  # Overwrite the first element of the array
print(arr)
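Much of the practical benefit of arrays comes from element-wise (vectorized) operations, which apply to every element without an explicit loop. The sketch below assumes NumPy is installed and uses invented intensity values.

```python
import numpy as np

# Made-up intensity values for illustration
intensities = np.array([1200.0, 850.0, 430.0, 990.0])

# Arithmetic is applied to every element at once
scaled = intensities / intensities.sum()   # fractions that sum to 1
log_values = np.log2(intensities)          # element-wise log2

# A boolean mask selects the elements that meet a condition
high = intensities[intensities > 900]

print(scaled)
print(log_values)
print(high)
```

The same operations on a plain list would need a loop or a list comprehension.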

Other Data Types Used in Python for -Omics Analysis

  • NumPy Arrays: NumPy arrays can be multi-dimensional, for example 2D arrays that represent matrices.

  • Pandas DataFrames: Similar to R's data frames or tibbles, used for tabular data.

  • Pandas Series: Equivalent to a single column in a DataFrame, similar to a vector in R.

import pandas as pd

# Creating a DataFrame
data = {
    "Sample": ["S1", "S2", "S3"],
    "Lipid_Concentration": [12.5, 15.3, 18.1],
    "Group": ["Healthy", "Patient", "Healthy"]
}
df = pd.DataFrame(data)
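The following sketch shows how these three types relate. Selecting one DataFrame column gives a Series, a boolean condition filters rows, and a 2D NumPy array is indexed by row and column. It assumes NumPy and pandas are installed; the matrix values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sample": ["S1", "S2", "S3"],
    "Lipid_Concentration": [12.5, 15.3, 18.1],
    "Group": ["Healthy", "Patient", "Healthy"],
})

# Selecting a single column returns a pandas Series
conc = df["Lipid_Concentration"]
print(type(conc).__name__)
print(conc.mean())

# A boolean condition keeps only the matching rows
healthy = df[df["Group"] == "Healthy"]
print(healthy)

# A 2D NumPy array (matrix) with made-up values: 2 rows, 3 columns
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
print(matrix.shape)
print(matrix[1, 2])   # second row, third column
```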

Factors in Python (Categorical Data)

In Python, categorical data is handled using Pandas' Categorical type, similar to factors in R.

df["Group"] = df["Group"].astype("category")

Categorical variables are useful for grouping and statistical analysis.
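As a rough sketch of how a categorical column can be inspected and used for grouping, the example below rebuilds the small DataFrame from above, converts Group to a category, and computes the mean concentration per group. Passing observed=True to groupby is a choice made here to keep the behavior explicit for categorical keys.

```python
import pandas as pd

df = pd.DataFrame({
    "Sample": ["S1", "S2", "S3"],
    "Lipid_Concentration": [12.5, 15.3, 18.1],
    "Group": ["Healthy", "Patient", "Healthy"],
})
df["Group"] = df["Group"].astype("category")

# The distinct levels, comparable to levels() of a factor in R
categories = list(df["Group"].cat.categories)
print(categories)

# Each value is stored internally as an integer code
codes = df["Group"].cat.codes.tolist()
print(codes)

# Mean concentration per group
group_means = df.groupby("Group", observed=True)["Lipid_Concentration"].mean()
print(group_means)
```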

Summary

In Python, lists, tuples, sets, and dictionaries are the core data structures. For handling large datasets, NumPy arrays and Pandas DataFrames are commonly used, especially in bioinformatics and -omics research. Categorical data can be represented using Pandas' Categorical type, aiding in statistical analysis and grouping.
