Fundamental data structures
Python has several built-in primitive data types, the most important for data analysis being:
Integer (int): Whole numbers, e.g., x = 10
Floating Point (float): Decimal numbers, e.g., y = 3.14
String (str): Sequence of characters, e.g., text = "Metabolites ID"
Boolean (bool): Logical values, True or False (This can be useful to indicate if a sample or metabolite should be selected from a data table)
None (NoneType): None is used to define a null value, or no value at all.
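As a quick check of the types listed above, the following sketch assigns one value of each kind and asks Python for its type. The variable names `selected` and `missing` are illustrative and not part of the original text.

```python
# Example values for each primitive type (names are illustrative only).
x = 10                    # int
y = 3.14                  # float
text = "Metabolites ID"   # str
selected = True           # bool
missing = None            # NoneType

# type(value).__name__ gives the name of each value's type.
type_names = [type(v).__name__ for v in (x, y, text, selected, missing)]
print(type_names)  # ['int', 'float', 'str', 'bool', 'NoneType']
```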
Lists in Python
A list is the most basic data structure in Python used to store data. Unlike R, Python follows zero-based indexing, meaning the first element in a list is accessed with index 0. Lists in Python can hold heterogeneous data types (mixed types of data in a single list).
Creating a List
Lists are defined using square brackets [] and elements are separated by commas:
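A minimal sketch of list creation, using made-up sample names and values purely for illustration:

```python
# A list of strings (hypothetical sample identifiers).
samples = ["S1", "S2", "S3"]

# A list can mix types: str, float, int, bool and None in one list.
mixed = ["glucose", 180.16, 6, True, None]

# An empty list.
empty = []

print(len(samples), len(mixed), len(empty))  # 3 5 0
```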
Lists are highly flexible and allow modifications, including adding, removing, and modifying elements.
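The modifications mentioned above can be sketched with the standard list methods `append`, `insert`, `remove` and `pop`. The metabolite names are example data only.

```python
metabolites = ["glucose", "lactate"]

metabolites.append("pyruvate")     # add an element at the end
metabolites.insert(0, "citrate")   # insert an element at position 0
metabolites.remove("lactate")      # remove the first element equal to "lactate"
last = metabolites.pop()           # remove and return the last element

print(metabolites)  # ['citrate', 'glucose']
print(last)         # pyruvate
```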
Accessing Elements in a List
Elements in a list can be accessed using indexing, and the element at a given position can be overwritten with a new value:
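The sketch below illustrates zero-based indexing, negative indexing, slicing and overwriting; the intensity values are invented for the example.

```python
intensities = [10.5, 20.1, 30.7, 40.2]

first = intensities[0]      # index 0 is the first element
last = intensities[-1]      # negative indices count from the end
middle = intensities[1:3]   # slice: positions 1 and 2 (the end index is excluded)

intensities[1] = 99.9       # overwrite the element at position 1

print(first, last, middle)  # 10.5 40.2 [20.1, 30.7]
print(intensities)          # [10.5, 99.9, 30.7, 40.2]
```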
Tuples in Python
A tuple is similar to a list, but it is immutable (cannot be changed after creation). Tuples are defined using parentheses ().
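A short sketch of tuple creation and of what immutability means in practice. The values are illustrative; note that a one-element tuple needs a trailing comma.

```python
record = (1.5, "glucose", True)
single = ("only",)   # the trailing comma makes this a tuple, not a string

# Trying to change an element raises a TypeError because tuples are immutable.
try:
    record[0] = 2.0
except TypeError as err:
    error_message = str(err)

print(record)         # (1.5, 'glucose', True)
print(error_message)  # 'tuple' object does not support item assignment
```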
Accessing Elements in a Tuple
Similar to lists, elements in a tuple can be accessed using indexing:
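Tuple indexing works the same way as list indexing, and a tuple can also be unpacked into separate variables. In this sketch the tuple layout (mass, retention time, name) is a hypothetical convention chosen for the example.

```python
# Hypothetical layout: (mass, retention time, name)
peak = (180.16, 5.2, "glucose")

mass = peak[0]    # first element
name = peak[-1]   # last element

# Unpacking assigns each element to its own variable in one step.
peak_mass, peak_rt, peak_name = peak

print(mass, name)   # 180.16 glucose
print(peak_rt)      # 5.2
```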
Sets in Python
A set is an unordered collection of unique elements, defined using curly braces {}.
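The following sketch shows duplicates collapsing in a set, plus intersection and union. The metabolite names are example data. One detail worth knowing: empty braces `{}` create a dictionary, so an empty set is written `set()`.

```python
# The duplicate "glucose" is stored only once.
detected = {"glucose", "lactate", "glucose", "pyruvate"}
reference = {"lactate", "citrate"}

shared = detected & reference     # intersection: elements present in both sets
combined = detected | reference   # union: elements present in either set

empty_set = set()                 # {} would create an empty dictionary instead

print(len(detected))              # 3
print(shared)                     # {'lactate'}
print("citrate" in combined)      # True
```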
Dictionaries in Python
A dictionary is a collection of key-value pairs, similar to named lists in R. It is defined using curly braces {}, with each key separated from its value by a colon (:) and the pairs separated by commas.
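A minimal sketch of creating a dictionary and adding a new pair to it. The keys and values describe a made-up metabolite record.

```python
metabolite = {"name": "glucose", "formula": "C6H12O6", "mass": 180.16}

# Assigning to a new key adds a key-value pair.
metabolite["class"] = "carbohydrate"

# Dictionaries keep insertion order (Python 3.7+).
keys = list(metabolite)
print(keys)  # ['name', 'formula', 'mass', 'class']
```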
Accessing Elements in a Dictionary
Elements in a dictionary can be accessed using keys:
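As a sketch, values are looked up, overwritten and iterated by key. The masses below are approximate example values.

```python
masses = {"glucose": 180.16, "lactate": 90.08}

glucose_mass = masses["glucose"]   # look up a value by its key
masses["lactate"] = 90.1           # overwrite the value stored under a key

# .items() yields (key, value) pairs for iteration.
labels = [f"{name}: {mass}" for name, mass in masses.items()]

print(glucose_mass)  # 180.16
print(labels)        # ['glucose: 180.16', 'lactate: 90.1']
```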
Note that using square brackets to access a value raises a KeyError if the provided key isn't present in the dictionary. The .get() method, however, does not raise an error for a missing key; it returns None instead, or a default value if one is supplied as the second argument.
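The difference between the two access styles can be sketched as follows, using a small example dictionary that deliberately lacks the key "citrate":

```python
masses = {"glucose": 180.16}

# Square brackets raise KeyError for a missing key.
try:
    masses["citrate"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

missing = masses.get("citrate")         # no error; returns None
fallback = masses.get("citrate", 0.0)   # returns the supplied default instead

print(lookup_failed, missing, fallback)  # True None 0.0
```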
Arrays in Python
Python does not have a built-in array type comparable to R's vectors, but arrays can be created using the array module from the standard library or the third-party NumPy package.
Arrays are useful for numerical operations and are more efficient than lists for large datasets. Similar to lists they can be indexed with square brackets to access or overwrite values at a specified position:
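A minimal sketch using the standard-library `array` module, so that no extra installation is assumed. The type code "d" declares that every element is a double-precision float, and NumPy arrays are indexed in the same way.

```python
from array import array

# "d" = double-precision float; all elements must share this type.
values = array("d", [1.0, 2.5, 4.0])

first = values[0]    # access by position (zero-based)
values[1] = 3.5      # overwrite the value at position 1
total = sum(values)  # 1.0 + 3.5 + 4.0

print(first, total)     # 1.0 8.5
print(values.tolist())  # [1.0, 3.5, 4.0]
```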
Other Data Types Used in Python for -Omics Analysis
NumPy Arrays: NumPy arrays can be multi-dimensional, for example 2D, to represent matrices.
Pandas DataFrames: Similar to R's data frames or tibbles, used for tabular data.
Pandas Series: Equivalent to a single column in a DataFrame, similar to a vector in R.
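The three types above can be sketched together, assuming NumPy and pandas are installed. The sample and metabolite labels are invented for the example.

```python
import numpy as np
import pandas as pd

# 2D NumPy array: rows are samples, columns are metabolites (example layout).
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# DataFrame: the same values with row and column labels.
df = pd.DataFrame(matrix, index=["S1", "S2"], columns=["glucose", "lactate"])

# Selecting a single column returns a Series.
glucose = df["glucose"]

print(matrix.shape)         # (2, 2)
print(type(glucose).__name__)  # Series
print(glucose["S2"])        # 3.0
```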
Factors in Python (Categorical Data)
In Python, categorical data is handled using Pandas' Categorical type, similar to factors in R.
Categorical variables are useful for grouping and statistical analysis.
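A short sketch of a categorical variable used for grouping, assuming pandas is installed. The group labels and intensity values are made up for illustration.

```python
import pandas as pd

# Categories play the role of factor levels in R.
groups = pd.Categorical(
    ["control", "treated", "control", "treated"],
    categories=["control", "treated"],
)

# Each value is stored as an integer code pointing into the categories.
codes = list(groups.codes)

df = pd.DataFrame({"group": groups, "intensity": [1.0, 3.0, 2.0, 5.0]})

# Mean intensity per group.
means = df.groupby("group", observed=True)["intensity"].mean()

print(codes)             # [0, 1, 0, 1]
print(means["control"])  # 1.5
print(means["treated"])  # 4.0
```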
Summary
In Python, lists, tuples, sets, and dictionaries are the core data structures. For handling large datasets, NumPy arrays and Pandas DataFrames are commonly used, especially in bioinformatics and -omics research. Categorical data can be represented using Pandas' Categorical type, aiding in statistical analysis and grouping.