Fundamental data structures
Part of preparing data for analysis and visualization in OMICs studies
Vectors in R
A vector is the most basic R object used to store data. Importantly, indexing of a vector in R starts at 1, not at 0 (unlike Python). Atomic vectors are homogeneous: all their elements share a single data type. Lists, in contrast, can mix types. In R, two types of vectors can be distinguished:
(Atomic) vectors:
Character (string) <chr>
Each element of such a vector is a string (a sequence of characters, possibly empty) or NA.
Integer <int>
Each element of such a vector is an integer (a whole number, not a fraction) or NA. The L suffix (as in 1L) distinguishes integer values from numeric (double) ones.
Double <dbl> or numeric <num>
Each element of such a vector is a double-precision floating-point number. The special values NA, NaN, Inf, and -Inf are also allowed.
Logical <logi>
This vector contains TRUE, FALSE, or NA entries.
Complex <cplx>
This vector type stores complex numbers, i.e., numbers with an imaginary component.
Raw <raw>
A raw vector is intended to hold raw bytes.
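The six atomic types listed above can be illustrated with a short sketch. All variable names and values here are invented for illustration and do not come from a real data set; typeof() is the base R function that reports how a vector is stored.

```r
# One small vector of each atomic type (hypothetical values).
lipid_names   <- c("PC 34:1", "SM 36:2", "Cer 42:1")  # character
patient_age   <- c(54L, 61L, NA)                      # integer (note the L suffix)
concentration <- c(12.7, NA, Inf)                     # double (numeric)
is_smoker     <- c(TRUE, FALSE, NA)                   # logical
z             <- c(1+2i, 3-1i)                        # complex
bytes         <- as.raw(c(1, 255))                    # raw

# typeof() reports the underlying storage type of each vector.
types <- c(
  typeof(lipid_names), typeof(patient_age), typeof(concentration),
  typeof(is_smoker), typeof(z), typeof(bytes)
)
print(types)

# Without the L suffix, a whole number is stored as a double.
typeof(1)    # "double"
typeof(1L)   # "integer"
```

Note that NA can appear in the character, integer, double, logical, and complex vectors without changing their type.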
Recursive vectors:
list
In R, lists can contain heterogeneous elements. A single list can store, for example, numeric vectors, integer vectors, strings, and matrices, as well as other lists.
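As a minimal sketch of such a heterogeneous list, the example below bundles several element types into one object. The element names (sample_id, concentrations, and so on) and all values are hypothetical.

```r
# A list mixing several element types (hypothetical names and values).
sample_info <- list(
  sample_id      = "S001",                       # character vector of length 1
  concentrations = c(12.7, 3.4, 8.9),            # double vector
  replicate      = 2L,                           # integer vector
  plate          = matrix(1:4, nrow = 2),        # a matrix
  qc             = list(passed = TRUE, batch = "B1")  # a nested list
)

typeof(sample_info)               # "list"
sample_info[["concentrations"]]   # [[ ]] extracts the element itself
sample_info$qc$passed             # $ works for named elements, also nested ones
sample_info["sample_id"]          # [ ] returns a list containing that element
```

The difference between [[ ]] and [ ] matters in practice: the first gives the stored object, the second gives a smaller list that still wraps it.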
The vector types most often used with lipidomics and metabolomics data are character and double (numeric) vectors, together with lists (recursive vectors). In some cases, we may also need integer or logical vectors.
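Two properties of vectors mentioned at the start of this section, indexing from 1 and the single data type of an atomic vector, can be checked directly. The sketch below uses invented lipid names; the variable names are assumptions made for illustration only.

```r
lipids <- c("PC 34:1", "SM 36:2", "Cer 42:1")

first <- lipids[1]   # "PC 34:1": the first element has index 1
none  <- lipids[0]   # index 0 selects nothing: a zero-length character vector
length(none)         # 0

# Mixing types in c() coerces all elements to one common type.
mixed <- c(1.5, "high", TRUE)
typeof(mixed)        # "character"
mixed                # "1.5" "high" "TRUE"
```

This silent coercion is a common source of surprises when a numeric column contains a stray text entry, because the whole column then becomes character.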
Other data types used in R for OMICs analysis
matrix (matrices) - a matrix is an atomic vector with two dimensions (rows and columns), so all its elements share one data type (e.g., a table holding only lipid or metabolite concentrations).
data frames (and tibbles) - are lists of equal-length vectors that may differ in type (e.g. the whole set of lipidomics or metabolomics data containing sample names, biological groups, clinical data, and concentrations of lipids/metabolites). Extracting a single column of a data frame yields a vector. In tidy data frames, one column represents one variable (a feature, lipid concentration, metabolite concentration, gender, age, tumor grade, smoking status, etc.), every row represents one observation (one patient for whom all variables are collected in columns), and values are in cells.
We will use both matrices and data frames (tibbles) when working with lipidomics and metabolomics data sets.
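The contrast between a matrix and a data frame can be sketched in base R as follows. The sample IDs, column names, and concentrations are made up, and the Label values reuse the group codes N, PAN, and T from the clinical data example. Tibbles behave similarly but require the tibble package, which this sketch does not use.

```r
# A numeric matrix: 3 samples (rows) x 2 lipids (columns); values are invented.
conc <- matrix(
  c(12.7, 10.1, 15.3,    # first column (matrix() fills column by column)
     3.4,  2.9,  4.8),   # second column
  nrow = 3,
  dimnames = list(c("S1", "S2", "S3"), c("PC 34:1", "SM 36:2"))
)
dim(conc)      # 3 2
typeof(conc)   # "double": every cell has the same type

# A data frame: columns of equal length that may differ in type.
lipidomics <- data.frame(
  Sample  = c("S1", "S2", "S3"),
  Label   = c("N", "PAN", "T"),
  Age     = c(54L, 61L, 47L),
  PC_34_1 = c(12.7, 10.1, 15.3)
)
str(lipidomics)
typeof(lipidomics)   # "list": a data frame is a list of columns

# Extracting one column with $ yields a plain vector.
age <- lipidomics$Age
is.vector(age)       # TRUE
```

A matrix is a natural input for purely numeric methods, while a data frame keeps the sample annotations next to the measurements.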
Further reading about data types in R:
Factors in R
Factors <fct> are categorical (grouping) variables in R. This data format is widely used in statistics. In -omics data, factors are typically the labels that denote biological groups. In the case of our clinical data example, we will change the Label column from character to factor. The Label column contains the biological group that every sample belongs to, e.g., N - healthy volunteers, PAN - patients with pancreatitis, and T - patients with pancreatic cancer. A factor can take only a limited set of values, called levels. Internally, R stores a factor as a vector of integer values, and the corresponding character labels are displayed when the factor is printed. More information about factors can be found here:
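As a minimal sketch of the character-to-factor conversion described above, the example below builds a small stand-in for the clinical data. The data frame name clinical, the sample IDs, and the order of samples are hypothetical; only the column name Label and the group codes N, PAN, and T come from the text.

```r
# Hypothetical clinical data with the Label column stored as character.
clinical <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  Label  = c("N", "PAN", "T", "N", "T")
)
typeof(clinical$Label)   # "character"

# Convert Label to a factor with an explicit order of levels.
clinical$Label <- factor(clinical$Label, levels = c("N", "PAN", "T"))

levels(clinical$Label)       # "N" "PAN" "T"
as.integer(clinical$Label)   # 1 2 3 1 3: the integer codes stored internally
typeof(clinical$Label)       # "integer"
table(clinical$Label)        # number of samples per biological group
```

Setting levels explicitly fixes the order of groups, which later controls, for example, the order of boxes in a plot. Without it, factor() sorts the levels alphabetically.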