Preprocess data type using Tidyverse package
A part of preparing data for analysis and visualization in OMICs analysis
Last updated
As mentioned in 'Loading data into R', the read_xlsx() function has the argument 'col_types', which is set to NULL by default. The data type of every column is therefore guessed. Before performing any analysis or visualization, it is necessary to recheck the data type of every column and adjust it where needed, i.e., sample names should be characters (<chr>), grouping variables should be factors (<fct>), and concentrations should be numeric variables (<dbl> or <num>).
For a single column, the data type can be examined using the following set of base R functions:
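As a minimal sketch of such checks, assume a small stand-in table called data; the object name, the column PC 34:1, and all values are hypothetical and only imitate the layout described in this subchapter:

```r
# Hypothetical stand-in for the lipidomics table; names and values are made up.
data <- data.frame(
  `Sample Name` = c("N1", "N2", "T1", "T2"),
  Label = c("N", "N", "T", "T"),
  `PC 34:1` = c(251.3, 240.8, 310.2, 298.7),
  check.names = FALSE,      # keep spaces and colons in column names
  stringsAsFactors = FALSE  # keep text columns as character
)

class(data[["Label"]])         # class of the column: "character"
typeof(data[["PC 34:1"]])      # storage type: "double"
is.character(data[["Label"]])  # TRUE
is.factor(data[["Label"]])     # FALSE
is.numeric(data[["PC 34:1"]])  # TRUE
str(data[["Label"]])           # compact summary of a single column
```

Here class() reports how R treats the column, typeof() reports how it is stored, and the is.*() functions return TRUE or FALSE for one specific type.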
The glimpse() function from the pillar package (available through the tidyverse collection, where it is re-exported by tibble and dplyr) allows checking all column types in your lipidomics or metabolomics data at once. We mentioned it already in the 'Data frame or tibble?' subchapter. Take a look at the code below:
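A minimal sketch of such a call, assuming a small stand-in tibble called data (the object name, the column PC 34:1, and all values are hypothetical):

```r
library(tibble)  # provides tibble() and re-exports glimpse() from pillar

# Hypothetical stand-in for the lipidomics table; names and values are made up.
data <- tibble(
  `Sample Name` = c("N1", "N2", "T1", "T2"),
  Label = c("N", "N", "T", "T"),
  `PC 34:1` = c(251.3, 240.8, 310.2, 298.7)
)

# Prints one row per column: name, type abbreviation, and the first values.
# In this sketch Label is reported as <chr> and `PC 34:1` as <dbl>.
glimpse(data)
```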
The function generates the following output:
A black frame highlights column types. The first column contains Sample Name (string of letters and numbers), and it was recognized correctly as a character vector <chr>. The Label column (in red frame) contains a grouping variable, which was guessed as a character vector. It will be necessary to change the data type stored in this column to factor <fct>. All lipid concentrations were correctly recognized as numeric vectors <dbl>.
To adjust the data type stored in a column, we will initially use base R functions: as.character(), as.factor(), as.numeric(), as.integer(), and as.logical(). Later, we will also apply the mutate() function from the dplyr package (see chapter Useful R tricks and features in OMICs mining - Data wrangling syntaxes). Changing the variable type also requires introducing a new symbol, the dollar sign ($). Using $, it is possible to access, add, delete, change, and update elements of a list or columns of a data frame. Changing the column type from character to factor can be achieved using the following line of code:
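A minimal sketch of that line, assuming the table is stored in an object called data (a hypothetical name, filled here with made-up values):

```r
# Hypothetical stand-in for the lipidomics table; names and values are made up.
data <- data.frame(
  `Sample Name` = c("N1", "N2", "T1", "T2"),
  Label = c("N", "N", "T", "T"),
  `PC 34:1` = c(251.3, 240.8, 310.2, 298.7),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Access the Label column with $ and overwrite it with its factor version.
data$`Label` <- as.factor(data$`Label`)

class(data$`Label`)   # "factor"
levels(data$`Label`)  # "N" "T"
```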
In this line of code, we accessed the Label column and then changed it into a factor column. The ` signs surrounding the column name are called backticks. In the case of the Label column, the backticks are not necessary. However, all column names containing spaces, semicolons, and other signs that are not allowed in R column names should be referred to using backticks. For the updated shorthand notations of lipids, it will be necessary to use backticks whenever referring to columns containing lipid concentrations.
More examples will be shown in the next chapters of the GitBook, but we would proceed similarly for other columns, e.g.:
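As a sketch of such conversions, assume a hypothetical table called data in which a lipid column, here named PC 34:1, was read in as text; the object name, column names, and values are all made up for illustration:

```r
library(dplyr)  # only needed for the mutate() variant at the end

# Hypothetical stand-in table; the lipid concentrations are stored as text.
data <- data.frame(
  `Sample Name` = c("N1", "N2", "T1", "T2"),
  Label = c("N", "N", "T", "T"),
  `PC 34:1` = c("251.3", "240.8", "310.2", "298.7"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Base R: backticks are required because the names contain spaces or colons.
data$`Sample Name` <- as.character(data$`Sample Name`)
data$`PC 34:1` <- as.numeric(data$`PC 34:1`)

# dplyr alternative for the grouping variable.
data <- mutate(data, Label = as.factor(Label))
```

The base R lines and the mutate() call do the same kind of work; mutate() becomes more convenient when several columns are changed in one step.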
After executing the first line of code in this subchapter, the Label column type was changed from character to factor:
The Label column is highlighted in the red frame.