Omics data visualization in R and Python
  1. PERFORMING FUNDAMENTAL OPERATIONS ON OMICs DATA USING R

Loading data into R

A part of preparing data for analysis and visualization for OMICs analysis


Last updated 4 months ago

In R, lipidomics and metabolomics data sets are handled as data frames ("data.frame"). Data frames are composed of rows (horizontally aligned data) and columns (vertically aligned data) and will be discussed in more detail in the next chapter. Here, we show how to read your lipidomics or metabolomics data set into R. Data stored as a table in a file (.csv, .xlsx, etc.) can be loaded into R as a data frame. The example data set is Lipidomics_dataset.xlsx, attached at the end of this page.

Setting up the working directory

Setting up the working directory is an essential first step for several reasons. It ensures that all your scripts, data files, and output are stored in a structured and consistent location. When you read or save files (read.csv(), write.csv(), etc.), you don’t need to specify long file paths every time. Your scripts will work across different sessions and computers if they assume the correct working directory from the start.

First, we must verify that we are in the right working directory ("wd") and, if not, specify it. Take a look at the code block below:

# Working directory (wd) is a default location on your computer used by R for reading files
# Frequently, wd is simply the 'Documents' folder. To check your wd, type:
getwd()

# To change the wd from 'Documents' to a folder of your choice, use setwd(), e.g.:
setwd("D:/.../.../...")

# or
setwd('D:/.../.../...')  

# For example, if you want your data to be read from a folder 'Data analysis' (D: drive), type:
setwd('D:/Data analysis/')

# The strings with a working directory inside of setwd() can be:
# 1) in double quotation marks ("..."), 
# 2) or single quotation marks ('...').
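
The steps above can be tried safely in a throwaway folder. The following is only a minimal sketch: the folder name and the use of tempdir() are assumptions made for illustration and are not part of the original example. It checks that the folder exists before switching to it and restores the previous working directory afterwards.

```r
# Hypothetical folder created only for this illustration
target_dir <- file.path(tempdir(), "Data analysis")
dir.create(target_dir, showWarnings = FALSE)

old_wd <- getwd()            # remember the current working directory

if (dir.exists(target_dir)) {
  setwd(target_dir)          # switch only if the folder really exists
}
new_wd <- getwd()
print(new_wd)

setwd(old_wd)                # restore the original working directory
```

Checking dir.exists() first avoids the error that setwd() raises when the path is misspelled or the folder is missing.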

Using RStudio's GUI to set the directory

Additionally, you can use RStudio's GUI to set the directory. A simple, user-friendly way to find the path to your desired folder is to use the RStudio "Files" panel. First, (1) use the three dots or the "Go to directory" button to locate and open the appropriate folder. Once the desired folder is open, (2) click on the "More" button and choose "Copy Folder Path to Clipboard". The absolute path is now copied to your clipboard, and you can paste it between the quotation marks inside the setwd() function.

You can also set the working directory from the same "More" menu by choosing "Set As Working Directory".

Reading data into R

Using absolute and relative paths

To read any kind of file into R, first you must specify the exact path to the file. There are two ways to do this:

Absolute Path specifies the full path to a file, starting from the root directory (or drive letter on Windows). It is an exact, fixed location of the file on your computer.

# Reading data using an absolute path
# (read.csv() reads .csv files; .xlsx files are covered in the readxl section below)
read.csv("D:/Data analysis/Lipidomics_data.csv")

Relative Path is defined in relation to the current working directory (the folder where your R script is running or where RStudio is set to). It’s more flexible and portable, making it better for reproducibility.

# Reading data using a relative path
# (read.csv() reads .csv files; .xlsx files are covered in the readxl section below)
setwd("D:/Data analysis")
read.csv("Lipidomics_data.csv")
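
To see that both kinds of path lead to the same file, we can run a small self-contained experiment. This is a sketch under stated assumptions: the folder "path_demo", the file "toy_lipids.csv", and its contents are hypothetical and are created in a temporary directory only for illustration.

```r
# Hypothetical folder and file created only for this illustration
demo_dir <- file.path(tempdir(), "path_demo")
dir.create(demo_dir, showWarnings = FALSE)

toy <- data.frame(Sample = c("S1", "S2"), Lipid_A = c(1.2, 3.4))
write.csv(toy, file.path(demo_dir, "toy_lipids.csv"), row.names = FALSE)

# Absolute path: works regardless of the current working directory
abs_path <- file.path(demo_dir, "toy_lipids.csv")
from_abs <- read.csv(abs_path)

# Relative path: resolved against the current working directory
old_wd <- setwd(demo_dir)    # setwd() invisibly returns the previous wd
from_rel <- read.csv("toy_lipids.csv")
setwd(old_wd)                # restore the original working directory

# Both calls read the same file, so the two data frames match
same_result <- identical(from_abs, from_rel)
print(same_result)
```

The relative call only works while the working directory points at the folder that holds the file, which is why the sketch sets and then restores it.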

Reading Excel files using readxl package

For reading Excel files (.xlsx and .xls), we can use the function read_excel(), which automatically detects the file type. We can also state the file type explicitly by using read_xlsx() or read_xls(). The data are loaded as a data frame, which is stored in the 'data' variable in the global environment:

# Loading lipidomics data into R
# Step 1 - install the library. 
# See more info in chapter: Getting started with R.
# The package readxl can be downloaded from CRAN.
install.packages("readxl")

# Step 2 - call library / load library
library(readxl)

# Step 3 - set your wd:
setwd("D:/Data analysis/")

# Step 4 - load data into R, e.g.:
data <- read_xlsx(file.choose())
# In some cases, the pop-up window remains hidden behind the RStudio interface!

After executing these lines of code, the data frame appears as 'data' in the global environment of RStudio.
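
Because file.choose() needs an interactive session, the same import can also be tried with a workbook that ships with readxl. The sketch below assumes that readxl is installed and uses its bundled example file "datasets.xlsx" (not the lipidomics data set) to compare read_excel() with read_xlsx().

```r
library(readxl)

# readxl_example() returns the path to an example workbook shipped with readxl
example_path <- readxl_example("datasets.xlsx")

# read_excel() detects the file format; read_xlsx() assumes an .xlsx file
data_auto <- read_excel(example_path)
data_xlsx <- read_xlsx(example_path)

# Both objects are tibbles, which are also data frames
class(data_auto)
is.data.frame(data_auto)

# Both calls read the first sheet, so the dimensions agree
dim(data_auto)
dim(data_xlsx)
```

When no sheet is specified, both functions read the first sheet of the workbook.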

Now, let's take a look at the arguments of the read_xlsx() function. Type:

# Check help for information about arguments of read_xlsx() function
?read_xlsx

This opens the R documentation for this function in the Help tab.

As you can see, the read_xlsx() function has multiple arguments, which become very useful once you are more experienced in R. A detailed description of the input arguments can be found in the documentation of the readxl package.
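
The argument list can also be inspected directly in the console, without opening the Help tab. This is a minimal sketch that assumes readxl is installed; formals() and args() are base R functions.

```r
library(readxl)

# List the argument names of read_xlsx()
arg_names <- names(formals(read_xlsx))
print(arg_names)

# args() prints the full signature, including the default values
args(read_xlsx)
```

This is handy for a quick reminder of an argument name, while the help page remains the place to read what each argument does.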

The functions from the readxl package need the path to the .xlsx file containing the data of interest; alternatively, you can use the file.choose() option. If your .xlsx file contains more than one sheet, you can select the sheet that you want to load into R by its number or name. Furthermore, you can import data from specific cells of the Excel file by giving a range. The col_names argument, set by default to TRUE (or T), treats the first row of data as column names. The col_types argument sets the type of data stored in each column. By default, col_types is set to NULL, which means that the column types are guessed and may require adjustments in the next steps. By default, all blank cells are interpreted as missing entries ('NA' values). The code below shows how to use some of these arguments:

# Loading data into R from a defined path (examples): 
data <- read_xlsx(path = "D:/Data analysis/Lipidomics_dataset.xlsx")

# or
data <- read_xlsx(path = 'D:/Data analysis/Lipidomics_dataset.xlsx')

# If you put your data in the working directory, you can import it using just a file name
data <- read_xlsx("Lipidomics_dataset.xlsx")

# or 
data <- read_xlsx('Lipidomics_dataset.xlsx')

# Import data from a specific sheet of the Excel file (if your data are stored in wd):
data <- read_xlsx('Lipidomics_dataset.xlsx', sheet = 2) # using sheet number

# or 
data <- read_xlsx('Lipidomics_dataset.xlsx', sheet = 'sheet_name') # using sheet name

# Import data from a specific sheet of the Excel file (your data are not stored in wd):
# Using file.choose()
data <- read_xlsx(file.choose(), sheet = 2) # using sheet number

# or 
data <- read_xlsx(file.choose(), sheet = "sheet_name") # using sheet name

# or by providing a path: 
data <- read_xlsx(path = "D:/Data analysis/Lipidomics_dataset.xlsx", sheet = 2) # using sheet number

# or 
data <- read_xlsx(path = "D:/Data analysis/Lipidomics_dataset.xlsx", sheet = "sheet_name") # using sheet name

# Reading in data from a selected range of cells if your data are in the wd:
data <- read_xlsx('Lipidomics_dataset.xlsx', range = "A1:I228")

# or - if we want to select data from a defined sheet & range of cells:
data <- read_xlsx('Lipidomics_dataset.xlsx', sheet = 2, range = 'A1:I220')

# or - if we want to select data from a sheet defined by name & range of cells:
data <- read_xlsx('Lipidomics_dataset.xlsx', sheet = 'sheet_name', range = 'A1:M50')
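
The sheet, range, col_types, and col_names arguments can be explored without the lipidomics file. The following sketch assumes readxl is installed and uses its bundled example workbook "datasets.xlsx" with its "mtcars" sheet; the chosen range is arbitrary and only for illustration.

```r
library(readxl)

# Example workbook shipped with readxl (not the lipidomics data set)
example_path <- readxl_example("datasets.xlsx")

# List the sheet names before deciding which one to import
sheet_names <- excel_sheets(example_path)
print(sheet_names)

# Import a rectangular block of cells from a sheet selected by name
block <- read_excel(example_path, sheet = "mtcars", range = "A1:C5")
dim(block)   # row 1 supplies the column names, rows 2-5 are data

# Read all columns of the block as text instead of guessing the type
block_text <- read_excel(example_path, sheet = "mtcars", range = "A1:C5",
                         col_types = "text")
sapply(block_text, class)

# Treat the first row as data; readxl then generates the column names
block_raw <- read_excel(example_path, sheet = "mtcars", range = "A1:C5",
                        col_names = FALSE)
nrow(block_raw)
```

Calling excel_sheets() first is a convenient way to find the exact sheet name to pass to the sheet argument.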

To view your whole data frame in a separate tab next to your script, go to the global environment and click on the variable name you created - 'data'. You can also type in the console:

View(data)
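
View() opens a viewer tab and therefore needs an interactive session. For a quick check in the console, base R offers a few alternatives. The sketch below uses a hypothetical miniature data frame, toy_data, as a stand-in for the imported 'data' object; its column names and values are invented for illustration.

```r
# Hypothetical miniature data set standing in for the imported 'data' object
toy_data <- data.frame(
  Sample  = c("S1", "S2", "S3"),
  Group   = c("Control", "Control", "Patient"),
  Lipid_A = c(1.2, 3.4, NA)
)

dim(toy_data)                 # number of rows and columns
str(toy_data)                 # column names, types, and first values
head(toy_data, n = 2)         # first two rows
na_counts <- colSums(is.na(toy_data))   # missing values per column
print(na_counts)
```

These calls are a fast way to confirm that column names and types were imported as expected before moving on.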

Your data set is now ready for the next steps. The complete script for introducing data into R is attached below:

Attachments:
  • Lipidomics_dataset.xlsx (348KB): Exemplary data set which will be used in this Gitbook for data analysis and visualization.
  • Reading tabular data into R via readxl.R (3KB): R script for reading data into R.

Links:
  • The readxl package documentation.

Figure captions:
  • Setting up the working directory using the "Files" panel.
  • Data frame containing lipidomics data read into R and stored as 'data' in the global environment.
  • In red frame: information about the read_xlsx() function's arguments.
  • Example of the readxl documentation.
  • Lipidomics data set overview in RStudio.