7 Running on linux
7.1 Introduction
Some bioinformatics software may require frequent updates, and the corresponding R packages may not be updated in a timely manner or may not be maintained by anyone at all. Therefore, it is best to use relatively new software for upstream data analysis. One issue is that switching between Linux systems and RStudio can become inconvenient. If you are using RStudio on a server or Linux system, a good approach is to use the R functions system or system2 to call software within the Linux system. This allows us to perform upstream data analysis directly within RStudio.
Conda is an open-source package management system and environment management system that helps you to install and manage software packages and dependencies in various programming languages, including Python, R, Ruby, Lua, Java, and others. Conda allows you to create isolated environments for different projects with their own set of packages and dependencies, so that each project can have its own environment without interfering with other projects.
Conda also provides a central repository of pre-built packages called “conda-forge”, where users can find and install packages easily. With conda, you can easily install, update, and remove packages and dependencies with a simple command-line interface or through graphical interfaces like Anaconda Navigator or Miniconda.
So we can call softwares which are installed in conda environment and use in Rstudio directly. The following chapter we describe how do this in Rstudio and do the Ribo-seq upstream data analysis.
7.2 Prepare conda environment
First you need install Miniconda:
$ wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py39_23.3.1-0-Linux-x86_64.sh
$ bash Miniconda3-py39_23.3.1-0-Linux-x86_64.sh
Add some channels for conda:
$ conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
$ conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
$ conda config --add channels bioconda
Then create and activate your environment:
$ conda create -n rnaseq
$ conda activate rnaseq
Last install softwares in this environment what you like:
$ conda install fastqc bowtie2 hisat2 samtools adapterremoval
7.3 Qc for fastq files
We call fastqc in conda rnaseq environment and check raw fastq files. call_cmd function allow you to give quoted command and excute them:
# ==============================================================================
# fastqc for fastq
# ==============================================================================
dir.create("2.fastqc-data")
fastqc <- "/root/miniconda3/envs/rnaseq/bin/fastqc"
call_cmd(fastqc,"-h")
# FastQC - A high throughput sequence QC analysis tool
#
# SYNOPSIS
#
# fastqc seqfile1 seqfile2 .. seqfileN
#
# fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
# [-c contaminant file] seqfile1 .. seqfileN
# ...
fastq <- list.files("../1.mapping_to_genome/1.raw-data/","*.gz",full.names = T)[1:2]
fastq
# [1] "../1.mapping_to_genome/1.raw-data//SRR12594201.fastq.gz"
# [2] "../1.mapping_to_genome/1.raw-data//SRR12594205.fastq.gz"
# get absolute path
fq_path <- normalizePath(fastq)
fq_path
# [1] "/mnt/f/reanalysis_from_papers/7.ribo-package-test/1.mapping_to_genome/1.raw-data/SRR12594201.fastq.gz"
# [2] "/mnt/f/reanalysis_from_papers/7.ribo-package-test/1.mapping_to_genome/1.raw-data/SRR12594205.fastq.gz"
# run qc
call_cmd(fastqc,"-t 12","-q","-o 2.fastqc-data/",fq_path)
# application/gzip
# application/gzip
# check files
list.files("2.fastqc-data")
# [1] "SRR12594201_fastqc.html" "SRR12594201_fastqc.zip" "SRR12594205_fastqc.html" "SRR12594205_fastqc.zip"
7.4 Trim adapters
Rmove adapters using AdapterRemoval:
# ==============================================================================
# trim adapters
# ==============================================================================
dir.create("3.trim-data")
AdapterRemoval <- "/root/miniconda3/envs/rnaseq/bin/AdapterRemoval"
call_cmd(AdapterRemoval,"-h")
# AdapterRemoval ver. 2.3.3
#
# This program searches for and removes remnant adapter sequences from
# your read data. The program can analyze both single end and paired end
# data. For detailed explanation of the parameters, please refer to the
# man page. For comments, suggestions and feedback please contact Stinus
# Lindgreen (stinus@binf.ku.dk) and Mikkel Schubert (MikkelSch@gmail.com).
#
# If you use the program, please cite the paper:
# Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid
# adapter trimming, identification, and read merging.
# BMC Research Notes, 12;9(1):88.
#
# http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2
# ...
# run trim
# x = 1
output_name <- paste("3.trim-data/",c("test1","test2"),sep = "")
lapply(1:2, function(x){
call_cmd(AdapterRemoval,"--threads 12",
"--adapter1 CTGTAGGCACCATCAAT",
"--trim5p 6","--trim3p 4","--minlength 15",
paste("--file1",fq_path[x]),
"--gzip",
paste("--basename",output_name[x]))
}) -> tmp
# Trimming single ended reads ...
# Opening FASTQ file '/mnt/f/reanalysis_from_papers/7.ribo-package-test/1.mapping_to_genome/1.raw-data/SRR12594201.fastq.gz', line numbers start at 1
# Processed a total of 27,170,397 reads in 3:07.2s; 145,000 reads per second on average ...
# Trimming single ended reads ...
# Opening FASTQ file '/mnt/f/reanalysis_from_papers/7.ribo-package-test/1.mapping_to_genome/1.raw-data/SRR12594205.fastq.gz', line numbers start at 1
# Processed a total of 26,741,698 reads in 2:27.8s; 180,000 reads per second on average ...
# check clean fastqs
list.files("3.trim-data/")
# [1] "test1.discarded.gz" "test1.settings" "test1.truncated.gz"
# [4] "test2.discarded.gz" "test2.settings" "test2.truncated.gz"
7.5 Remove tRNA and rRNA contamination
We use bowtie2 to align reads on trRNA index:
# ==============================================================================
# remove rRNA and tRNA contamination
# ==============================================================================
dir.create("4.rmtrRNA-data")
bowtie2 <- "/root/miniconda3/envs/rnaseq/bin/bowtie2"
call_cmd(bowtie2,"-h")
# Bowtie 2 version 2.5.1 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
# Usage:
# bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]
#
# <bt2-idx> Index filename prefix (minus trailing .X.bt2).
# NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
# <m1> Files with #1 mates, paired with files in <m2>.
# Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
# <m2> Files with #2 mates, paired with files in <m1>.
# Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
# ...
# run mapping
clean_fq <- list.files("3.trim-data/","*.truncated.gz",full.names = T)
output_name <- paste("4.rmtrRNA-data/",c("test1_rmtrRNA.fq.gz","test2_rmtrRNA.fq.gz"),sep = "")
lapply(1:2, function(x){
call_cmd(bowtie2,
"-x ../1.mapping_to_genome/0.index-data/rtRNA-index/GRCm38_trRNA",
"--threads 12",
paste("--un-gz",output_name[x]),
paste("-U",clean_fq[x]),
paste("-S ",output_name[x],".sam",sep = ""))
# remove tmp files
file.remove(list.files("4.rmtrRNA-data/","*.sam",full.names = T))
}) -> tmp
# 27154794 reads; of these:
# 27154794 (100.00%) were unpaired; of these:
# 11156550 (41.09%) aligned 0 times
# 277586 (1.02%) aligned exactly 1 time
# 15720658 (57.89%) aligned >1 times
# 58.91% overall alignment rate
# 26706601 reads; of these:
# 26706601 (100.00%) were unpaired; of these:
# 4396448 (16.46%) aligned 0 times
# 171860 (0.64%) aligned exactly 1 time
# 22138293 (82.89%) aligned >1 times
# 83.54% overall alignment rate
# check files
list.files("4.rmtrRNA-data/","*.gz")
# [1] "test1_rmtrRNA.fq.gz" "test2_rmtrRNA.fq.gz"
7.6 Align reads on genome
We use hisat2 align the clean reads on genome:
# ==============================================================================
# mapping on genome
# ==============================================================================
dir.create("5.map-data")
hisat2 <- "/root/miniconda3/envs/rnaseq/bin/hisat2"
samtools <- "/root/miniconda3/envs/rnaseq/bin/samtools"
call_cmd(hisat2,"-h")
# HISAT2 version 2.2.1 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)
# Usage:
# hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
#
# <ht2-idx> Index filename prefix (minus trailing .X.ht2).
# <m1> Files with #1 mates, paired with files in <m2>.
# Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
# <m2> Files with #2 mates, paired with files in <m1>.
# Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
# ...
# run mapping
clean_fq <- list.files("4.rmtrRNA-data/","*.gz",full.names = T)
output_name <- paste("5.map-data/",c("test1.bam","test2.bam"),sep = "")
lapply(1:2, function(x){
call_cmd(hisat2,
"-x ../1.mapping_to_genome/0.index-data/ref-index/GRCm38_ref",
"--threads 12",
paste("-U",clean_fq[x]),
paste("|",samtools,"sort","-@ 6 -o",output_name[x]))
}) -> tmp
# 11156550 reads; of these:
# 11156550 (100.00%) were unpaired; of these:
# 3307706 (29.65%) aligned 0 times
# 5063068 (45.38%) aligned exactly 1 time
# 2785776 (24.97%) aligned >1 times
# 70.35% overall alignment rate
# [bam_sort_core] merging from 0 files and 6 in-memory blocks...
# 4396448 reads; of these:
# 4396448 (100.00%) were unpaired; of these:
# 1699590 (38.66%) aligned 0 times
# 1140442 (25.94%) aligned exactly 1 time
# 1556416 (35.40%) aligned >1 times
# 61.34% overall alignment rate
# [bam_sort_core] merging from 0 files and 6 in-memory blocks...
Make index for bam files:
# index for bam
call_cmd("ls","5.map-data/*.bam","| xargs -i",samtools,"index -@ 12 {}")
# check files
list.files("5.map-data/","*")
# [1] "test1.bam" "test1.bam.bai" "test2.bam" "test2.bam.bai"
As you see, you can call linux softwares easily in Rstudio and do not need to turn into linux system and write command line. This is great.
Just try it now and see what you can do.