Chapter 4 Track data input

Many tools are available for visualizing NGS (next-generation sequencing) data, including ChIP-seq, ATAC-seq, RNA-seq, Hi-C, HiChIP and so on. Gviz, plotgardener, ggcoverage and ggbio are popular R packages for this purpose, alongside online or local software such as IGV, Wubrowse and the UCSC Genome Browser. Some of these tools still make it hard to produce publication-quality graphics with little code and in little time. Here I provide utilities and functions, built entirely on the ggplot2 package, for visualizing NGS data in R and creating high-quality graphs for publication.

Although I have already written a similar package, transPlotR, which serves some of the same purposes for generating track plots, I ran into limitations and shortcomings when using it. So I have once again devoted myself to developing and expanding these functions.

The main plotting function is trackVisProMax, which is combined with several data-loading functions to create a track graph.

4.1 Load signal data

The bigwig format is a binary version of wig with a smaller file size. I recommend using this format, although the “wig” and “bedGraph” formats are also accepted; you only need to set the format parameter accordingly. The loadBigWig function reads bigwig data into R and is based on rtracklayer::import.bw. Here is an example of loading your bigwig data:

library(BioSeqUtils)

# load bigwig files
file <- list.files(path = "test-bw/",pattern = "\\.bw$",full.names = TRUE)
file
# [1] "test-bw/1cell-m6A-1.bw" "test-bw/1cell-m6A-2.bw" "test-bw/1cell-RNA-1.bw"
# [4] "test-bw/1cell-RNA-2.bw" "test-bw/2cell-m6A-1.bw" "test-bw/2cell-m6A-2.bw"
# [7] "test-bw/2cell-RNA-1.bw" "test-bw/2cell-RNA-2.bw"

# select some chromosomes for test
bw <- loadBigWig(bw_file = file,chrom = c("5","15"),format = "bw")

# check
head(bw,3)
#   seqnames   start     end   score    fileName
# 1       15       1 3054635 0.00000 1cell-m6A-1
# 2       15 3054636 3054640 1.34079 1cell-m6A-1
# 3       15 3054641 3054715 2.68159 1cell-m6A-1

To save space, we selected chromosomes 5 and 15 from each file. If you don’t specify the chrom parameter, loadBigWig returns all chromosomes. You can also specify file_name to assign a new name to each bigwig file.
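
Because the loaded signal is an ordinary data frame, it can be trimmed to the region you intend to plot before passing it on. The following is a minimal sketch on a toy data frame that mimics the columns shown above (the first three rows are copied from that output, the fourth is made up). subset_region is a hypothetical helper written for this illustration and is not part of BioSeqUtils.

```r
# Toy signal table with the same columns as the loadBigWig output above.
bw_toy <- data.frame(
  seqnames = c("15", "15", "15", "5"),
  start    = c(1, 3054636, 3054641, 100),
  end      = c(3054635, 3054640, 3054715, 200),
  score    = c(0, 1.34079, 2.68159, 0.5),
  fileName = "1cell-m6A-1",
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep intervals that overlap [from, to] on one chromosome.
subset_region <- function(df, chrom, from, to) {
  keep <- df$seqnames == chrom & df$end >= from & df$start <= to
  df[keep, , drop = FALSE]
}

region <- subset_region(bw_toy, chrom = "15", from = 3054638, to = 3054700)
region
```

Filtering by overlap rather than by containment keeps intervals that only partly fall inside the window, which avoids gaps at the edges of a plotted region.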

4.2 Load peaks data

The loadBed function reads peaks data into R and is based on rtracklayer::import.bed. Usually only the first three columns are kept for downstream analysis. You can also specify file_name to assign a new name to each bed file. Here is an example of reading peaks data:

bedfile <- list.files(path = "./",pattern = "\\.bed$")
bedfile
# [1] "peaks.bed"  "peaks2.bed"

bed_df <- loadBed(bedfile)

# check
head(bed_df,3)
#   seqnames     start       end sampleName y
# 1        5 142905501 142905600      peaks 1
# 2        5 142903201 142903800      peaks 1
# 3       15  61985342  61985900      peaks 1
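
To make the coordinate handling concrete, here is a rough base-R sketch of reading the first three columns of a BED file. It is not how loadBed is implemented; read_bed3 is a hypothetical helper, the peaks are made up, and the file is assumed to be tab-separated with no header or track lines. The one detail worth noting is that BED starts are 0-based, whereas rtracklayer::import.bed reports 1-based starts.

```r
# Write a tiny BED file to a temporary location (made-up peaks).
bed_path <- file.path(tempdir(), "peaks_demo.bed")
writeLines(c("5\t142905500\t142905600", "15\t61985341\t61985900"), bed_path)

# Hypothetical helper: read the first three BED columns with base R.
read_bed3 <- function(path) {
  # Read everything as text so chromosome names such as "5" stay character.
  df <- read.table(path, sep = "\t", header = FALSE,
                   colClasses = "character")[, 1:3]
  names(df) <- c("seqnames", "start", "end")
  # BED starts are 0-based; shift them to 1-based coordinates.
  df$start <- as.numeric(df$start) + 1
  df$end   <- as.numeric(df$end)
  # Label every row with the file name minus its extension.
  df$sampleName <- tools::file_path_sans_ext(basename(path))
  df
}

bed_demo <- read_bed3(bed_path)
bed_demo
```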

4.3 Load links data

Links data usually describe interactions between two genomic sites, as measured by technologies such as Hi-C and HiChIP. The data can be in bed or bedpe format, or you can supply a simple four-column format (chrom, start, end, value). The loadloops function reads these formats into R, as the following code shows:

Note: You should supply file_name for each file.

loop_file <- list.files("test-bw2/",pattern = ".bedpe$",full.names = T)
loop_file
# [1] "test-bw2/C1-CTCF.bedpe"    "test-bw2/C1-H3K27ac.bedpe" "test-bw2/M1-CTCF.bedpe"   
# [4] "test-bw2/M1-H3K27ac.bedpe"

file_name = c("C1-CTCF","C1-H3K27ac","M1-CTCF","M1-H3K27ac")

# test code
loop_data <- loadloops(loop_file = loop_file,file_name = file_name,
                       sep = " ")

# check
head(loop_data,3)

#   seqnames     start       end    score fileName
# 1    chr10 100002774 100022436 0.021354  C1-CTCF
# 2    chr10 100002774 100069170 0.068404  C1-CTCF
# 3    chr10 100002774 100185646 0.184670  C1-CTCF
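
If your loops are stored as BEDPE and you would rather supply the four-column format, one possible reduction is sketched below. How loadloops itself converts BEDPE records is not described here, so the midpoint convention is purely an assumption, bedpe_to_links is a hypothetical helper, and the records are made up.

```r
# Made-up BEDPE records: chrom1 start1 end1 chrom2 start2 end2 name score
bedpe <- data.frame(
  chrom1 = c("chr10", "chr10"),
  start1 = c(100000, 100000),
  end1   = c(105000, 105000),
  chrom2 = c("chr10", "chr11"),
  start2 = c(200000, 300000),
  end2   = c(205000, 305000),
  name   = c("loop1", "loop2"),
  score  = c(0.02, 0.07),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reduce each intra-chromosomal pair to one arc that
# runs from the midpoint of one anchor to the midpoint of the other.
bedpe_to_links <- function(df) {
  df <- df[df$chrom1 == df$chrom2, , drop = FALSE]  # drop inter-chromosomal pairs
  mid1 <- (df$start1 + df$end1) / 2
  mid2 <- (df$start2 + df$end2) / 2
  data.frame(
    seqnames = df$chrom1,
    start    = pmin(mid1, mid2),
    end      = pmax(mid1, mid2),
    score    = df$score,
    stringsAsFactors = FALSE
  )
}

links <- bedpe_to_links(bedpe)
links
```

Inter-chromosomal pairs are dropped here because a single (chrom, start, end) row cannot represent them.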

4.4 Load Hi-C related data

Hi-C related data come in multiple formats, including .h5, .hic, .cool, .mcool and so on, depending mainly on the upstream software. The most common formats are .hic and .cool, so we focus on these for visualization. If your file is in another format, you can convert it to a suitable one with the hicConvertFormat command from HiCExplorer.

Different upstream tools generate matrices at different resolutions, so make sure you know the resolution of the Hi-C data you are using. For .hic data, you can check the available resolutions with the strawr::readHicBpResolutions function, as shown here:

# install.packages("strawr")
library(strawr)

# test data
readHicBpResolutions(system.file("extdata", "test.hic", package = "strawr"))
# [1] 2500000

# real data
readHicBpResolutions("test-bw2/RPE-ICRF193_5uM.hic")
# [1] 2500000 1000000  500000  250000  100000   50000   25000   10000    5000
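
Since loading Hi-C data is expensive, it can be worth validating the bin size before starting. The sketch below works on the vector of resolutions reported above; check_resolution is a hypothetical helper for illustration, not a function from strawr or BioSeqUtils.

```r
# Resolutions reported for the real data set above.
available <- c(2500000, 1000000, 500000, 250000, 100000,
               50000, 25000, 10000, 5000)

# Hypothetical helper: stop early if the requested bin size is not stored
# in the file, and suggest the closest one that is.
check_resolution <- function(requested, available) {
  if (!requested %in% available) {
    nearest <- available[which.min(abs(available - requested))]
    stop("Resolution ", format(requested, scientific = FALSE),
         " not available; nearest is ", format(nearest, scientific = FALSE),
         call. = FALSE)
  }
  invisible(requested)
}

check_resolution(10000, available)  # passes silently

# An unavailable resolution raises an error; capture its message here.
res <- tryCatch(check_resolution(20000, available),
                error = function(e) conditionMessage(e))
res
```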

The readHicChroms function can be used to view the chromosome names in your data:

# test data
readHicChroms(system.file("extdata", "test.hic", package = "strawr")) %>% 
  head()

#   name    length
# 1    1 249250621
# 2   10 135534747
# 3   11 135006516
# 4   12 133851895
# 5   13 115169878
# 6   14 107349540

# real data
readHicChroms("test-bw2/RPE-ICRF193_5uM.hic") %>% head()
#   name    length
# 1    1 249250621
# 2   10 135534747
# 3   11 135006516
# 4   12 133851895
# 5   13 115169878
# 6   14 107349540
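
Chromosome naming differs between files (for example "1" versus "chr1"), and a mismatch is an easy mistake to make when setting chrom. A small sketch of reconciling the two styles follows; match_chrom_style is a hypothetical helper that handles a single name at a time.

```r
# Hypothetical helper: rewrite a chromosome name so that it matches the
# naming style found in `available` (with or without the "chr" prefix).
match_chrom_style <- function(chrom, available) {
  bare <- sub("^chr", "", chrom)
  candidates <- c(chrom, bare, paste0("chr", bare))
  hit <- candidates[candidates %in% available]
  if (length(hit) == 0) {
    stop("Chromosome ", chrom, " not found", call. = FALSE)
  }
  hit[1]
}

# Names as reported by readHicChroms() in the output above.
hic_chroms <- c("1", "10", "11", "12", "13", "14")

match_chrom_style("chr1", hic_chroms)
match_chrom_style("12", hic_chroms)
```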

The prepareHic function reads .hic and .cool data. These files are usually large and consume a lot of disk space and memory, so please make sure you have enough resources to handle them. Here are some examples:

Test data can be downloaded from GEO under accession GSE200160.

For .hic data, we use the plotgardener::readHic function to read the data into R; the result is already in upper-triangle matrix format.

hic_data <- list.files("test-bw2/",pattern = ".hic",full.names = T)
hic_data
# [1] "test-bw2/RPE-doxorubicin_02uM.hic"  "test-bw2/RPE-doxorubicin_034uM.hic"
# [3] "test-bw2/RPE-ICRF193_5uM.hic"

hic_df <- prepareHic(hic_path = hic_data,
                     file_name = c("doxorubicin_02uM","doxorubicin_34uM",
                                   "ICRF193_5uM"),
                     chrom = "1",assembly = "hg19",
                     resolution = 10000)

# check
head(hic_df[1:3,])
#   seqnames  start   end     score         fileName id
# 1     chr1 710000  5000  72.90664 doxorubicin_02uM  1
# 2     chr1 715000     0 323.53738 doxorubicin_02uM  2
# 3     chr1 715000 10000 212.93642 doxorubicin_02uM  3

You can extract multiple chromosomes from multiple files:

hic_df <- prepareHic(hic_path = hic_data,
                     file_name = c("doxorubicin_02uM","doxorubicin_34uM",
                                   "ICRF193_5uM"),
                     chrom = c("1","2","3"),assembly = "hg19",
                     resolution = 10000)

# check
head(hic_df[1:3,])
#   seqnames  start   end     score         fileName id
# 1     chr1 710000  5000  72.90664 doxorubicin_02uM  1
# 2     chr1 715000     0 323.53738 doxorubicin_02uM  2
# 3     chr1 715000 10000 212.93642 doxorubicin_02uM  3

You can also restrict the extraction to a specific region:

hic_df <- prepareHic(hic_path = hic_data,
                     file_name = c("doxorubicin_02uM","doxorubicin_34uM",
                                   "ICRF193_5uM"),
                     chrom = "1",assembly = "hg19",
                     chromstart = 20000000, chromend = 47500000,
                     resolution = 10000)

# check
head(hic_df[1:3,])
#   seqnames    start  end     score         fileName id
# 1     chr1 19995000    0  96.48324 doxorubicin_02uM  1
# 2     chr1 20000000 5000  38.00954 doxorubicin_02uM  2
# 3     chr1 20005000    0 102.91180 doxorubicin_02uM  3
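
To get a feel for how region size and resolution drive memory use, the following back-of-the-envelope sketch counts the cells in the upper triangle of a binned region. max_contact_cells is a hypothetical helper, and the result is only a rough ceiling: Hi-C matrices are sparse, so the number of rows actually returned is usually lower.

```r
# Hypothetical helper: upper bound on the number of upper-triangle cells
# (diagonal included) for a region binned at a given resolution.
max_contact_cells <- function(chromstart, chromend, resolution) {
  n_bins <- ceiling((chromend - chromstart) / resolution)
  n_bins * (n_bins + 1) / 2
}

# Region and resolution from the example above.
cells_10kb <- max_contact_cells(20000000, 47500000, 10000)
# Bins ten times larger shrink the bound roughly a hundredfold.
cells_100kb <- max_contact_cells(20000000, 47500000, 100000)

c(cells_10kb = cells_10kb, cells_100kb = cells_100kb)
```

Because the bound grows with the square of the number of bins, narrowing the region or choosing a coarser resolution are the two most effective ways to keep the data frame manageable.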

Test data can be downloaded from GEO under accession GSE222637.

For the .cool format, prepareHic runs slowly and uses a lot of memory:

hic_data <- list.files("test-bw3/",pattern = ".cool",full.names = T)
hic_data
# [1] "test-bw3/HiC_Adril-1_10kb.cool" "test-bw3/HiC_Ctrl-1_10kb.cool"

hic_df <- prepareHic(hic_path = hic_data,
                     chrom = "chr1",
                     resolution = 10000)

# check
head(hic_df,3)
#   seqnames    start      end score         fileName id
# 1     chr1    -5000        0     4 HiC_Adril-1_10kb  1
# 2     chr1    80000    85000     1 HiC_Adril-1_10kb  2
# 3     chr1 59660000 59665000     1 HiC_Adril-1_10kb  3

4.5 Extract junction data

loadJunction loads junction data either from your own bed-format file, which records differential splice site information identified by other tools, or directly from your bam files. In the latter case we use megadepth::bam_to_junctions to extract all junctions and return them as a data frame. See the megadepth documentation for more details. Here is an example:

bam_file <- list.files(path = "F:/junc-test/",
                       pattern = ".bam$",full.names = T)
bam_file
# [1] "F:/junc-test/C1.sorted.bam" "F:/junc-test/WT.sorted.bam"

junc_df <- loadJunction(data_path = bam_file,
                        file_name = c("C1","WT"))

# check
head(junc_df,3)
#   seqnames   start     end score fileName
# 1        1 3154117 3159706     1       C1
# 2        1 3207318 3213608     1       C1
# 3        1 4492669 4493099     8       C1

# keep junctions supported by at least 5 reads
junc_df <- dplyr::filter(junc_df, score >= 5)

The score is the read count supporting each junction. Extracting all junction sites from bam files takes some time, so I recommend fetching the significant junction sites identified by other software for visualization.
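
When you do work with junctions extracted from bam files, a read-count cutoff is the simplest way to reduce clutter. The sketch below tabulates how many junctions per sample survive a few candidate cutoffs, which can help in choosing one. The table is a toy example with made-up counts, and count_by_threshold is a hypothetical helper, not part of BioSeqUtils.

```r
# Toy junction table mimicking the columns of a loadJunction result.
junc_toy <- data.frame(
  seqnames = "1",
  start    = c(3154117, 3207318, 4492669, 3154117, 4492669),
  end      = c(3159706, 3213608, 4493099, 3159706, 4493099),
  score    = c(1, 1, 8, 6, 12),
  fileName = c("C1", "C1", "C1", "WT", "WT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of junctions per sample whose read count
# reaches each candidate threshold.
count_by_threshold <- function(df, thresholds) {
  samples <- sort(unique(df$fileName))
  out <- matrix(0L, nrow = length(samples), ncol = length(thresholds),
                dimnames = list(samples, paste0("min_", thresholds)))
  for (j in seq_along(thresholds)) {
    for (i in seq_along(samples)) {
      out[i, j] <- sum(df$score[df$fileName == samples[i]] >= thresholds[j])
    }
  }
  out
}

junc_counts <- count_by_threshold(junc_toy, thresholds = c(1, 5, 10))
junc_counts
```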