2 Sequence Extraction

2.1 Preparing transcript sequences

The riboTransVis package provides several functions to extract transcript and CDS sequences, facilitating seamless integration with downstream analyses within the package.

2.2 Renaming transcript sequences from Ensembl

If you are aligning reads to the transcriptome and working with transcriptome-aligned BAM files, riboTransVis requires transcript sequence names in a specific format: transcript_ID|gene_name.

If you downloaded transcriptome sequence files from the Ensembl database, you can use the prepare_transcript_file() function to convert the sequence names to the required format. Below is an example using a transcriptome FASTA file of S. cerevisiae (yeast) downloaded from Ensembl:

wget https://ftp.ensembl.org/pub/release-113/fasta/saccharomyces_cerevisiae/cdna/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz

zless Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz

>YPL071C_mRNA cdna chromosome:R64-1-1:XVI:420048:420518:-1 gene:YPL071C gene_biotype:protein_coding transcript_biotype:protein_coding description:Putative protein of unknown function; green fluorescent protein (GFP)-fusion protein localizes to both the cytoplasm and the nucleus [Source:SGD;Acc:S000005992]
ATGAGTTCCCGGTTTGCAAGAAGTAATGGCAATCCCAACCACATTAGGAAAAGAAATCAT
TCTCCAGACCCAATAGGAATTGATAATTATAAAAGAAAAAGACTAATTATAGATTTAGAG
AATTTATCCTTAAATGATAAAGGGCCCAAGAACGGACATGCAGATGATAACAATCTTATT
CATAACAATATAGTATTCACAGACGCTATTGATGATAAGGTCCTGAAAGAGATCATCAAG
TGTTCCACAAGTAAACGCGGCGACAATGACTTGTTTTATGACAAAATATGGGAACGTTTG
AGAGAAAAAAGGCTACAAATAATAAAATGGGTAGATTATAAGGAAATTGCTTATCTAAGC
TGGTGGAAGTGGTTCCATAATCAAATGACTTCGAAATACACTTATGATGGAGAGGCTGAT
ACCGATGTTGAAATGATGGCAGTGGATACTGATGTGGATATGGATGCGTAA
>YLL050C_mRNA cdna chromosome:R64-1-1:XII:39804:40414:-1 gene:YLL050C gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:COF1 description:Cofilin, involved in pH-dependent actin filament depolarization; binds both actin monomers and filaments and severs filaments; involved in the selective sorting, export of the secretory cargo from the late golgi; genetically interacts with pmr1; thought to be regulated by phosphorylation at SER4; ubiquitous and essential in eukaryotes [Source:SGD;Acc:S000003973]
ATGTCTAGATCTGGTGTTGCTGTTGCTGATGAATCCCTTACCGCTTTCAATGACTTGAAA
TTGGGTAAAAAATACAAATTTATTTTATTCGGATTGAACGATGCTAAAACCGAAATCGTT
GTCAAGGAAACCTCTACTGACCCATCTTACGATGCCTTCTTAGAGAAATTGCCAGAAAAC
GACTGTCTTTACGCCATTTACGATTTTGAATACGAAATTAATGGTAATGAAGGTAAGAGA
TCCAAGATTGTTTTCTTCACTTGGTCTCCAGACACTGCTCCAGTCAGATCTAAGATGGTC
TATGCATCCTCCAAGGATGCCTTAAGAAGAGCCTTAAACGGTGTCTCTACCGATGTTCAA
GGTACTGATTTTTCCGAAGTTTCTTACGATTCTGTTTTGGAAAGAGTCAGCAGAGGCGCT
GGTTCTCATTAA

Simply providing the input FASTA file and specifying the output FASTA file is sufficient to complete the conversion:

library(riboTransVis)

prepare_transcipt_file(transcript_fa = "../Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa",
                       output_path = "../sac_trans.fa")

The figure below shows a comparison of the transcriptome FASTA file before and after ID conversion:

2.3 Extracting the transcriptome File

Alternatively, riboTransVis provides the get_transcript_sequence() function, which can be used to extract all transcript sequences based on the genome FASTA file and the corresponding GTF annotation file.

Since most genes in yeast species lack annotated UTR regions, it is often beneficial to extend the transcript sequences upstream and downstream to facilitate Ribo-seq–related analyses. This can be done using the extend parameter. Typically, an extension of 50 nucleotides is sufficient.

get_transcript_sequence(genome_file = "../../index-data/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa",
                        gtf_file = "../../index-data/Saccharomyces_cerevisiae.R64-1-1.112.gtf",
                        feature = "exon",
                        extend = T,
                        extend_upstream = 50,
                        extend_downstream = 50,
                        output_file = "sac_trans.fa")

2.4 Extracting CDS sequences

In certain analyses, coding sequence (CDS) information may be required. You can extract the CDS regions from all transcripts by setting feature = “CDS”:

get_transcript_sequence(genome_file = "../../index-data/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa",
                        gtf_file = "../../index-data/Saccharomyces_cerevisiae.R64-1-1.112.gtf",
                        feature = "CDS",
                        output_file = "sac_cds.fa")

2.5 Extracting longest transcript sequences

The get_longest_transcript function extracts the longest transcript sequence.
The selection criteria are as follows:

1. Transcripts are first sorted by the length of their CDS (coding sequence), and the one with the longest CDS is selected.
2. If multiple transcripts have the same CDS length, then the one with the longest full transcript sequence is chosen.

Here is an example:

lt <- get_longest_transcript(genome_file = "Homo_sapiens.GRCh38.dna.primary_assembly.fa",
                             gtf_file = "Homo_sapiens.GRCh38.94.gtf.gz",
                             output_file = "longest_trans.fa")

ltfa <- Biostrings::readDNAStringSet("longest_trans.fa")
ltfa
# DNAStringSet object of length 57169:
#      width seq                                                                                         names               
# [1]   1657 GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTC...GCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG ENST00000456328|D...
# [2]    712 GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCC...AACTTGGACTTCCAAGCCTCCAGAACTGTGAGGGATAAATGTAT ENST00000473358|M...
# [3]    138 GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTT...AACATTATTGGTTGTTTATCTGAGATTCAGAATTAAGCATTTTA ENST00000607096|M...
# [4]    840 GCGGTATCTAAATTTGTATTGATTGGACTTTCAAGCTCTTGGGA...CACCCCTGTCTAGGATCTACACATTAAGAAACAAAGACATGAAC ENST00000606857|O...
# [5]   1414 AGCTATCTGAATTTCTCCTTCTCCTAAAAATGCACATCCTATGA...AAAAAGTATATATTTCTATCTAATGTGTGTATCTAATTAACAGC ENST00000642116|O...
# ...    ... ...
# [57165]    603 AAATCTGCTCCCGGGGGTATTCTTGACTTAAACAAGGTTGCAAC...AAGAAAATCCTCAGCAAAGTGAAGAATTGCTTGAAGTAAGCAAC ENST00000620795|A...
# [57166]    603 AAATCTGCTCCCGGGGGTATTCTTGACTTAAACAAGGTTGCAAC...AAGAAAATCCTCAGCAAAGTGAAGAATTGCTTGAAGTAAGCAAC ENST00000615362|A...
# [57167]   2404 GGCGGCTGGACGAGGACGCTCCGAGCCCAGCTCTCGAGAGTTCA...CGCCACTGCACTCCAGCCTGGGTGACAGAGCGAGACTCCGTCTC ENST00000617983|A...
# [57168]   1213 CGCGAGGCGCGCCGCGATCGGGGACTGTCCTAAGACGGGCGGGG...CTCTGTGTGACCCAGAGAAATAAAGATGCCTCAGTGTGGCCCGC ENST00000613204|A...
# [57169]   2405 GGTCTCACTCTGTTGCTGTCTTCACGGAGAGCAGGAGCAGAGGC...CTGGGGAGGCCTGCCTGGTCAATAAACCACTGTTCCTGCAGCTG ENST00000621424|A...

The function also returns detailed information for each selected transcript:

# check
head(lt)
# # A tibble: 6 × 10
# # Groups:   gene [6]
#   transcript_id   idnew                         utr5   cds  utr3 exonlen translen mstart mstop gene        
#   <chr>           <chr>                        <dbl> <dbl> <dbl>   <int>    <dbl>  <dbl> <dbl> <chr>       
# 1 ENST00000609567 ENST00000609567|hsa-mir-1253     0     0     0     105      105      0     0 hsa-mir-1253
# 2 ENST00000381638 ENST00000381638|ZZEF1          125  8883  2445   11456    11456    126  9008 ZZEF1       
# 3 ENST00000617638 ENST00000617638|ZYXP1            0     0     0     118      118      0     0 ZYXP1       
# 4 ENST00000322764 ENST00000322764|ZYX            345  1716   429    2493     2493    346  2061 ZYX         
# 5 ENST00000294353 ENST00000294353|ZYG11B         145  2232  5713    8093     8093    146  2377 ZYG11B      
# 6 ENST00000430329 ENST00000430329|ZYG11AP1         0     0     0     837      837      0     0 ZYG11AP1