wget https://ftp.ensembl.org/pub/release-113/fasta/saccharomyces_cerevisiae/cdna/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz
zless Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz
>YPL071C_mRNA cdna chromosome:R64-1-1:XVI:420048:420518:-1 gene:YPL071C gene_biotype:protein_coding transcript_biotype:protein_coding description:Putative protein of unknown function; green fluorescent protein (GFP)-fusion protein localizes to both the cytoplasm and the nucleus [Source:SGD;Acc:S000005992]
ATGAGTTCCCGGTTTGCAAGAAGTAATGGCAATCCCAACCACATTAGGAAAAGAAATCAT
TCTCCAGACCCAATAGGAATTGATAATTATAAAAGAAAAAGACTAATTATAGATTTAGAG
AATTTATCCTTAAATGATAAAGGGCCCAAGAACGGACATGCAGATGATAACAATCTTATT
CATAACAATATAGTATTCACAGACGCTATTGATGATAAGGTCCTGAAAGAGATCATCAAG
TGTTCCACAAGTAAACGCGGCGACAATGACTTGTTTTATGACAAAATATGGGAACGTTTG
AGAGAAAAAAGGCTACAAATAATAAAATGGGTAGATTATAAGGAAATTGCTTATCTAAGC
TGGTGGAAGTGGTTCCATAATCAAATGACTTCGAAATACACTTATGATGGAGAGGCTGAT
ACCGATGTTGAAATGATGGCAGTGGATACTGATGTGGATATGGATGCGTAA
>YLL050C_mRNA cdna chromosome:R64-1-1:XII:39804:40414:-1 gene:YLL050C gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:COF1 description:Cofilin, involved in pH-dependent actin filament depolarization; binds both actin monomers and filaments and severs filaments; involved in the selective sorting, export of the secretory cargo from the late golgi; genetically interacts with pmr1; thought to be regulated by phosphorylation at SER4; ubiquitous and essential in eukaryotes [Source:SGD;Acc:S000003973]
ATGTCTAGATCTGGTGTTGCTGTTGCTGATGAATCCCTTACCGCTTTCAATGACTTGAAA
TTGGGTAAAAAATACAAATTTATTTTATTCGGATTGAACGATGCTAAAACCGAAATCGTT
GTCAAGGAAACCTCTACTGACCCATCTTACGATGCCTTCTTAGAGAAATTGCCAGAAAAC
GACTGTCTTTACGCCATTTACGATTTTGAATACGAAATTAATGGTAATGAAGGTAAGAGA
TCCAAGATTGTTTTCTTCACTTGGTCTCCAGACACTGCTCCAGTCAGATCTAAGATGGTC
TATGCATCCTCCAAGGATGCCTTAAGAAGAGCCTTAAACGGTGTCTCTACCGATGTTCAA
GGTACTGATTTTTCCGAAGTTTCTTACGATTCTGTTTTGGAAAGAGTCAGCAGAGGCGCT
GGTTCTCATTAA
2 Sequence Extraction
2.1 Preparing transcript sequences
The riboTransVis package provides several functions for extracting transcript and CDS sequences in the form that its downstream analyses expect.
2.2 Renaming transcript sequences from Ensembl
If you are aligning reads to the transcriptome and working with transcriptome-aligned BAM files, riboTransVis requires transcript sequence names in a specific format: transcript_ID|gene_name.
If you downloaded transcriptome sequence files from the Ensembl database, you can use the prepare_transcript_file() function to convert the sequence names to the required format. Below is an example using a transcriptome FASTA file of S. cerevisiae (yeast) downloaded from Ensembl:
Simply providing the input FASTA file and specifying the output FASTA file is sufficient to complete the conversion:
library(riboTransVis)
prepare_transcript_file(transcript_fa = "../Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa",
                        output_path = "../sac_trans.fa")
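To make the target format concrete, the following base-R sketch builds a transcript_ID|gene_name label from an Ensembl cDNA header. It is not the riboTransVis implementation: the helper name rename_ensembl_header() is hypothetical, and the rule of preferring gene_symbol and falling back to the gene ID when no symbol is present is an assumption made for illustration. The headers are abbreviated versions of the two yeast records from the Ensembl file.

```r
# Illustrative only: NOT the riboTransVis implementation.
# rename_ensembl_header() is a hypothetical helper name.
rename_ensembl_header <- function(header) {
  # The transcript ID is the first whitespace-delimited token (without ">").
  transcript_id <- sub("^>?(\\S+).*$", "\\1", header)

  # Extract the value of a "key:value" field; NA if the key is absent.
  get_field <- function(key) {
    pattern <- paste0("(^|\\s)", key, ":(\\S+)")
    m <- regmatches(header, regexpr(pattern, header, perl = TRUE))
    if (length(m) == 0) NA_character_ else sub(paste0("^\\s*", key, ":"), "", m)
  }

  gene_symbol <- get_field("gene_symbol")
  gene_id <- get_field("gene")

  # Assumption: prefer the gene symbol, otherwise fall back to the gene ID.
  gene_name <- if (!is.na(gene_symbol)) gene_symbol else gene_id
  paste(transcript_id, gene_name, sep = "|")
}

h1 <- ">YPL071C_mRNA cdna chromosome:R64-1-1:XVI:420048:420518:-1 gene:YPL071C gene_biotype:protein_coding"
h2 <- ">YLL050C_mRNA cdna chromosome:R64-1-1:XII:39804:40414:-1 gene:YLL050C gene_biotype:protein_coding gene_symbol:COF1"

rename_ensembl_header(h1)  # "YPL071C_mRNA|YPL071C"
rename_ensembl_header(h2)  # "YLL050C_mRNA|COF1"
```

The names actually written by prepare_transcript_file() may differ in detail (for example in how transcripts without a gene symbol are labeled), so check the output file before relying on this pattern.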
The figure below shows a comparison of the transcriptome FASTA file before and after ID conversion:
2.3 Extracting the transcriptome file
Alternatively, riboTransVis provides the get_transcript_sequence() function, which can be used to extract all transcript sequences based on the genome FASTA file and the corresponding GTF annotation file.
Because most yeast genes lack annotated UTRs, it is often useful to extend the transcript sequences upstream and downstream for Ribo-seq analyses. To do so, set extend = TRUE and give the extension lengths with extend_upstream and extend_downstream. An extension of 50 nucleotides on each side is typically sufficient.
get_transcript_sequence(genome_file = "../../index-data/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa",
                        gtf_file = "../../index-data/Saccharomyces_cerevisiae.R64-1-1.112.gtf",
                        feature = "exon",
                        extend = TRUE,
                        extend_upstream = 50,
                        extend_downstream = 50,
                        output_file = "sac_trans.fa")
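The idea behind the extension can be sketched in base R as strand-aware arithmetic on the transcript's genomic coordinates. This is only an illustration of the concept, not the package's code: the function extend_span() and its argument names are hypothetical, and clipping at the chromosome ends is an assumption about how boundary cases might reasonably be handled.

```r
# Illustrative sketch, not the package's code: extend a transcript's genomic
# span by fixed amounts, respecting strand and chromosome boundaries.
extend_span <- function(start, end, strand, chrom_length,
                        upstream = 50, downstream = 50) {
  if (strand == "+") {
    new_start <- start - upstream
    new_end <- end + downstream
  } else {
    # On the minus strand, "upstream" lies at higher genomic coordinates.
    new_start <- start - downstream
    new_end <- end + upstream
  }
  # Assumed behavior: clip so that no bases beyond the chromosome are requested.
  c(start = max(1, new_start), end = min(chrom_length, new_end))
}

# YPL071C_mRNA sits at XVI:420048-420518 on the minus strand.
# chrom_length is a placeholder value, not the real length of chromosome XVI.
extend_span(420048, 420518, "-", chrom_length = 1e6)
# start = 419998, end = 420568

# A transcript near the start of a chromosome is clipped at position 1.
extend_span(10, 500, "+", chrom_length = 1e6)
# start = 1, end = 550
```

With equal upstream and downstream extensions the genomic span is the same on both strands; the strand only matters when the two values differ.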
2.4 Extracting CDS sequences
Some analyses require coding sequence (CDS) information. To extract the CDS regions of all transcripts, set feature = "CDS":
get_transcript_sequence(genome_file = "../../index-data/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa",
                        gtf_file = "../../index-data/Saccharomyces_cerevisiae.R64-1-1.112.gtf",
                        feature = "CDS",
                        output_file = "sac_cds.fa")
2.5 Extracting longest transcript sequences
The get_longest_transcript() function extracts the longest transcript sequence for each gene.
The selection criteria are as follows:
1. Transcripts are first sorted by the length of their CDS (coding sequence), and the one with the longest CDS is selected.
2. If multiple transcripts have the same CDS length, then the one with the longest full transcript sequence is chosen.
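The two rules above amount to a sort with a tie-breaker followed by taking the top row per gene. A minimal base-R sketch on a toy table shows the logic; the gene names, transcript IDs, and lengths are invented for illustration, and this is not the code used inside the package.

```r
# Toy annotation table; all IDs and lengths are invented for illustration.
tx <- data.frame(
  gene = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  transcript_id = c("txA1", "txA2", "txA3", "txB1", "txB2"),
  cds_len = c(900, 1200, 1200, 0, 0),
  tx_len = c(2000, 1500, 1800, 300, 450),
  stringsAsFactors = FALSE
)

# Rule 1: longest CDS first. Rule 2: ties broken by longest transcript.
ordered <- tx[order(tx$gene, -tx$cds_len, -tx$tx_len), ]

# Keep the first (best-ranked) row of each gene.
longest <- ordered[!duplicated(ordered$gene), ]
longest$transcript_id
# "txA3" "txB2"
```

For geneA, txA2 and txA3 tie on CDS length, so the longer transcript txA3 wins even though txA1 is the longest transcript overall. For geneB, which has no CDS, the choice falls to transcript length alone.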
Here is an example:
lt <- get_longest_transcript(genome_file = "Homo_sapiens.GRCh38.dna.primary_assembly.fa",
                             gtf_file = "Homo_sapiens.GRCh38.94.gtf.gz",
                             output_file = "longest_trans.fa")
ltfa <- Biostrings::readDNAStringSet("longest_trans.fa")
ltfa
# DNAStringSet object of length 57169:
# width seq names
# [1] 1657 GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTC...GCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG ENST00000456328|D...
# [2] 712 GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCC...AACTTGGACTTCCAAGCCTCCAGAACTGTGAGGGATAAATGTAT ENST00000473358|M...
# [3] 138 GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTT...AACATTATTGGTTGTTTATCTGAGATTCAGAATTAAGCATTTTA ENST00000607096|M...
# [4] 840 GCGGTATCTAAATTTGTATTGATTGGACTTTCAAGCTCTTGGGA...CACCCCTGTCTAGGATCTACACATTAAGAAACAAAGACATGAAC ENST00000606857|O...
# [5] 1414 AGCTATCTGAATTTCTCCTTCTCCTAAAAATGCACATCCTATGA...AAAAAGTATATATTTCTATCTAATGTGTGTATCTAATTAACAGC ENST00000642116|O...
# ... ... ...
# [57165] 603 AAATCTGCTCCCGGGGGTATTCTTGACTTAAACAAGGTTGCAAC...AAGAAAATCCTCAGCAAAGTGAAGAATTGCTTGAAGTAAGCAAC ENST00000620795|A...
# [57166] 603 AAATCTGCTCCCGGGGGTATTCTTGACTTAAACAAGGTTGCAAC...AAGAAAATCCTCAGCAAAGTGAAGAATTGCTTGAAGTAAGCAAC ENST00000615362|A...
# [57167] 2404 GGCGGCTGGACGAGGACGCTCCGAGCCCAGCTCTCGAGAGTTCA...CGCCACTGCACTCCAGCCTGGGTGACAGAGCGAGACTCCGTCTC ENST00000617983|A...
# [57168] 1213 CGCGAGGCGCGCCGCGATCGGGGACTGTCCTAAGACGGGCGGGG...CTCTGTGTGACCCAGAGAAATAAAGATGCCTCAGTGTGGCCCGC ENST00000613204|A...
# [57169] 2405 GGTCTCACTCTGTTGCTGTCTTCACGGAGAGCAGGAGCAGAGGC...CTGGGGAGGCCTGCCTGGTCAATAAACCACTGTTCCTGCAGCTG ENST00000621424|A...
The function also returns detailed information for each selected transcript:
# check
head(lt)
# # A tibble: 6 × 10
# # Groups:   gene [6]
#   transcript_id   idnew                         utr5   cds  utr3 exonlen translen mstart mstop gene
#   <chr>           <chr>                        <dbl> <dbl> <dbl>   <int>    <dbl>  <dbl> <dbl> <chr>
# 1 ENST00000609567 ENST00000609567|hsa-mir-1253     0     0     0     105      105      0     0 hsa-mir-1253
# 2 ENST00000381638 ENST00000381638|ZZEF1          125  8883  2445   11456    11456    126  9008 ZZEF1
# 3 ENST00000617638 ENST00000617638|ZYXP1            0     0     0     118      118      0     0 ZYXP1
# 4 ENST00000322764 ENST00000322764|ZYX            345  1716   429    2493     2493    346  2061 ZYX
# 5 ENST00000294353 ENST00000294353|ZYG11B         145  2232  5713    8093     8093    146  2377 ZYG11B
# 6 ENST00000430329 ENST00000430329|ZYG11AP1         0     0     0     837      837      0     0 ZYG11AP1
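The columns are not documented here, but the values shown suggest how they relate. The sketch below copies the two protein-coding rows from the output above and checks the apparent relationships. The interpretation of mstart and mstop as 1-based positions of the first and last CDS nucleotides on the transcript is an assumption inferred from these numbers, not a statement from the package documentation.

```r
# Two protein-coding rows copied from the head(lt) output.
lt_example <- data.frame(
  gene = c("ZZEF1", "ZYX"),
  utr5 = c(125, 345),
  cds = c(8883, 1716),
  utr3 = c(2445, 429),
  translen = c(11456, 2493),
  mstart = c(126, 346),
  mstop = c(9008, 2061)
)

# Assumed interpretation: the CDS starts right after the 5' UTR ...
with(lt_example, mstart == utr5 + 1)     # TRUE TRUE
# ... and ends utr5 + cds nucleotides into the transcript.
with(lt_example, mstop == utr5 + cds)    # TRUE TRUE

# The three regions fall 3 nt short of the transcript length in both rows.
with(lt_example, translen - (utr5 + cds + utr3))  # 3 3
```

The 3-nucleotide remainder would be consistent with the stop codon being counted outside the CDS, as in GTF files where stop_codon is a separate feature, but this is an inference from two rows. Non-coding transcripts have cds, mstart, and mstop equal to 0, so a filter such as lt[lt$cds > 0, ] keeps only transcripts with an annotated CDS.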