1  Preprocessing

In this tutorial, we use a publicly available yeast Ribo-seq dataset (GEO accession GSE89704) from the publication "eIF5A Functions Globally in Translation Elongation and Termination" (Molecular Cell) to demonstrate the upstream preprocessing workflow.

1.1 Data Download

We retrieve the Aspera download link for each sample from SRA Explorer, then use the Aspera command-line client (ascp) on Linux to download the corresponding FASTQ files and rename each one to its sample name:

#!/usr/bin/env bash
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR500/004/SRR5008134/SRR5008134.fastq.gz . && mv SRR5008134.fastq.gz WT.1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR500/005/SRR5008135/SRR5008135.fastq.gz . && mv SRR5008135.fastq.gz WT.2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR500/006/SRR5008136/SRR5008136.fastq.gz . && mv SRR5008136.fastq.gz eIF5Ad.1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR500/007/SRR5008137/SRR5008137.fastq.gz . && mv SRR5008137.fastq.gz eIF5Ad.2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR500/008/SRR5008138/SRR5008138.fastq.gz . && mv SRR5008138.fastq.gz WT.1HS.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR533/004/SRR5335874/SRR5335874.fastq.gz . && mv SRR5335874.fastq.gz WT.2HS.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR533/005/SRR5335875/SRR5335875.fastq.gz . && mv SRR5335875.fastq.gz eIF5Ad.2HS.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR500/009/SRR5008139/SRR5008139.fastq.gz . && mv SRR5008139.fastq.gz eIF5Ad.1HS.fastq.gz
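
The eight commands above differ only in the run accession and the sample name, and the server paths follow one pattern for these 10-character accessions: the first six characters, then `00` plus the last digit. As a rough sketch, the same command lines could be generated from a small table. The helper name `ena_fastq_path` is hypothetical, the pattern is assumed to hold only for accessions of this length, and the loop only prints the commands so they can be inspected before being run.

```shell
# Hypothetical helper: build the ENA FASTQ path for a 10-character,
# single-end run accession such as SRR5008134.
ena_fastq_path() {
    acc=$1
    prefix=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR500
    last=$(printf '%s' "$acc" | cut -c10)      # last digit, e.g. 4
    printf 'vol1/fastq/%s/00%s/%s/%s.fastq.gz\n' "$prefix" "$last" "$acc" "$acc"
}

key="$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh"

# Dry run: print one download-and-rename command per sample.
while read -r acc name; do
    echo "ascp -QT -l 300m -P33001 -i $key era-fasp@fasp.sra.ebi.ac.uk:$(ena_fastq_path "$acc") . && mv $acc.fastq.gz $name.fastq.gz"
done <<'EOF'
SRR5008134 WT.1
SRR5008135 WT.2
SRR5008136 eIF5Ad.1
SRR5008137 eIF5Ad.2
SRR5008138 WT.1HS
SRR5335874 WT.2HS
SRR5335875 eIF5Ad.2HS
SRR5008139 eIF5Ad.1HS
EOF
```

Keeping the accession-to-sample mapping in one place makes it easier to spot a mislabeled replicate than scanning eight long command lines.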

1.2 Yeast Genome and Annotation Retrieval

We download the genome sequence (FASTA) and gene annotation (GTF) files for Saccharomyces cerevisiae from the Ensembl database and decompress them:

wget https://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz

wget https://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz

gunzip Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz 

gunzip Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz
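
Because the genome and the annotation are used together in later steps, it can be worth confirming that both files use the same sequence names. The sketch below is one way to do that with awk; the function name `missing_seqnames` is hypothetical, and it assumes that the sequence name is the first whitespace-delimited token of each FASTA header.

```shell
# Hypothetical helper: print sequence names used in the GTF (column 1)
# that do not occur as a FASTA header. Empty output means the files agree.
# $1: genome FASTA, $2: GTF annotation
missing_seqnames() {
    awk -F'\t' '
        NR == FNR {                       # first file: collect FASTA header names
            if (/^>/) {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)  # keep only the first token
                seen[name] = 1
            }
            next
        }
        /^#/ { next }                     # skip GTF comment lines
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}

# Usage (commented out; file names as downloaded above):
# missing_seqnames Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa Saccharomyces_cerevisiae.R64-1-1.112.gtf
```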

1.3 Constructing rRNA and Genome Indices

Ribosomal RNA (rRNA) sequences for most species are available in FASTA format from public databases such as NCBI or Ensembl. After downloading them, we build a Bowtie2 index from these sequences so that rRNA-derived reads can later be identified and removed:

# The output directory must exist before the index is written
mkdir -p sac_rRNA_index
bowtie2-build sac.rRNA.fasta sac_rRNA_index/sac_rRNA
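
If a ready-made rRNA FASTA file is not at hand, one possible route is to take the rRNA gene coordinates from the annotation and extract the sequences from the genome. The sketch below covers only the first half. It assumes an Ensembl-style GTF in which rRNA genes carry the attribute `gene_biotype "rRNA"`, and the function name `rrna_genes_to_bed` is hypothetical.

```shell
# Hypothetical helper: convert rRNA gene records of an Ensembl-style GTF
# into 6-column BED (0-based start, gene_id as the name, strand kept).
# $1: GTF annotation
rrna_genes_to_bed() {
    awk -F'\t' -v OFS='\t' '
        $3 == "gene" && $9 ~ /gene_biotype "rRNA"/ {
            id = "."
            if (match($9, /gene_id "[^"]+"/))
                id = substr($9, RSTART + 9, RLENGTH - 10)   # text between the quotes
            print $1, $4 - 1, $5, id, ".", $7
        }' "$1"
}

# Usage (commented out): write the intervals, then extract the sequences with
# a tool of your choice, for example bedtools getfasta.
# rrna_genes_to_bed Saccharomyces_cerevisiae.R64-1-1.112.gtf > sac.rRNA.bed
```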

Alignment tools such as Bowtie2 and HISAT2 provide pre-built genome indices for many commonly studied species on their official websites. These can be downloaded and used directly to save time. However, if your organism of interest is not included, you can generate the index manually from the reference genome FASTA file using the tool’s indexing function:

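# The output directory must exist before the indices are written
mkdir -p sac_genome_index

# Bowtie2 index (writes *.bt2 files)
bowtie2-build sac.genome.fasta sac_genome_index/sac_genome

# HISAT2 index (writes *.ht2 files, so the shared prefix does not clash)
hisat2-build sac.genome.fasta sac_genome_index/sac_genome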
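
Index building only needs to run once, so a small guard can skip it when the index files are already present. This is a minimal sketch: `has_index` is a hypothetical helper, and it relies on the convention that the first index file is named `<prefix>.1.bt2` for Bowtie2 and `<prefix>.1.ht2` for HISAT2, with an extra `l` appended for large indices.

```shell
# Hypothetical helper: succeed if the first index file for a prefix exists.
# $1: index prefix, $2: extension (bt2 for Bowtie2, ht2 for HISAT2)
has_index() {
    [ -e "$1.1.$2" ] || [ -e "$1.1.${2}l" ]   # "l" suffix marks a large index
}

# Usage (commented out; file names as in the commands above):
# has_index sac_genome_index/sac_genome bt2 || bowtie2-build sac.genome.fasta sac_genome_index/sac_genome
# has_index sac_genome_index/sac_genome ht2 || hisat2-build sac.genome.fasta sac_genome_index/sac_genome
```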

1.4 Removal of Adapter Contamination

To ensure high-quality input for downstream analysis, we first examine the raw FASTQ files using FastQC to evaluate overall sequence quality and to identify potential adapter contamination. If adapter sequences are present, we remove them using Cutadapt, a commonly used tool for trimming adapter sequences from high-throughput sequencing reads:

# Trim the 3' adapter and keep only reads of 20-35 nt after trimming
mkdir -p 1.trim-data
for i in WT.1 WT.2 eIF5Ad.1 eIF5Ad.2 WT.1HS WT.2HS eIF5Ad.1HS eIF5Ad.2HS
do
    cutadapt -j 15 -m 20 -M 35 \
             --match-read-wildcards \
             -a CTGTAGGCACCATCAAT \
             -o "1.trim-data/${i}.trim.fq.gz" "0.raw-data/${i}.fastq.gz"
done
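
After trimming, a quick look at the read-length distribution shows whether the 20-35 nt window set by `-m` and `-M` captured the expected footprint sizes. The following is a minimal sketch using only gzip, awk and sort; the function name `read_length_hist` is hypothetical.

```shell
# Hypothetical helper: print "length count" for every read length found
# in a gzip-compressed FASTQ file, sorted by length.
# $1: FASTQ file (.gz)
read_length_hist() {
    gzip -dc "$1" |
        awk 'NR % 4 == 2 { n[length($0)]++ }          # sequence lines only
             END { for (len in n) print len, n[len] }' |
        sort -n
}

# Usage (commented out; file name as produced by the loop above):
# read_length_hist 1.trim-data/WT.1.trim.fq.gz
```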

1.5 Removal of rRNA Contamination

In eukaryotes, ribosomal RNA (rRNA) can account for approximately 80%–90% of total RNA. Therefore, assessing the proportion of sequencing reads that align to rRNA sequences serves as an important quality control metric for evaluating the effectiveness of RNA library preparation. Many laboratories or sequencing service providers use commercial rRNA depletion kits to remove these abundant molecules prior to sequencing.

In this workflow, we use Bowtie2 to align the adapter-trimmed reads to the rRNA reference sequences and discard the reads that map to rRNA. Only the unmapped reads are retained for downstream analyses.

If more stringent filtering is required to eliminate other unwanted small RNA contaminants, such as tRNA, you can also download the corresponding reference sequences, build a Bowtie2 index, and filter out these reads in a similar manner.

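# Remove rRNA reads: keep only the reads that fail to align to the rRNA index
mkdir -p 2.rmrRNA-data
for i in WT.1 WT.2 eIF5Ad.1 eIF5Ad.2 WT.1HS WT.2HS eIF5Ad.1HS eIF5Ad.2HS
do
    # The rRNA alignments themselves are not needed, so the SAM output is discarded
    bowtie2 -p 20 -x ../index-data/sac-rRNA-index/Saccharomyces-cerevisiae-rRNA \
            --un-gz "2.rmrRNA-data/${i}.rmrRNA.fq.gz" \
            -U "1.trim-data/${i}.trim.fq.gz" \
            -S /dev/null
done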
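
Since the share of rRNA reads is a quality control metric in its own right, it can be estimated by counting reads before and after filtering. The sketch below is one simple way to do so; the function names `count_reads` and `rrna_percent` are hypothetical, and the result is the percentage of trimmed reads removed by the rRNA filter.

```shell
# Hypothetical helper: number of reads in a gzip-compressed FASTQ file.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Hypothetical helper: percentage of reads removed by the rRNA filter.
# $1: trimmed FASTQ (.gz), $2: rRNA-depleted FASTQ (.gz)
rrna_percent() {
    before=$(count_reads "$1")
    after=$(count_reads "$2")
    awk -v b="$before" -v a="$after" 'BEGIN {
        if (b == 0) { print "NA"; exit }   # avoid division by zero
        printf "%.1f\n", 100 * (b - a) / b
    }'
}

# Usage (commented out; file names as produced by the loops above):
# rrna_percent 1.trim-data/WT.1.trim.fq.gz 2.rmrRNA-data/WT.1.rmrRNA.fq.gz
```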

1.6 Alignment to the Reference Genome

After removing rRNA reads, we align the remaining clean reads to the yeast reference genome using HISAT2, a fast and memory-efficient spliced alignment tool optimized for high-throughput sequencing data:

for i in WT.1 WT.2 eIF5Ad.1 eIF5Ad.2 WT.1HS WT.2HS eIF5Ad.1HS eIF5Ad.2HS
do 
    hisat2 -p 20 -x ../index-data/sac-hisat2-index/sac \
           -k 1 -U 2.rmrRNA-data/${i}.rmrRNA.fq.gz \
           |samtools sort -@ 20 -o 3.map-data/${i}.sorted.bam
done
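
HISAT2 prints an alignment summary to standard error, which the loop above does not save. Assuming the summary were captured by appending something like `2> 3.map-data/${i}.hisat2.log` to the hisat2 command (the log file name is hypothetical), the overall alignment rate could be pulled out with a sketch like this:

```shell
# Hypothetical helper: print the overall alignment rate from a saved
# HISAT2 summary, i.e. the line ending in "overall alignment rate".
# $1: log file containing the captured summary
overall_rate() {
    awk '/overall alignment rate/ { print $1 }' "$1"
}

# Usage (commented out; assumes the logs were written as described above):
# for i in WT.1 WT.2; do echo "$i $(overall_rate "3.map-data/${i}.hisat2.log")"; done
```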

1.7 Alignment to the Transcriptome

Some analyses require mapping reads directly to the transcriptome rather than the genome. Ribo-seq studies are a typical case: determining the exact position of each read on a transcript, or assessing 3-nucleotide periodicity, is simpler in transcript coordinates.

To perform transcriptome alignment, you first need to download the transcript sequences for the species of interest and build a corresponding index file. The reads can then be aligned to the transcriptome using an appropriate aligner (e.g., Bowtie2 or HISAT2), enabling more accurate downstream analysis in transcript-centric workflows.

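# Use a separate file name (.trans.sorted.bam, chosen here for illustration)
# so that the genome BAM files from section 1.6 are not overwritten
for i in WT.1 WT.2 eIF5Ad.1 eIF5Ad.2 WT.1HS WT.2HS eIF5Ad.1HS eIF5Ad.2HS
do 
    hisat2 -p 20 -x ../index-data/sac-hisat2-trans-index/sac \
           -k 1 -U 2.rmrRNA-data/${i}.rmrRNA.fq.gz \
           | samtools sort -@ 20 -o 3.map-data/${i}.trans.sorted.bam
done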
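
Once reads are in transcript coordinates, 3-nucleotide periodicity can be checked roughly by tabulating the reading frame of each read's 5' end relative to the CDS start. The sketch below is a simplified illustration, not a full periodicity analysis. It assumes a hypothetical two-column table of transcript IDs and 1-based CDS start positions, counts only forward-strand alignments (SAM flag 0), and applies no P-site offset.

```shell
# Hypothetical helper: count read 5' ends per reading frame (0, 1, 2).
# $1: tab-separated table "transcript_id <TAB> CDS start (1-based)"
# stdin: SAM records from a transcriptome alignment
frame_counts() {
    awk -F'\t' '
        NR == FNR { cds[$1] = $2; next }     # first file: CDS start per transcript
        /^@/ { next }                        # skip SAM header lines
        $2 == 0 && ($3 in cds) {             # forward-strand alignments only
            f = (($4 - cds[$3]) % 3 + 3) % 3 # non-negative frame, also upstream of the CDS
            n[f]++
        }
        END { for (f = 0; f < 3; f++) print f, n[f] + 0 }
    ' "$1" -
}

# Usage (commented out; cds_start.tsv and the BAM name are hypothetical):
# samtools view 3.map-data/WT.1.trans.sorted.bam | frame_counts cds_start.tsv
```

A strong excess of reads in one frame is what periodic ribosome footprints would be expected to show; roughly equal counts across the three frames suggest the signal is weak or the offset needs adjusting.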