Skip to content

Latest commit

 

History

History
370 lines (293 loc) · 16.5 KB

README.md

File metadata and controls

370 lines (293 loc) · 16.5 KB

EARRINGS

EARRINGS is an efficient and accurate adapter trimmer that entails no a priori adapter sequences for both single- and paired-end NGS reads.

Information

EARRINGS is a more powerful and capable successor of PEAT, which is now deprecated.

Like PEAT, EARRING adapts skewer for single-end read trimming.

Requirement

  • GNU g++-8 or higher (Gtest works with GNU g++-8 only )
  • cmake 3.10.0 or higher to build EARRINGS
  • python 3.7 or higher as well as numpy and pandas packages for the benchmarking

Build

# In root directory of repository
> mkdir build
> cd build
> cmake .. -DBUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../stage
> cmake --build . --target install
> ./EARRINGS -h

Q&A

Q: Failed while compiling and get the error message about fatal: unable to access 'https://github.com/PhoebeWangintw/hunter.git/': Could not resolve host: github.com;.
A: Please make sure your connection to ```github.com`` is clear, especially for the user who lives inside the GFW of China.

Q: How do I change my default GUN GCC compiler?
A: To change default gcc and g++ version with 8.0 as example:\

# This environment setting works in the current session only.
export CC=/Path/Locate/To/The/gcc-8
export CXX=/Path/Locate/To/The/g++-8

Read Before Execution

This note is written for Single-End mode usage, including two other special modes: smallRNA and skewer.

Residual of 5' end adapter

The Adapter Trimmer is typically used to process the 3' end adapter, but in some datasets, we have observed remnants of the 5' end adapter. In Paired-End mode, it will automatically handle both 5' and 3' adapters through cross-comparison. However, in Single-End mode, accurately identifying the 3' end adapter may be challenging due to the presence of residual 5' end adapter. Therefore, when using Single-End mode, it is essential to verify whether there are remnants of the 5' end adapter. If so, address this issue before running EARRINGS.

TDMD as 3' end adapter

When using the smallRNA mode, your reads are generally short and EARRINGS usually sets the seed length between 18-25. A short seed length significantly affects accurate adapter detection, especially when your data is rich in TDMD (Target-directed miRNA degradation). In TDMD, there may be a large amount of tailing in the small RNA reads that cannot be aligned on the genome, increasing the likelihood of being misjudged as part of the adapter. Therefore, different seed lengths alignments are performed to determine which adapter provides the best trimming result when using the smallRNA mode. However, not every dataset is ideal; thus it's important to confirm whether tails are being misjudged as adapters when processing small RNA. Failure to address this issue can lead to loss of TDMD information and biased results.

Execution

There are 5 modes to execute EARRINGS: build, single, paired, smallRNA and skewer.

Build mode generates an index for the source reference sequence (e.g., the entire genome, a chromosome, a collection of panel genes, GreenGenes for meta-genomics etc.) of the single-end reads. The index is not used for trimming paired-end reads.

Single mode and paired mode are used for single-end reads and paired-end reads respectively. Both of these two modes can auto-detect file types. (.fa/.fq, and their .gz or .bam/.ubam)

Build

Before conducting single-end adapter trimming, one has to prebuild the index once for a specific reference which is the source of the target reads.

# ./EARRINGS build -r [ref_path] -p [index_prefix]
> ./EARRINGS build -r hg38_chr1.fa -p earrings_hg38_chr1
> ./EARRINGS build -r 16S_rRNA_panel.fa -p 16S_rRNA_panel

Build mode parameters

  • Required
    • -r [ --ref_path ] arg
      Path to the reference sequence, which is the source of reads.
    • -p [ --index_prefix ] arg
      An user-defined index prefix for index table.
  • Optional
    • -h [ --help ]
      Display help message and exit.

Single-End

In single-end mode, EARRINGS first detects adapter then feeds the detected adapter to skewer.

# ./EARRINGS single -p [index_prefix] -1 [input_file]
> ./EARRINGS single -p earrings_idx -1 input1.fq
> ./EARRINGS single -p earrings_idx -1 input1.fq.gz

Single-End mode parameters

  • Required
    • -p [ --index_prefix ] arg
      The index prefix for pre-built index table.
    • -1 [ --input1 ] arg
      The file path of Single-End reads.
  • Optional
    • Utils
      • -h [ --help ]
        Display help message and exit.
      • -t [ --thread ] arg (=1)
        The number of threads used to run the program.
    • Input / Output
      • -o [ --output ] arg (=trimmed_se)
        The file prefix of Single-End FastQ output.
    • Extract seeds / Alignment
      • -d [ --seed_len ] arg (=50)
        The first --seed_len bases are seen as seed and allows 1 mismatch at most, or do not allow any mismatch if --no_mismatch is set. The sequence follows first mismatch out of the seed portion will be reported as a tail.
        It's recommended to set this to 18 for very short reads like miRNA, otherwise, it is recommended to set to 50.
      • -e [ --no_mismatch ]
        By default, EARRINGS can tolerate 1 error base at most if be set as true, this flag can disable this mismatch toleration mechenism.
      • -M [ --max_align ] arg (=0)
        Maximum number of candidates used in seed finding stage, 0 means unlimited.
    • Assemble adapter
      • -f [ --prune_factor ] arg (=0.03)
        Prune factor used when assembling adapters using the de-brujin graph. Kmer frequency lower than this value will be skipped.
      • --sensitive
        By default, minimum number of kmers must exceed 10 during assembly adapters. However, if user have confidence that the dataset contains adapters, sensitive mode would be more suitable.
        Under sensitive mode, minimum number of kmers (--prune_factor) would not be restricted.
    • Trimming
      • -m [ --min_length ] arg (=0)
        Skip the read if the length of the read is less than --min_length after trimming.
    • Adapter setting
      • -a [ --adapter1 ] arg (=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC)
        Alternative adapter if auto-detect mechanism fails.
      • -u [ --UMI ]
        Estimate the size of UMI sequences, results will be printed to console by default.

Paired-End

# ./EARRINGS paired -1 [input1] -2 [input2]
> ./EARRINGS paired -1 input1.fq -2 input2.fq
> ./EARRINGS paired -1 input1.fq.gz -2 input2.fq.gz

Paired-end mode parameters

  • Required
    • -1 [ --input1 ] arg
      The Paired-end reads input file 1.
    • -2 [ --input2 ] arg
      The Paired-end reads input file 2, note that if input file is .bam/.ubam, this parameter has no function.
  • Optional
    • Utils
      • -h [ --help ]
        Display help message and exit.
      • -t [ --thread ] arg (=1)
        The number of threads used to run the program.
    • Input / Output
      • -o [ --output ] arg (=trimmed_pe)
        The Paired-End FastQ output file prefix.
    • Assemble adapter
      • -f [ --prune_factor ] arg (=0.03)
        Prune factor used when assembling adapters using the de Bruijn graph. Kmer frequency lower than the prune factor will be skipped.
      • --sensitive
        By default, minimum number of kmers must exceed 10 during assembly adapters. However, if user have confidence that the dataset contains adapters, sensitive mode would be more suitable.
        Under sensitive mode, minimum number of kmers (--prune_factor) would not be restricted.
    • Trimming
      • -m [ --min_length ] arg (=0)
        Skip the read if the length of the read is less than this value.
      • -M [ --rc_thres ] arg (=0.7)
        Mismatch threshold applied in reverse complement scan.
      • -s [ --ss_thres ] arg (=0.9)
        Mismatch threshold applied in gene portion check.
      • -S [ --as_thres ] arg (=0.8)
        Mismatch threshold applied in adapter portion check.
    • Adapter setting
      • -a [ --adapter1 ] arg (=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC)
        Alternative adapter 1 if auto-detect mechanism fails.
      • -A [ --adapter2 ] arg (=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA)
        Alternative adapter 2 if auto-detect mechanism fails.

Small-RNA

In the special small RNA single-end mode, EARRINGS first detects the adapter by trying different seed lengths and then feeds the detected adapter to skewer (default as sensitive mode).

# ./EARRINGS smallRNA -p [index_prefix] -1 [input_file]
> ./EARRINGS smallRNA -p earrings_idx -1 input1.fq
> ./EARRINGS smallRNA -p earrings_idx -1 input1.fq.gz

Special Small-RNA Single-End mode parameters

  • Required
    • -p [ --index_prefix ] arg
      The index prefix for pre-built index table.
    • -1 [ --input1 ] arg
      The file path of Single-End reads.
  • Optional
    • Utils
      • -h [ --help ]
        Display help message and exit.
      • -t [ --thread ] arg (=1)
        The number of threads used to run the program.
    • Input / Output
      • -o [ --output ] arg (=trimmed_se)
        The file prefix of Single-End FastQ output.
    • Extract seeds / Alignment
      • -s [ --min_seed_len ] arg (=21)
        Use the same parameters as --seed_len in Single-End mode, but set the minimum seed length for the range to try different seed lengths to detect adapters for small RNA reads.
        It's recommended to set this from 18 to 25 for very short reads like miRNA.
      • -d [ --max_seed_len ] arg (=25)
        Use the same parameters as --seed_len in Single-End mode, but set the maximum seed length for the range to try different seed lengths to detect adapters for small RNA reads.
        It's recommended to set this from 18 to 25 for very short reads like miRNA.
      • -e [ --no_mismatch ]
        By default, EARRINGS can tolerate 1 error base at most if be set as true, this flag can disable this mismatch toleration mechenism.
      • -M [ --max_align ] arg (=0)
        Maximum number of candidates used in seed finding stage, 0 means unlimited.
    • Trimming
      • -m [ --min_length ] arg (=0)
        Skip the read if the length of the read is less than --min_length after trimming.
    • Adapter setting
      • -a [ --adapter1 ] arg (=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC)
        Alternative adapter if auto-detect mechanism fails.
      • -u [ --UMI ]
        Estimate the size of UMI sequences, results will be printed to console by default.

Skewer

In the special Skewer single-end mode, EARRINGS will not automatically detect any adapter. It will only use the user-provided adapter for trimming via Skewer (default as sensitive mode).

# ./EARRINGS skewer -p [index_prefix] -1 [input_file]
> ./EARRINGS skewer -1 input1.fq -a adapter_seq
> ./EARRINGS skewer -1 input1.fq.gz -a adapter_seq

Special Skewer Single-End mode parameters

  • Required
    • -1 [ --input1 ] arg
      The file path of Single-End reads.
    • -a [ --adapter ] arg
      Adapter squence for trimming.
  • Optional
    • Utils
      • -h [ --help ]
        Display help message and exit.
      • -t [ --thread ] arg (=1)
        The number of threads used to run the program.
    • Input / Output
      • -o [ --output ] arg (=trimmed_se)
        The file prefix of Single-End FastQ output.
    • Trimming
      • -m [ --min_length ] arg (=0)
        Skip the read if the length of the read is less than --min_length after trimming.

Run Simulation

The simulation is carried out using a modified version of pIRS (profile based Illumina pair-end Reads Simulator).
Please build pIRS first before using it:

> cd simulator/pIRS
> make

pIRS simulates Illumina PE reads from a reference genome. In all the benchmarking except real data benchmarking, we use GRCh38 chr1 as reference genome to simulate reads.

One can run the simulation by:

# python3 script.py [seed] -r [ref_genome] -n [read_num] -m [insert_mean] -std [insert_td] -ad1 [ad1] -ad2 [ad1]
# To generate reads without adapters:
> python3 script.py 0 -r hg38_chr1.fa -n 10000 -m 150 -std 50 -ad1 "" -ad2 ""
# To generate reads with default adapters:
> python3 script.py 0 -r hg38_chr1.fa -n 10000 -m 150 -std 50

Run Benchmarking

Before running benchmarking, please install all the prerequisites and set the locations of all the executions to $PATH:

  1. AdapterRemoval ver. 2.3.0
  2. skewer ver. 0.2.2
  3. Cutadapt ver. 2.4
  4. AKATrim ver. 1.3.3 (no source link now)
  5. Trimmomatic ver. 0.39
  6. SeqPurge ver. 2019_11
  7. fastp ver. 0.20.0 - commit 6ff0ffa
  8. atropos ver. 1.1.21 - commit 2b15c77
  9. PEAT ver. 1.2.5
  10. python ver. 3.7 or higher
  11. picard 2.23.0
  12. our modified version of pIRS (see Run Simulation part)

For speed, memory, performance and adapter benchmarking, please download the first chromosome of GRCh38 from UCSC genome browser. For running real data benchmarking, please download the following four datasets and their corresponding reference genome(.fa) from ncbi.

  1. SRR529095: RNA IP-Seq of Argonaute proteins from Homo sapiens
  2. SRR014866: miRNA of C.elegans
  3. SRR330569: RNA-Seq of Gonads and Carcasses in D. simulans and D. pseudoobscura
  4. SRR5000681: ATAC-Seq of Early Embryo replicate in C.elegans

Also, don't forget to install the aligners:

  1. Bowtie2 v2.4.1
  2. HISAT2 v2.2.0

After that, modify path-related variables in path.py to the path where the downloaded datasets were placed. Then prebuild indices for the aligners and EARRINGS by running:

> python3 build_index.py

Run real data benchmarking:

> python3 benchmark_real_data.py

Run speed benchmarking:

> python3 benchmark_speed.py

Run memory benchmarking:

> python3 benchmark_mem_usage.py

Run performance benchmarking:

> python3 benchmark_performance.py

Run adapter benchmarking:

> python3 benchmark_adapter.py

Reference

  1. Li, Y.-L., Weng, J.-C., Hsiao, C.-C., Chou, M.-T., Tseng, C.-W., & Hung, J.-H. (2015). PEAT: an intelligent and efficient paired-end sequencing adapter trimming algorithm. BMC Bioinformatics, 16(Suppl 1), S2. doi:10.1186/1471-2105-16-S1-S2
    [paper] [github]
  2. Jiang, H., Lei, R., Ding, S.W. and Zhu, S. (2014) Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics, 15, 182.
    [paper] [github]
  3. Hu, X., et al. (2012) pIRS: Profile-based Illumina pair-end reads simulator, Bioinformatics, 28, 1533-1535.
    [paper] [github]
  4. Chou, M.-T., Han, B. W., Hsiao, C.-P., Zamore, P. D., Weng, Z., and Hung, J.-H. (2015). Tailor: a computational framework for detecting non-templated tailing of small silencing RNAs. Nucleic Acids Res. 43, e109.
    [paper] [github]

Citation

Wang et al. EARRINGS: An Efficient and Accurate Adapter Trimmer Entails No a Priori Adapter Sequences. Bioinformatics. Accepted.
[paper] [github]

Contact

Jui-Hung Hung <[email protected]>

Ting-Hsuan Wang <[email protected]>

Cheng-Ching Huang <[email protected]>