FALCON Unzip based pipeline integrating DAmasker and boosting Unzip speed
To run the pipeline you need a complete FalconUnzip installation, which we assume is placed in /path/to/falcon-verXX, and a smrtlink distribution appropriate for the sequencing chemistry used, which we assume is installed in /path/to/smrtlink.
Note: The pipeline performs best on a cluster, where the single processes of the different steps can run in parallel. Nevertheless, a single machine can run multiple processes using GNU `parallel`[1]. Steps that benefit from this are marked "Parallelizable", and approximate RAM and core usage is given for a single process.
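The generated `run_jobs.*` files are flat lists of independent shell commands, one per line, so a step marked "Parallelizable" can be run with GNU parallel, e.g. `parallel --jobs 4 < scripts/run_jobs.01.TANmask.01.OVL`. A minimal sketch with a hypothetical three-command jobs file, using `xargs -P` as a portable fallback when parallel is unavailable:

```shell
# Hypothetical jobs file: one independent shell command per line
# (substitute the real scripts/run_jobs.* file).
printf '%s\n' 'echo job1' 'echo job2' 'echo job3' > jobs.txt

# Run up to 4 commands concurrently; completion order is nondeterministic.
# GNU parallel equivalent:  parallel --jobs 4 < jobs.txt
xargs -P 4 -I CMD sh -c 'CMD' < jobs.txt > out.txt

sort out.txt
```

Tune the job count to the per-process core counts given below so the machine is not oversubscribed.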
Remember to set the paths to the Falcon libraries and environment:
```shell
export PYTHONUSERBASE=/path/to/falcon-verXX/
export LD_LIBRARY_PATH=/path/to/falcon-verXX/lib:${LD_LIBRARY_PATH}
export PATH=/path/to/smrtlink/install/smrtlink-release_X.XX/bundles/smrttools/install/smrttools-release_X.XX/smrtcmds/bin/:/path/to/falcon-verXX/bin:${PATH}
```
Convert the raw .bax.h5 files to BAM. This step is necessary for Arrow polishing, as Blasr mapping files produced from other input formats will not be accepted:
```shell
for name in $(find -name "*bax.h5" | sed 's:\..\.bax\.h5$::' | sort -u); do echo $name ; bax2bam $name.1.bax.h5 $name.2.bax.h5 $name.3.bax.h5; done
```
```shell
mkdir 0-rawreads
cd 0-rawreads
fasta2DB -v raw_reads $reads
DBsplit -x500 -s100 raw_reads
LB=$(cat raw_reads.db | LD_LIBRARY_PATH= awk '$1 == "blocks" {print $3}')
echo $LB
```
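The block count is read from the ASCII header of the split database. A mock header (illustrative content only; a real raw_reads.db is produced by fasta2DB/DBsplit and contains more fields) shows what the awk extracts:

```shell
# Mock DAZZ_DB ASCII header; only the "blocks" line matters here.
printf 'files =         1\nblocks =        12\nsize =        100\n' > mock.db

# Same extraction as above: the third field of the "blocks" line.
LB=$(awk '$1 == "blocks" {print $3}' mock.db)
echo $LB   # -> 12
```

The `LD_LIBRARY_PATH=` prefix in the real command only shields awk from the Falcon libraries exported earlier; it does not change the extraction.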
Note: DAligner merging accepts at most 250 chunks, so tune the `-s` parameter so that `$LB` <= 500, as this pipeline is designed for a maximum of 2x250 batches.
```shell
CUTOFF=3000
echo -n $CUTOFF > length_cutoff
mkdir scripts
cd scripts
ln -s ../raw_reads.db
HPC.TANmask -T8 -frun_jobs.01.TANmask raw_reads 1-$LB
HPC.REPmask -T8 -g2 -c15 -mtan -frun_jobs.02.REPmask raw_reads 1-$LB
HPC.daligner -d -v -B128 -D512 -t30 -e0.7 -M40 -l1000 -k16 -h64 -w7 -s1000 -mtan -mrep2 -H$CUTOFF -frun_jobs.03.Daligner -T8 raw_reads 1-$LB
for block_id in $(seq 1 $LB); do echo "LA4Falcon -H$CUTOFF -fso raw_reads.db raw_reads.${block_id}.las | fc_consensus --n_core 8 --output_multi --min_idt .70 --min_cov 4 --max_n_read 400 > cns_${block_id}.fasta "; done > run_jobs.04.Call_consensus.01.call
cd ..
```
- `bash scripts/run_jobs.01.TANmask.01.OVL`
  - Parallelizable: task_mem="10G", task_cores="8"
- `bash scripts/run_jobs.01.TANmask.02.SORT`
  - Parallelizable: task_mem="10G", task_cores="1"
- [Optional Checkpoint] `bash scripts/run_jobs.01.TANmask.03.CHECK.OPT`
- `bash scripts/run_jobs.01.TANmask.04.RM`
- `bash scripts/run_jobs.01.TANmask.05.MASK`
  - Parallelizable: task_mem="10G", task_cores="8"
- `Catrack -v raw_reads.db tan`
- `bash scripts/run_jobs.01.TANmask.06.RM`
- `bash scripts/run_jobs.02.REPmask.01.OVL`
  - Parallelizable: task_mem="35G", task_cores="8"
- `bash scripts/run_jobs.02.REPmask.02.SORT`
  - Parallelizable: task_mem="10G", task_cores="1"
- [Optional Checkpoint] `bash scripts/run_jobs.02.REPmask.03.CHECK.OPT`
- `bash scripts/run_jobs.02.REPmask.04.RM`
- `bash scripts/run_jobs.02.REPmask.05.MERGE`
  - Parallelizable: task_mem="15G", task_cores="1"
- [Optional Checkpoint] `bash scripts/run_jobs.02.REPmask.06.CHECK.OPT`
- `bash scripts/run_jobs.02.REPmask.07.RM`
- `bash scripts/run_jobs.02.REPmask.08.MASK`
  - Parallelizable: task_mem="15G", task_cores="8"
- `Catrack -v raw_reads.db rep2`
- `bash scripts/run_jobs.02.REPmask.09.RM`
- `bash scripts/run_jobs.03.Daligner.00.MKDIR`
- `bash scripts/run_jobs.03.Daligner.01.OVL`
  - Parallelizable: task_mem="35G", task_cores="8"
- `bash scripts/run_jobs.03.Daligner.02.SORT`
  - Parallelizable: task_mem="10G", task_cores="1"
- [Optional Checkpoint] `bash scripts/run_jobs.03.Daligner.03.CHECK.OPT`
- `bash scripts/run_jobs.03.Daligner.04.RM`
- [OPTIONAL] If 250 < number of chunks (`$LB`) <= 500, split the merging processes:
  `sed -i 's:\(.*raw_reads\.\)\([0-9]*\) \(.*\)250 \(.*\):\1\2\.1 \3250; \1\2\.2 \4; \1\2 raw_reads\.\2\.1 raw_reads\.\2\.2:;s: L[0-9]*.[0-9]*.;:;:' scripts/run_jobs.03.Daligner.05.MERGE`
- `bash scripts/run_jobs.03.Daligner.05.MERGE`
  - Parallelizable: task_mem="15G", task_cores="1"
- [Optional Checkpoint] `bash scripts/run_jobs.03.Daligner.06.CHECK.OPT`
- `bash scripts/run_jobs.03.Daligner.07.RM.OPT`
- `bash scripts/run_jobs.04.Call_consensus.01.call`
  - Parallelizable: task_mem="35G", task_cores="8"
```shell
ls -1 $(pwd)/cns_*.fasta > input_preads.fofn
#mkdir safe_copy
#for file in $(cat input_preads.fofn); do echo $file; cp $file safe_copy/$file; done
touch rdb_build_done
mkdir ../1-preads_ovl
while read fn; do fasta2DB -v ../1-preads_ovl/preads $fn; done < input_preads.fofn
cd ../1-preads_ovl
DBsplit -x500 -s100 preads
LB=$(cat preads.db | LD_LIBRARY_PATH= awk '$1 == "blocks" {print $3}')
mkdir scripts
cd scripts/
ln -s ../preads.db
HPC.TANmask -T8 -frun_jobs.06.TANmask_corr preads 1-$LB
HPC.REPmask -T8 -g2 -c15 -mtan -frun_jobs.07.REPmask_corr preads 1-$LB
cd ..
```
- `bash scripts/run_jobs.06.TANmask_corr.01.OVL`
  - Parallelizable: task_mem="16G", task_cores="8"
- `bash scripts/run_jobs.06.TANmask_corr.02.SORT`
  - Parallelizable: task_mem="10G", task_cores="1"
- [Optional Checkpoint] `bash scripts/run_jobs.06.TANmask_corr.03.CHECK.OPT`
- `bash scripts/run_jobs.06.TANmask_corr.04.RM`
- `bash scripts/run_jobs.06.TANmask_corr.05.MASK`
  - Parallelizable: task_mem="10G", task_cores="8"
- `Catrack -v preads.db tan`
- `bash scripts/run_jobs.06.TANmask_corr.06.RM`
- `bash scripts/run_jobs.07.REPmask_corr.01.OVL`
  - Parallelizable: task_mem="35G", task_cores="8"
- `bash scripts/run_jobs.07.REPmask_corr.02.SORT`
  - Parallelizable: task_mem="10G", task_cores="1"
- [Optional Checkpoint] `bash scripts/run_jobs.07.REPmask_corr.03.CHECK.OPT`
- `bash scripts/run_jobs.07.REPmask_corr.04.RM`
- `bash scripts/run_jobs.07.REPmask_corr.05.MERGE`
  - Parallelizable: task_mem="15G", task_cores="1"
- [Optional Checkpoint] `bash scripts/run_jobs.07.REPmask_corr.06.CHECK.OPT`
- `bash scripts/run_jobs.07.REPmask_corr.07.RM`
- `bash scripts/run_jobs.07.REPmask_corr.08.MASK`
  - Parallelizable: task_mem="15G", task_cores="8"
- `Catrack -v preads.db rep2`
- `bash scripts/run_jobs.07.REPmask_corr.09.RM`
```shell
HPC.daligner -v -B128 -D512 -M40 -t60 -k20 -h256 -e.96 -l1000 -s100 -mtan -mrep2 -T8 -H10000 preads 1-$LB >| run_jobs.sh
mkdir preads-fofn-abs
echo "echo NOOP preads" > preads-fofn-abs/noop.sh
touch preads-fofn-abs/run.sh.done
cp ../0-rawreads/input_preads.fofn preads-fofn-abs/
touch prepare_pdb.sh
touch pdb_build_done
touch ../0-rawreads/rdb_build_done
```
Note: Remember to add the `-mtan -mrep2` flags to the DAligner options in order to take advantage of the repeat masking.
- Falcon must be run starting from the corrected reads. In the General section set:
  ```
  input_fofn = 0-rawreads/input_preads.fofn
  input_type = preads
  ```
- Add the `-mtan -mrep2` flags to the DAligner options to use the repeat masking information.
- Run Falcon:
  ```shell
  $PYTHONUSERBASE/bin/fc_run.py fc_run.cfg
  ```
Note: An example configuration file for running Falcon in a SLURM environment, with properly set configuration parameters, is available in the repository as fc_run.masked.cfg.
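For orientation, a minimal fc_run.cfg fragment covering the settings discussed above; the daligner flags shown here mirror the HPC.daligner options used earlier in this pipeline, and all values are illustrative rather than prescriptive:

```ini
[General]
# start from the error-corrected reads built above
input_fofn = 0-rawreads/input_preads.fofn
input_type = preads
# -mtan -mrep2 make daligner reuse the TANmask/REPmask tracks
ovlp_HPCdaligner_option = -v -B128 -M40 -t60 -k20 -h256 -e.96 -l1000 -s100 -mtan -mrep2
```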
- Extract the reads mapping on the p_ctg contigs:
  ```shell
  mkdir -p 3-unzip/reads
  python -m falcon_kit.mains.get_read_ctg_map
  python -m falcon_kit.mains.rr_ctg_track
  python -m falcon_kit.mains.pr_ctg_track
  cd 3-unzip/reads
  mkdir split
  cd split
  awk 'BEGIN {n_seq=0;count=0} /^>/ {if(n_seq%5000==0){file=sprintf("%010d.fa",count);count++} print >> file; n_seq++; next;} { print >> file; }' raw.fasta
  ```
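The one-liner above opens a new zero-padded `%010d.fa` chunk every 5000 sequences; note that `>>` appends, so stale `*.fa` chunks must be removed before re-running. A tiny sketch of the same logic with a mock three-record FASTA and a chunk size of 2:

```shell
printf '>a\nAC\n>b\nGT\n>c\nTT\n' > demo.fasta
rm -f 0000000000.fa 0000000001.fa   # >> appends: clear stale chunks first

# Same splitter as above, with 2 sequences per chunk instead of 5000.
awk 'BEGIN {n_seq=0;count=0} /^>/ {if(n_seq%2==0){file=sprintf("%010d.fa",count);count++} print >> file; n_seq++; next;} { print >> file; }' demo.fasta

cat 0000000000.fa   # records a and b
cat 0000000001.fa   # record c
```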
```shell
basedir=$(realpath ../../../)
for file in $(find ./ -name "*.fa" | sort) ; do name=$(basename $file .fa); mkdir -p $name; realpath $file > ${name}/input.fofn ; echo "python /path/to/falcon-verXX/lib/python2.7/site-packages/falcon_kit/mains/fetch_reads.mod.py --base_dir $basedir --fofn ${name}/input.fofn --out_dir ${name}/ --min_ctg_lenth 2000" ; done > fetch.sh
```
- Use the fetch_reads.mod.py script to extract the reads of one reference sequence only. The script is a modified version of the FalconUnzip fetch_reads.py script and is placed in the same directory (/path/to/falcon-verXX/lib/python2.7/site-packages/falcon_kit/mains/).
- Retrieve the mapping reads:
  `bash fetch.sh`
  - Parallelizable: task_mem="2G", task_cores="1"
- Extract the contig list:
  ```shell
  find ./ -name "ctg_list" -exec cat {} \; | sort -u > ../ctg_list
  cd ..
  ```
- Create the reference fasta files:
  ```shell
  python /path/to/falcon-verXX/lib/python2.7/site-packages/falcon_kit/mains/fetch_reads.ref_only.py --base_dir $(realpath ../../) --fofn ../../input.fofn --out_dir $(pwd) --min_ctg_lenth 2000
  ```
- Use the fetch_reads.ref_only.py script to extract only the reference sequences. The script is a modified version of the FalconUnzip fetch_reads.py script that skips the read binning; it is placed in the same directory (/path/to/falcon-verXX/lib/python2.7/site-packages/falcon_kit/mains/).
- Merge the fetching output:
  ```shell
  for name in $(cat ctg_list); do echo $name; find split/ -name ${name}_reads.fa -exec cat {} \; > ${name}_reads.fa ; done
  cat split/*/unassigned_reads.fa > unassigned_reads.fa
  ```
- Create the end-of-pipeline control files:
  ```shell
  touch track_reads_done.exit
  touch track_reads_done
  touch run.sh.done
  ```
- Add the auxiliary files (templates can be found in this repository):
  - Edit the path to the falcon directories (`/path/to/falcon-verXX/bin`).
  - Edit the path to the assembly folders (`/full/path/to`) to point to the actual directories.
- Run Unzip:
  ```shell
  fc_unzip.py fc_unzip.cfg
  ```
- An example of the configuration file fc_unzip.cfg is available in this repository.
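A minimal illustrative sketch of fc_unzip.cfg; the section and key names follow the standard FALCON-Unzip configuration, while the paths, queue settings, and concurrency values are placeholders to adapt:

```ini
[General]
job_type = SLURM

[Unzip]
input_fofn = input.fofn
input_bam_fofn = input_bam.fofn
smrt_bin = /path/to/smrtlink/install/smrtlink-release_X.XX/bundles/smrttools/install/smrttools-release_X.XX/smrtcmds/bin/
unzip_concurrent_jobs = 12
quiver_concurrent_jobs = 12
```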
[1] O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.