postprocess_variants: Found multiple file patterns in input filename space #818

Open
MiWitt opened this issue May 8, 2024 · 16 comments

@MiWitt

MiWitt commented May 8, 2024

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.6.1/docs/FAQ.md:

Describe the issue:
The postprocess_variants step fails with the following error message:
ValueError: ('Found multiple file patterns in input filename space: ', './call_variants_output.tfrecord.gz')

Setup

  • Operating system: CentOS Linux 7 (Core)
  • DeepVariant version: 1.6.1
  • Installation method (Docker, built from source, etc.): singularity
  • Type of data: PacBio Sequencing

Steps to reproduce:

  • Command:
  • Error trace:
    Traceback (most recent call last):
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1419, in <module>
    app.run(main)
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1300, in main
    sample_name = get_sample_name()
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1203, in get_sample_name
    _, record = get_cvo_paths_and_first_record()
    File "/tmp/Bazel.runfiles_t3t5ek8u/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1179, in get_cvo_paths_and_first_record
    raise ValueError(
    ValueError: ('Found multiple file patterns in input filename space: ', './call_variants_output.tfrecord.gz')

Does the quick start test work on your system?
Please test with https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-quick-start.md.
Is there any way to reproduce the issue by using the quick start?
???

Any additional context:
Yes. I can change the parameter "--infile" of the postprocess_variants.py call from "./call_variants_output.tfrecord.gz" to "./call_variants_output@1.tfrecord.gz" and it works. However, the call of postprocess_variants.py is auto-generated by "/opt/deepvariant/bin/run_deepvariant". The error does not occur for every sample ...

Directory content of intermediate_results_dir after the error occurred:
call_variants.log
call_variants_output-00000-of-00001.tfrecord.gz
gvcf.tfrecord-00000-of-00008.gz
gvcf.tfrecord-00001-of-00008.gz
gvcf.tfrecord-00002-of-00008.gz
gvcf.tfrecord-00003-of-00008.gz
gvcf.tfrecord-00004-of-00008.gz
gvcf.tfrecord-00005-of-00008.gz
gvcf.tfrecord-00006-of-00008.gz
gvcf.tfrecord-00007-of-00008.gz
make_examples.log
make_examples.tfrecord-00000-of-00008.gz
make_examples.tfrecord-00000-of-00008.gz.example_info.json
make_examples.tfrecord-00001-of-00008.gz
make_examples.tfrecord-00001-of-00008.gz.example_info.json
make_examples.tfrecord-00002-of-00008.gz
make_examples.tfrecord-00002-of-00008.gz.example_info.json
make_examples.tfrecord-00003-of-00008.gz
make_examples.tfrecord-00003-of-00008.gz.example_info.json
make_examples.tfrecord-00004-of-00008.gz
make_examples.tfrecord-00004-of-00008.gz.example_info.json
make_examples.tfrecord-00005-of-00008.gz
make_examples.tfrecord-00005-of-00008.gz.example_info.json
make_examples.tfrecord-00006-of-00008.gz
make_examples.tfrecord-00006-of-00008.gz.example_info.json
make_examples.tfrecord-00007-of-00008.gz
make_examples.tfrecord-00007-of-00008.gz.example_info.json
postprocess_variants.log

@kishwarshafin
Collaborator

@MiWitt, can you please send the full command here for each step? It seems like you have 8 files and you are setting @1?

@MiWitt
Author

MiWitt commented May 10, 2024

I do not run it step by step. I run "run_deepvariant". This is my command:

 singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
    /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=PACBIO \
    --ref=${THEREF} \
    --reads="${ALIGNMENTNAME}.bam" \
    --sample_name=${SAMPLENAME} \
    --output_vcf="./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
    --output_gvcf="./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
    --intermediate_results_dir . \
    --num_shards=8 \
    --logging_dir=.

I have now added the following command, which is a workaround for the problem ...

    if ! [ -f "./${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
    then
       singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
         /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
         /opt/deepvariant/bin/postprocess_variants \
         --ref="${THEREF}" \
         --infile "./call_variants_output@$(ls ./call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
         --outfile "./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
         --cpus "8" \
         --gvcf_outfile "./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
         --nonvariant_site_tfrecord_path "./gvcf.tfrecord@$(ls ./gvcf.tfrecord*.gz | wc -l).gz" \
         --sample_name=${SAMPLENAME}
    fi

For the directory listing above, this workaround sets --infile to "./call_variants_output@1.tfrecord.gz" and --nonvariant_site_tfrecord_path to "./gvcf.tfrecord@8.gz".
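
For readers unfamiliar with the name@N notation: judging by the directory listing above, a spec such as ./gvcf.tfrecord@8.gz stands for the eight files gvcf.tfrecord-00000-of-00008.gz through gvcf.tfrecord-00007-of-00008.gz. The following is a minimal shell sketch of that expansion. The function expand_sharded_spec is a hypothetical helper written for illustration; it is not part of DeepVariant and may not cover every case DeepVariant itself accepts.

```shell
# Expand a sharded file spec such as "./gvcf.tfrecord@8.gz" into shard paths.
# Hypothetical helper for illustration only; it is not part of DeepVariant.
# The body runs in a subshell "( ... )" so its variables stay local.
expand_sharded_spec() (
  spec=$1
  # Pass the path through unchanged when it contains no "@".
  case $spec in
    *@*) ;;
    *) printf '%s\n' "$spec"; return 0 ;;
  esac
  prefix=${spec%@*}   # text before the last "@"
  rest=${spec##*@}    # text after the last "@", e.g. "8.gz"
  count=${rest%%.*}   # shard count: the characters before the first "."
  case $rest in
    *.*) suffix=.${rest#*.} ;;
    *)   suffix= ;;
  esac
  # Pass the path through unchanged when the count is not a plain number.
  case $count in
    '' | *[!0-9]*) printf '%s\n' "$spec"; return 0 ;;
  esac
  # Drop leading zeros so printf does not read the count as octal.
  while [ "${#count}" -gt 1 ] && [ "${count#0}" != "$count" ]; do
    count=${count#0}
  done
  i=0
  while [ "$i" -lt "$count" ]; do
    printf '%s-%05d-of-%05d%s\n' "$prefix" "$i" "$count" "$suffix"
    i=$((i + 1))
  done
)
```

Calling expand_sharded_spec "./call_variants_output@1.tfrecord.gz" prints ./call_variants_output-00000-of-00001.tfrecord.gz, which is the file name that actually exists in the listing.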

@MiWitt
Author

MiWitt commented May 10, 2024

I was able to extract the three commands make_examples, call_variants and postprocess_variants from the output. Here they are:

seq 0 7 | parallel -q --halt 2 --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "stdchroms.hg38.fa" --reads "SAMPLENAME.bam" --examples "./make_examples.tfrecord@8.gz" --add_hp_channel --alt_aligned_pileup "diff_channels" --gvcf "./gvcf.tfrecord@8.gz" --max_reads_per_partition "600" --min_mapping_quality "1" --parse_sam_aux_fields --partition_size "25000" --phase_reads --pileup_image_width "199" --norealign_reads --sample_name "SAMPLENAME" --sort_by_haplotypes --track_ref_reads --vsc_min_fraction_indels "0.12" --task {}

/opt/deepvariant/bin/call_variants --outfile "./call_variants_output.tfrecord.gz" --examples "./make_examples.tfrecord@8.gz" --checkpoint "/opt/models/pacbio"

/opt/deepvariant/bin/postprocess_variants --ref "stdchroms.hg38.fa" --infile "./call_variants_output.tfrecord.gz" --outfile "./SAMPLENAME.deepVariant.vcf.gz" --cpus "8" --gvcf_outfile "./SAMPLENAME.deepVariant.g.vcf.gz" --nonvariant_site_tfrecord_path "./gvcf.tfrecord@8.gz" --sample_name "SAMPLENAME"

And here are the last two commands with their standard output ...

***** Running the command:*****
time /opt/deepvariant/bin/call_variants --outfile "./call_variants_output.tfrecord.gz" --examples "./make_examples.tfrecord@8.gz" --checkpoint "/opt/models/pacbio"

/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
I0510 12:13:42.483308 47501039724352 call_variants.py:563] Total 1 writing processes started.
I0510 12:13:42.487790 47501039724352 dv_utils.py:370] From ./make_examples.tfrecord-00000-of-00008.gz.example_info.json: Shape of input examples: [100, 199, 9], Channels of input examples: [1, 2, 3, 4, 5, 6, 7, 9, 10].
I0510 12:13:42.487916 47501039724352 call_variants.py:588] Shape of input examples: [100, 199, 9]
I0510 12:13:42.488451 47501039724352 call_variants.py:592] Use saved model: True
I0510 12:13:52.162126 47501039724352 dv_utils.py:370] From /opt/models/pacbio/example_info.json: Shape of input examples: [100, 199, 9], Channels of input examples: [1, 2, 3, 4, 5, 6, 7, 9, 10].
I0510 12:13:52.163805 47501039724352 dv_utils.py:370] From ./make_examples.tfrecord-00000-of-00008.gz.example_info.json: Shape of input examples: [100, 199, 9], Channels of input examples: [1, 2, 3, 4, 5, 6, 7, 9, 10].
I0510 12:13:56.551032 47501039724352 call_variants.py:716] Predicted 982 examples in 1 batches [0.419 sec per 100].
I0510 12:13:57.403082 47501039724352 call_variants.py:779] Complete: call_variants.

real	0m21.581s
user	1m40.583s
sys	0m15.744s

***** Running the command:*****
time /opt/deepvariant/bin/postprocess_variants --ref "stdchroms.hg38.fa" --infile "./call_variants_output.tfrecord.gz" --outfile "./SAMPLENAME.deepVariant.vcf.gz" --cpus "8" --gvcf_outfile "./SAMPLENAME.deepVariant.g.vcf.gz" --nonvariant_site_tfrecord_path "./gvcf.tfrecord@8.gz" --sample_name "SAMPLENAME"

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1419, in <module>
    app.run(main)
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1300, in main
    sample_name = get_sample_name()
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1203, in get_sample_name
    _, record = get_cvo_paths_and_first_record()
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1179, in get_cvo_paths_and_first_record
    raise ValueError(
ValueError: ('Found multiple file patterns in input filename space: ', './call_variants_output.tfrecord.gz')

real	0m4.925s
user	0m8.815s
sys	0m7.379s

@kishwarshafin
Collaborator

@MiWitt ,

Given that you are using --intermediate_results_dir . \ which writes all intermediate files to your directory, if you run the same command multiple times then it will create multiple patterns. Can you please create a clean intermediate directory and use that for --intermediate_results_dir /path/to/intermediate_dir? That should resolve the issue.
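
One way to follow this suggestion is to create the intermediate directory up front and refuse to continue if it already holds files from an earlier run. The snippet below is only a sketch of that idea; prepare_intermediate_dir is a hypothetical helper name, and the path passed to it is whatever you would otherwise give to --intermediate_results_dir.

```shell
# Create an intermediate directory and refuse to reuse a non-empty one.
# "prepare_intermediate_dir" is a hypothetical helper, not a DeepVariant tool.
# The body runs in a subshell "( ... )" so its variables stay local.
prepare_intermediate_dir() (
  dir=$1
  mkdir -p "$dir" || return 1
  # Leftover files from an earlier run could add a second file pattern.
  if [ -n "$(ls -A "$dir")" ]; then
    echo "refusing to reuse non-empty directory: $dir" >&2
    return 1
  fi
  # Print the directory so the caller can capture it.
  printf '%s\n' "$dir"
)

# Example use (commented out so the snippet has no side effects):
# intermediate_dir=$(prepare_intermediate_dir /path/to/intermediate_dir) || exit 1
# ... run_deepvariant ... --intermediate_results_dir "$intermediate_dir"
```

Failing early here is a deliberate choice: a stale shard file is easier to notice before make_examples starts than after postprocess_variants has already failed.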

@MiWitt
Author

MiWitt commented May 13, 2024

That cannot be the cause. I am working in a cluster environment using Slurm, and the directory "." is the job-specific scratch directory, which is located at "/scratch/SlurmTMP/JobSpecificFolder" (${TMPDIR}).


cd ${TMPDIR}
BIN_VERSION="1.6.1"
module load singularity/3.5.2


#####################################################################
# singularity pull docker://google/deepvariant:"${BIN_VERSION}"


ulimit -u 10000 # https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable/54746150#54746150

#  --model_type=PACBIO \ ##Replace this string with exactly one of the following [WGS,WES,PACBIO,HYBRID_PACBIO_ILLUMINA]**
#  docker://google/deepvariant:"${BIN_VERSION}" \

if ! [ -f "${WORKINDIR}/${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
then
  cp "${THEREF}"* ./
  cp "${WORKINDIR}/${ALIGNMENTNAME}.bam"* .
  chmod 666 `basename "${THEREF}"`*
  chmod 666 "${ALIGNMENTNAME}.bam"*
  singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
    /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=PACBIO \
    --ref=`basename "${THEREF}"` \
    --reads="${ALIGNMENTNAME}.bam" \
    --sample_name=${SAMPLENAME} \
    --output_vcf="./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
    --output_gvcf="./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
    --intermediate_results_dir . \
    --num_shards=8 \
    --logging_dir=.
    
    if ! [ -f "./${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
    then
       singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
         /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
         /opt/deepvariant/bin/postprocess_variants \
         --ref=`basename "${THEREF}"` \
         --infile "./call_variants_output@$(ls ./call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
         --outfile "./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
         --cpus "8" \
         --gvcf_outfile "./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
         --nonvariant_site_tfrecord_path "./gvcf.tfrecord@$(ls ./gvcf.tfrecord*.gz | wc -l).gz" \
         --sample_name=${SAMPLENAME}
    fi
    cp *.log ${WORKINDIR}/
    cp "./${ALIGNMENTNAME}.deepVariant.vcf.gz"* ${WORKINDIR}/
else
 cp "${WORKINDIR}/${ALIGNMENTNAME}.deepVariant.vcf.gz"* .
fi

@kishwarshafin
Collaborator

@MiWitt ,

Can you use --intermediate_results_dir ./intermediate_results_${ALIGNMENTNAME}? I am unsure why you are running postprocessing separately, but something must be overwriting the files or generating multiple file patterns in the same directory where you are saving everything. One way to debug this is to set --dry_run=true for each command, look at the outputs, and see if they match each other. Unfortunately, I don't have access to an HPC system to replicate this issue. I tried running your script, but it has many missing variables.
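
To check by hand whether a directory holds more than one naming pattern for the same prefix, something like the following could help. It is a hypothetical diagnostic, not the check that postprocess_variants performs internally: it simply masks the shard index in names of the form prefix-00000-of-0000N so that every shard of one set collapses to a single line.

```shell
# Print the distinct naming patterns among files whose names start with a prefix.
# Shard indices are masked so that all shards of one set collapse to one line.
# "list_file_patterns" is a hypothetical diagnostic, not DeepVariant's own check.
# The body runs in a subshell "( ... )" so its variables stay local.
list_file_patterns() (
  dir=$1
  prefix=$2
  for path in "$dir"/"$prefix"*; do
    [ -e "$path" ] || continue   # the glob matched nothing
    basename "$path"
  done | sed 's/-[0-9]\{5\}-of-\([0-9]\{5\}\)/@\1/' | sort -u
)

# Example use (commented out so the snippet has no side effects):
# list_file_patterns . call_variants_output
```

More than one line of output for a prefix means that files with different naming schemes share the directory, for example an unsharded call_variants_output.tfrecord.gz next to call_variants_output-00000-of-00001.tfrecord.gz.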

@kishwarshafin
Collaborator

@MiWitt

Hi, do you have any updates on this issue?

@kishwarshafin
Collaborator

@MiWitt , I am closing the issue due to inactivity. Please feel free to reopen if you have any updates.

@EgorGuga

EgorGuga commented Oct 10, 2024

@kishwarshafin still same problem

@EgorGuga

Something is wrong in get_cvo_paths_and_first_record(): it cannot properly parse call_variants_output-00000-of-00001.tfrecord.gz.
And maybe run_deepvariant.py needs to be changed (at least in Docker) so that multiprocessing in postprocess_variants is used properly.

@MiWitt
Author

MiWitt commented Oct 10, 2024

@EgorGuga
You can use my workaround above, which solved the problem for me.
If the --output_vcf from /opt/deepvariant/bin/run_deepvariant does not exist, run /opt/deepvariant/bin/postprocess_variants in a separate step.

if ! [ -f "./${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
    then
       singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
         /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
         /opt/deepvariant/bin/postprocess_variants \
         --ref=`basename "${THEREF}"` \
         --infile "./call_variants_output@$(ls ./call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
         --outfile "./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
         --cpus "8" \
         --gvcf_outfile "./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
         --nonvariant_site_tfrecord_path "./gvcf.tfrecord@$(ls ./gvcf.tfrecord*.gz | wc -l).gz" \
         --sample_name=${SAMPLENAME}
    fi

@EgorGuga

@MiWitt, yes, thanks for that solution, I did a similar thing in the run_deepvariant script

@mulderdt

mulderdt commented Jan 2, 2025

For anyone struggling with this error in nextflow I adapted the above answer into a strategy that works in a nextflow process.

Because Nextflow exits on a non-zero exit status, the 'if then' strategy in the previous solution doesn't work: the process exits before the alternate attempt. Instead, we can use the || construct. It executes the deepvariant call, and if that chokes because it looks for call_variants_output.tfrecord.gz when there is only a call_variants_output-00000-of-00001.tfrecord.gz file, it executes the second postprocess_variants call instead.

I don't know why the call_variants_output.tfrecord.gz file is being given a sharded name when the unsharded name path is hard-coded here:

intermediate_results_dir, 'call_variants_output.tfrecord.gz'

I assume a bug?

    /opt/deepvariant/bin/run_deepvariant \
            --model_type ${deepvariant_model_string} \
            --ref ${reference_fa} \
            --reads ${bam} \
            --output_gvcf output.${chromosome}.g.vcf.gz \
            --output_vcf output.${chromosome}.vcf.gz \
            --num_shards 16 \
            --regions ${chromosome} \
            --sample_name ${SAMPLENAME} \
            --intermediate_results_dir ./tmp \
            --call_variants_extra_args='allow_empty_examples=true' || \
    /opt/deepvariant/bin/postprocess_variants \
            --ref=${reference_fa} \
            --infile "./tmp/call_variants_output@\$(ls ./tmp/call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
            --outfile "output.${chromosome}.vcf.gz" \
            --cpus "16" \
            --gvcf_outfile "output.${chromosome}.g.vcf.gz" \
            --nonvariant_site_tfrecord_path "./tmp/gvcf.tfrecord@\$(ls ./tmp/gvcf.tfrecord*.gz | wc -l).gz" \
            --sample_name=${SAMPLENAME}
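
The || pattern used above can be seen in isolation in the small sketch below. It assumes the script runs with "set -e" semantics (abort on the first failing command), which matches the behaviour described above; primary_step and fallback_step are hypothetical stand-ins for the run_deepvariant and postprocess_variants calls.

```shell
# Stand-ins for the two commands; both names are hypothetical placeholders.
primary_step() {
  echo "primary failed" >&2
  return 1
}

fallback_step() {
  echo "fallback ran"
}

# With "set -e", a failing command normally aborts the script. A command on
# the left side of "||" is exempt, so the fallback still gets a chance to run.
run_with_fallback() (
  set -e
  primary_step || fallback_step
  echo "script continued"
)

run_with_fallback
```

Two things are worth keeping in mind. The fallback runs after any failure of the first command, not only after this particular ValueError, so a genuinely broken run could be masked. And in a Nextflow script block the backslash in \$(ls ...) keeps Nextflow from interpolating the dollar sign, so that bash performs the command substitution at run time.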

@kishwarshafin
Collaborator

kishwarshafin commented Jan 3, 2025

@mulderdt,

I would suggest running your command with --dry_run=true and see what commands are being run exactly, then I would suggest running them separately on nextflow. run_deepvariant already wraps around all three processes, so I believe if you are using the same output directory with exact same file names then the current setup will fail.

Also please attach the log on where it is failing.

kishwarshafin reopened this Jan 3, 2025
@mulderdt

mulderdt commented Jan 5, 2025

@kishwarshafin Thanks,
In a previous version of deepvariant (v1.5.0) executed with regions input using nextflow we were using

/opt/deepvariant/bin/run_deepvariant \
        --model_type ${deepvariant_model_string} \
        --ref ${reference_fa} \
        --reads ${bam} \
        --output_gvcf output.${chromosome}.g.vcf.gz \
        --output_vcf output.${chromosome}.vcf.gz \
        --num_shards 16 \
        --regions ${chromosome} \
        --sample_name ${SAMPLENAME} \
        --intermediate_results_dir ./tmp

After changing to v1.8.0 I ran the same code but got the problem discussed in this thread.

The following code, however, ran successfully. But after running on HG002, HG003 and HG004 and benchmarking with hap.py, I found that I had much worse performance than with v1.5.0.

/opt/deepvariant/bin/run_deepvariant \
        --model_type ${deepvariant_model_string} \
        --ref ${reference_fa} \
        --reads ${bam} \
        --output_gvcf output.${chromosome}.g.vcf.gz \
        --output_vcf output.${chromosome}.vcf.gz \
        --num_shards 16 \
        --regions ${chromosome} \
        --sample_name ${SAMPLENAME} \
        --intermediate_results_dir ./tmp \
        --call_variants_extra_args='allow_empty_examples=true' || \
/opt/deepvariant/bin/postprocess_variants \
        --ref=${reference_fa} \
        --infile "./tmp/call_variants_output@\$(ls ./tmp/call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
        --outfile "output.${chromosome}.vcf.gz" \
        --cpus "16" \
        --gvcf_outfile "output.${chromosome}.g.vcf.gz" \
        --nonvariant_site_tfrecord_path "./tmp/gvcf.tfrecord@\$(ls ./tmp/gvcf.tfrecord*.gz | wc -l).gz" \
        --sample_name=${SAMPLENAME}

I didn't want to run the hacky code above, so on a hunch I tried running the deepvariant call without the --intermediate_results_dir option. It ran successfully, and I had improved performance over v1.5.0.

/opt/deepvariant/bin/run_deepvariant \
        --model_type ${deepvariant_model_string} \
        --ref ${reference_fa} \
        --reads ${bam} \
        --output_gvcf output.${chromosome}.g.vcf.gz \
        --output_vcf output.${chromosome}.vcf.gz \
        --num_shards 16 \
        --regions ${chromosome} \
        --sample_name ${SAMPLENAME} \
        --call_variants_extra_args='allow_empty_examples=true'

Can you help me understand what the default intermediate results folder is if not explicitly defined using that parameter? I don't see the intermediate results anymore anywhere in the nextflow work folders. Are they simply not written or are they getting written to somewhere like /var? I just want to make sure that there isn't going to be issues if the same slurm node is executing multiple deepvariant jobs.

Thanks.

@danielecook
Collaborator

@mulderdt

> Can you help me understand what the default intermediate results folder is if not explicitly defined using that parameter?

The default intermediate results dir is a temporary directory defined using tempfile.mkdtemp(). It should be unique.

> I don't see the intermediate results anymore anywhere in the nextflow work folders.

By default, DeepVariant will output to /tmp. If you are using Docker, I believe the data is removed once the process is complete. If you are using Singularity, it might still be stored in /tmp. I'm not positive on either of these, though; you'd have to check.

> Are they simply not written or are they getting written to somewhere like /var?

They are being written to a temporary directory within /tmp.

> I just want to make sure that there isn't going to be issues if the same slurm node is executing multiple deepvariant jobs.

I'm not sure why you observe poor performance with the longer command you provided. You have a lot of options in there, though, where I think the defaults should be sufficient. My current guess is that you are mounting the tmp directory within your work directory; then you could have multiple processes writing to the same intermediate folder, which in this case would be /tmp locally but WORKDIR/tmp for each process...?

An option with Singularity like --bind $(pwd)/tmp:/tmp would do this.
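
If you would rather control the location yourself while still keeping it unique per job, a shell-side analogue of tempfile.mkdtemp() is mktemp -d. The sketch below assumes that TMPDIR points at job-specific scratch space (falling back to /tmp when unset); the function name and the "dv_intermediate" prefix are made up for illustration.

```shell
# Create a unique scratch directory, similar in spirit to Python's
# tempfile.mkdtemp(). The function name and the "dv_intermediate" prefix are
# illustrative; TMPDIR falls back to /tmp when it is unset.
make_unique_intermediate_dir() {
  mktemp -d "${TMPDIR:-/tmp}/dv_intermediate.XXXXXX"
}

# Example use (commented out so the snippet has no side effects):
# intermediate_dir=$(make_unique_intermediate_dir) || exit 1
# ... run_deepvariant ... --intermediate_results_dir "$intermediate_dir"
```

Because every call returns a fresh directory, two jobs on the same node would not share intermediate files, provided the directory is visible at the same path inside the container.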
