Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to subset .vcf.gz file to include only variants whose genomic coordinates are given in a list #2332

Open
annilk opened this issue Dec 9, 2024 · 7 comments

Comments

@annilk
Copy link

annilk commented Dec 9, 2024

Hi,

I would like to create a subset of a large .vcf.gz file so that I would be able to read it in R with read.vcfR from the vcfR package (I get memory issues if I try to read the non-subsetted .vcf.gz file). I only need certain variants given in a list. What I have tried:

~/bcftools-1.12/bcftools view -T snplist.txt hbcs_sisu_b38.vcf.gz -o hbcs_sisu_b38_subset.vcf.gz

The 'snplist.txt' is tab-delimited and includes columns '#CHROM' and 'POS' (not sure if they were required).

I have also tried option '-R' instead of '-T' for the 'view' command, and command 'filter' instead of 'view' with both options '-T' and '-R'. But depending on which variants are included in snplist.txt, in the subsetted there is always either just one variant or no variants at all, even though

less -S hbcs_sisu_b38.vcf.gz | grep -f snplist.txt

prints lines for more variants.

I am not sure if .csi file was required here, but I have created hbcs_sisu_b38.vcf.gz.csi like this:

~/bcftools-1.12/bcftools index hbcs_sisu_b38.vcf.gz

@pd3
Copy link
Member

pd3 commented Dec 9, 2024

The command looks correct. This is a very basic functionality, so it's strange it wouldn't work. Can you try to upgrade to the latest version of bcftools, we are at 1.21 now. If there is something wrong with the input data, the newer version might give some informative error messages.

The -T option does not require an index, so it's unlikely that it is the problem.

If upgrading does not help, can you provide a small test case for us to reproduce the problem?

@annilk
Copy link
Author

annilk commented Dec 9, 2024

Hi,

Thanks for the fast reply. I downloaded and installed version 1.21 but now I get an error message saying 'Could not parse 2-th line of file snplist.txt, using the columns 1,2[,3] Failed to read the targets: snplist.txt'

Here is a head of snplist.txt:
#CHROM POS
1 19831748
1 30185237
1 30187395

Head of hbcs_sisu_b38.vcf.gz would be quite massive so I copy-pasted here only seven first columns of the output when I run less -S hbcs_sisu_b38.vcf.gz | grep -f snplist.txt:

#CHROM POS ID REF ALT QUAL FILTER
chr1 19831748 rs4509550 T C . PASS
chr1 30185237 rs7536179 T C . PASS
chr1 30187395 rs11371593 T TG . PASS

@annilk
Copy link
Author

annilk commented Dec 14, 2024

Let me know if you need more information to be able to reproduce the problem!

@pd3
Copy link
Member

pd3 commented Jan 2, 2025

Is your file tab-delimited as described in the documentation? http://samtools.github.io/bcftools/bcftools.html#common_options

Regions can be specified either on command line or in a VCF, BED, or tab-delimited file (the default)

@annilk
Copy link
Author

annilk commented Jan 4, 2025

Yes, I think it is tab-delimited. I'm not sure which way would be the best way to verify this, but if I run
zcat hbcs_sisu_b38.vcf.gz | grep "^#CHROM" | awk -F'\t' '{for(i=1;i<=NF;i++) print "Column " i " is tab-delimited: " $i}'

it prints
Column 1 is tab-delimited: #CHROM
Column 2 is tab-delimited: POS
Column 3 is tab-delimited: ID
Column 4 is tab-delimited: REF
Column 5 is tab-delimited: ALT
Column 6 is tab-delimited: QUAL
Column 7 is tab-delimited: FILTER
Column 8 is tab-delimited: INFO
Column 9 is tab-delimited: FORMAT
...
until the last column.

And I get the same column names by running zcat hbcs_sisu_b38.vcf.gz | grep "^#CHROM".

My snplist.txt is also tab-delimited.

@pd3
Copy link
Member

pd3 commented Jan 22, 2025

You are showing what the VCF looksl ike, but the problem would be with the site list. Try

cat snplist.txt | head -2 | od -c

or

cat snplist.txt | head -2 | hexdump -Cc

@annilk
Copy link
Author

annilk commented Jan 25, 2025

Running cat snplist.txt | head -2 | od -c gives

0000000 # C H R O M \t P O S \n 1 \t 1 9 8
0000020 3 1 7 4 8 \n
0000026

and running cat snplist.txt | head -2 | hexdump -Cc gives

00000000 23 43 48 52 4f 4d 09 50 4f 53 0a 31 09 31 39 38 |#CHROM.POS.1.198|
0000000 # C H R O M \t P O S \n 1 \t 1 9 8
00000010 33 31 37 34 38 0a |31748.|
0000010 3 1 7 4 8 \n
0000016

And if I save snplist.txt without the header (so that bcftools-1.21 won't produce the error I wrote above), running cat snplist.txt | head -2 | od -c gives

0000000 1 \t 1 9 8 3 1 7 4 8 \n 1 \t 3 0 1
0000020 8 5 2 3 7 \n
0000026

and running cat snplist.txt | head -2 | hexdump -Cc gives

00000000 31 09 31 39 38 33 31 37 34 38 0a 31 09 33 30 31 |1.19831748.1.301|
0000000 1 \t 1 9 8 3 1 7 4 8 \n 1 \t 3 0 1
00000010 38 35 32 33 37 0a |85237.|
0000010 8 5 2 3 7 \n
0000016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants