how to subset .vcf.gz file to include only variants whose genomic coordinates are given in a list #2332
Comments
The command looks correct. This is very basic functionality, so it's strange that it wouldn't work. Can you try upgrading to the latest version of bcftools? We are at 1.21 now. If there is something wrong with the input data, the newer version might give a more informative error message. The -T option does not require an index, so the index is unlikely to be the problem. If upgrading does not help, can you provide a small test case for us to reproduce the problem?
Hi,
Thanks for the fast reply. I downloaded and installed version 1.21, but now I get an error message saying
'Could not parse 2-th line of file snplist.txt, using the columns 1,2[,3] Failed to read the targets: snplist.txt'
Here is a head of snplist.txt:
The head of hbcs_sisu_b38.vcf.gz would be quite massive, so I copy-pasted here only the first seven columns of the output when I run
#CHROM POS ID REF ALT QUAL FILTER
Let me know if you need more information to be able to reproduce the problem!
Is your file tab-delimited as described in the documentation? http://samtools.github.io/bcftools/bcftools.html#common_options
Yes, I think it is tab-delimited. I'm not sure what the best way to verify this would be, but if I run it prints And I get the same column names by running My snplist.txt is also tab-delimited.
You are showing what the VCF looks like, but the problem would be with the site list. Try
or
Running
0000000 # C H R O M \t P O S \n 1 \t 1 9 8
and running
00000000 23 43 48 52 4f 4d 09 50 4f 53 0a 31 09 31 39 38 |#CHROM.POS.1.198|
And if I save snplist.txt without the header (so that bcftools-1.21 won't produce the error I wrote above), running
0000000 1 \t 1 9 8 3 1 7 4 8 \n 1 \t 3 0 1
and running
00000000 31 09 31 39 38 33 31 37 34 38 0a 31 09 33 30 31 |1.19831748.1.301|
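The byte dumps above only show the first 16 bytes of snplist.txt, so a problem further down the file (a space instead of a tab, a Windows line ending, a non-numeric position) would not be visible. As a rough sketch, the whole file could be checked line by line in Python. The function name `check_site_list` is hypothetical, and skipping lines that start with `#` is an assumption made for this sketch, not a statement about how bcftools parses targets files.

```python
def check_site_list(path):
    """Return a list of (line_number, reason) problems in a CHROM<TAB>POS file."""
    problems = []
    # Read bytes so that stray carriage returns are not hidden by newline translation.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.rstrip(b"\n")
            if line.endswith(b"\r"):
                problems.append((number, "Windows line ending (\\r\\n)"))
                line = line[:-1]
            # Assumption for this sketch: blank lines and '#' lines are ignored.
            if not line or line.startswith(b"#"):
                continue
            fields = line.split(b"\t")
            if len(fields) < 2:
                problems.append((number, "fewer than two tab-separated columns"))
                continue
            if not fields[1].isdigit():
                problems.append((number, "POS is not an integer"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_site_list.py snplist.txt
    for number, reason in check_site_list(sys.argv[1]):
        print(f"line {number}: {reason}")
```

An empty result means every data line has at least two tab-separated columns and an integer in the second one. It does not prove that bcftools will accept the file.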
Hi,
I would like to create a subset of a large .vcf.gz file so that I would be able to read it in R with read.vcfR from the vcfR package (I get memory issues if I try to read the non-subsetted .vcf.gz file). I only need certain variants given in a list. What I have tried:
~/bcftools-1.12/bcftools view -T snplist.txt hbcs_sisu_b38.vcf.gz -o hbcs_sisu_b38_subset.vcf.gz
The 'snplist.txt' is tab-delimited and includes columns '#CHROM' and 'POS' (not sure if they were required).
I have also tried option '-R' instead of '-T' for the 'view' command, and the 'filter' command instead of 'view' with both options '-T' and '-R'. But depending on which variants are included in snplist.txt, the subsetted file always contains either just one variant or no variants at all, even though
less -S hbcs_sisu_b38.vcf.gz | grep -f snplist.txt
prints lines for more variants.
I am not sure if a .csi file is required here, but I have created hbcs_sisu_b38.vcf.gz.csi like this:
~/bcftools-1.12/bcftools index hbcs_sisu_b38.vcf.gz
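As a cross-check on how many variants should end up in the subset, the records can also be matched exactly on CHROM and POS outside bcftools. Note that `grep -f` treats each line of snplist.txt as a pattern that may match anywhere in a VCF line, so its count can differ from an exact coordinate match. The following is a minimal Python sketch, not how bcftools implements -T. The function names `load_sites` and `subset_vcf` are hypothetical, and it assumes a tab-delimited site list with chromosome names spelled exactly as in the VCF.

```python
import gzip


def load_sites(path):
    """Read CHROM<TAB>POS lines into a set of (chrom, pos) tuples."""
    sites = set()
    with open(path) as handle:
        for line in handle:
            # Assumption: '#' lines and blank lines carry no coordinates.
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos = line.rstrip("\n").split("\t")[:2]
            sites.add((chrom, int(pos)))
    return sites


def subset_vcf(vcf_gz, site_list, out_gz):
    """Stream a gzipped VCF, keep header lines and records at listed sites.

    Returns the number of variant records written.
    """
    sites = load_sites(site_list)
    kept = 0
    with gzip.open(vcf_gz, "rt") as src, gzip.open(out_gz, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # keep all header lines
                continue
            # Only the first two columns are needed for the match.
            chrom, pos, _ = line.split("\t", 2)
            if (chrom, int(pos)) in sites:
                dst.write(line)
                kept += 1
    return kept


if __name__ == "__main__":
    n = subset_vcf("hbcs_sisu_b38.vcf.gz", "snplist.txt", "hbcs_sisu_b38_subset.vcf.gz")
    print(f"kept {n} records")
```

The file is read one line at a time, so memory use stays small even for a large VCF. The output is ordinary gzip rather than BGZF, so it is meant for counting and for quick inspection, not for indexing with bcftools.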