@sacdallago hi, thank you very much for your great data curation. I am planning to use the AAV dataset for my research.
I found that some deletion masks may not have been properly applied to the wild-type sequence: as the image below shows, there are 29 sequences whose mutation_mask differs from the wild type but whose full_aa_sequence is identical to it. Is this the intended result?
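For readers who want to reproduce this kind of check, here is a minimal sketch. It is not the original replication code. It assumes the table has `mutation_mask`, `full_aa_sequence`, and `category` columns and that an unmutated mask is written as a run of underscores; the toy sequences, the mask format, and the helper name `find_unchanged_variants` are hypothetical.

```python
import csv
import io

# Toy stand-in for the AAV table. The sequences and masks are made up;
# only the column names are taken from the discussion in this thread.
TOY_CSV = """mutation_mask,full_aa_sequence,category
_____,MKTAY,wild_type
__*__,MKTAY,stop
_A___,MATAY,designed
*____,MKTAY,stop
"""

WILD_TYPE = "MKTAY"  # hypothetical wild-type sequence
WT_MASK = "_____"    # assumption: an unmutated mask is all underscores


def find_unchanged_variants(rows, wild_type, wt_mask):
    """Return rows whose mask differs from the wild-type mask
    but whose full sequence is still identical to the wild type."""
    return [
        row for row in rows
        if row["mutation_mask"] != wt_mask
        and row["full_aa_sequence"] == wild_type
    ]


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
suspicious = find_unchanged_variants(rows, WILD_TYPE, WT_MASK)
for row in suspicious:
    print(row["mutation_mask"], row["category"])
```

On the real data, `rows` would be read from the dataset CSV instead of the in-memory string, and the wild-type sequence and mask would be taken from the dataset itself.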
I believe these may be sequences containing stop codons, which are sometimes represented with '*' (this is also implied by these sequences having the value 'stop' in the category column). There are a few extra variants containing stop codons that end up with different sequences from those above because they also contain other mutations. If that's right, then I think (i) all such variants should be excluded from all splits, since models do not encode the stop codon and so cannot predict the fitness of these sequences, and (ii) the README file https://github.com/J-SNACKKB/FLIP/tree/main/splits/aav should be corrected to say that "*" in the mutation mask and mutated region means a stop codon, not a deletion.
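A rough sketch of the exclusion suggested in (i), under the assumption that a stop codon shows up either as '*' in `mutation_mask` or `mutated_region`, or as the value 'stop' in `category`. The toy rows and the helper names `has_stop_codon` and `drop_stop_variants` are hypothetical, and this is not how the FLIP splits are actually built.

```python
def has_stop_codon(row):
    """Heuristic: flag a variant as stop-containing if '*' appears in its
    mask or mutated region, or if its category is 'stop'."""
    return (
        "*" in row.get("mutation_mask", "")
        or "*" in row.get("mutated_region", "")
        or row.get("category") == "stop"
    )


def drop_stop_variants(rows):
    """Keep only the variants that are not flagged as containing a stop codon."""
    return [row for row in rows if not has_stop_codon(row)]


# Made-up rows for illustration only.
toy_rows = [
    {"mutation_mask": "_____", "mutated_region": "MKTAY", "category": "wild_type"},
    {"mutation_mask": "__*__", "mutated_region": "MK*AY", "category": "stop"},
    {"mutation_mask": "_A_*_", "mutated_region": "MAT*Y", "category": "stop"},
    {"mutation_mask": "_A___", "mutated_region": "MATAY", "category": "designed"},
]

kept = drop_stop_variants(toy_rows)
print(len(toy_rows) - len(kept), "stop-codon variants removed")
```

Checking all three columns is deliberately redundant, so a variant is still caught if only one of them records the stop codon.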
Some of these sequences contain stop codons which are effectively 'insertions' and some contain stop codons which are 'substitutions'. The two cases aren't distinguished by mutation_mask.