Wrong deletion masking for AAV task? #18

Open
dlnp2 opened this issue Aug 23, 2022 · 1 comment

Comments

dlnp2 commented Aug 23, 2022

@sacdallago hi, thank you very much for your great data curation. I am planning to use the AAV dataset for my research.

I found that some deletion masks may not have been properly applied to the wild-type sequence: as the image below shows, there are 29 sequences with different mutation_mask values but the same full_aa_sequence as the wild type. Is this the intended result?

[Image: Screenshot 2022-08-23 17 23 25]

Below is the code for replication:

import pandas as pd
from Bio import SeqIO
wt_seq = str(next(SeqIO.parse("P03135.fasta", "fasta")).seq)
variant_effects = pd.read_csv("full_data.csv")
wild_types = variant_effects.loc[variant_effects["full_aa_sequence"] == wt_seq]
wild_types
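
To see how the matching rows differ from one another, one could tabulate their mutation_mask values. This is a minimal sketch on a hypothetical toy frame: the sequences and the mask notation are invented for illustration, and only the column names come from full_data.csv.

```python
import pandas as pd

# Hypothetical stand-in for full_data.csv; real sequences are much longer
# and the real mask notation may differ from the one invented here.
wt_seq = "MAADGY"
variant_effects = pd.DataFrame(
    {
        "full_aa_sequence": ["MAADGY", "MAADGY", "MAADGY", "MAKDGY"],
        "mutation_mask": ["______", "__*___", "____*_", "__K___"],
    }
)

# Rows whose full sequence is identical to the wild type.
wild_types = variant_effects.loc[variant_effects["full_aa_sequence"] == wt_seq]

# Count how many distinct masks map onto the unchanged wild-type sequence.
mask_counts = wild_types["mutation_mask"].value_counts()
n_distinct_masks = wild_types["mutation_mask"].nunique()
print(mask_counts)
print(f"{len(wild_types)} rows, {n_distinct_masks} distinct masks")
```

On the real data, more than one distinct mask among these rows would indicate that a mask entry was not reflected in full_aa_sequence.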

alex-hh commented Sep 8, 2023

I believe these may be sequences containing stop codons, which are sometimes represented with '*' (this is implied by these sequences having the value 'stop' in the category column). A few additional variants containing stop codons end up with sequences different from those above because they also contain other mutations. If that's right, then I think (i) all such variants should be excluded from all splits, since models do not encode the stop codon and so cannot predict the fitness of these sequences, and (ii) the README file https://github.com/J-SNACKKB/FLIP/tree/main/splits/aav should be corrected to say that "*" in the mutation mask and mutated region means a stop codon, not a deletion.

To identify all such rows:

import pandas as pd

variant_effects = pd.read_csv("full_data.csv")
stop_variants = variant_effects[variant_effects["category"]=="stop"]

This is equivalent to selecting all variants in which the mutation mask contains "*":

stop_variants = variant_effects[variant_effects["mutation_mask"].apply(lambda x: "*" in x)]
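
The claimed equivalence of the two selections can be checked directly by comparing the boolean masks row by row. The sketch below uses a hypothetical toy frame: the mask strings and the non-'stop' category labels are invented, and only the column names and the 'stop' value come from the thread.

```python
import pandas as pd

# Hypothetical toy rows; "wild_type" and "designed" are made-up category labels.
variant_effects = pd.DataFrame(
    {
        "mutation_mask": ["______", "__*___", "_K__*_", "__K___"],
        "category": ["wild_type", "stop", "stop", "designed"],
    }
)

by_category = variant_effects["category"] == "stop"
# regex=False treats "*" as a literal character; na=False maps missing masks to False.
by_mask = variant_effects["mutation_mask"].str.contains("*", regex=False, na=False)

# Rows where the two criteria disagree; empty if they select the same variants.
disagreements = variant_effects[by_category != by_mask]
print(
    f"{by_category.sum()} by category, {by_mask.sum()} by mask, "
    f"{len(disagreements)} disagreements"
)
```

Using str.contains with na=False instead of apply with a lambda also avoids a TypeError if any mutation_mask entry is missing.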

Some of these sequences contain stop codons which are effectively 'insertions' and some contain stop codons which are 'substitutions'. The two cases aren't distinguished by mutation_mask.
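
Suggestion (i) above, excluding stop-codon variants from all splits, could be sketched as a filter applied before any split is written out. The toy rows are hypothetical, and how the split files are actually generated in the repository is not shown here.

```python
import pandas as pd

# Hypothetical toy rows; only the column names and the "stop" label
# are taken from the thread.
variant_effects = pd.DataFrame(
    {
        "mutation_mask": ["______", "__*___", "__K___", "_K__*_"],
        "category": ["wild_type", "stop", "designed", "stop"],
    }
)

# Flag stop-codon variants, then keep everything else.
is_stop = variant_effects["category"] == "stop"
filtered = variant_effects.loc[~is_stop].reset_index(drop=True)

print(f"dropped {is_stop.sum()} of {len(variant_effects)} rows")
```

Since insertion-type and substitution-type stop codons are not distinguished by mutation_mask, this filter drops both kinds together.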
