Question about dataset curation relating to res_idx #13
Comments
Hi, is there a chain break, i.e. missing coordinates, where the sequence break is? It's been a while since I plumbed through PDB files, so I'm not sure why we should use one or the other... I couldn't tell from your example whether the relative index distances are the same in the two scenarios. We always re-number the residue indices anyway: https://github.com/jasonkyuyim/multiflow/blob/main/multiflow/data/datasets.py#L86
Thanks for your quick reply! Based on my understanding, this issue is typically caused by chain breaks. When there is a chain break, let’s denote the missing residue as dot '.'. Suppose there are two chain breaks; if we use author_res_num, the residue indices for a sequence would appear as follows:
The input tensor would thus be [15, 16, 17, 18, 101, 102, 103, 31, 32]. That is what your saved .pkl looks like. However, considering the missing residues, the true residue indices should be:
So, the input tensor should be [15, 16, 17, 18, 23, 24, 25, 29, 30]. It seems that re-numbering only ensures that the residue index starts at 1. Additionally, since we are using absolute embedding (link to node_feature_net.py), I'm uncertain whether this type of input might mislead the model or affect its performance. I believe it affects both absolute position encoding and relative position encoding. I hope this clarifies the matter.
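To make the difference concrete, here is a minimal sketch that compares the two index lists from this comment. It assumes that re-numbering only shifts the indices so the smallest becomes 1, as described above; `renumber_from_one` and `neighbor_offsets` are hypothetical helpers written for illustration, not functions from the multiflow code.

```python
def renumber_from_one(res_idx):
    """Shift indices so the smallest becomes 1 (assumed re-numbering behavior)."""
    offset = min(res_idx) - 1
    return [i - offset for i in res_idx]


def neighbor_offsets(res_idx):
    """Index difference between each pair of consecutive residues."""
    return [b - a for a, b in zip(res_idx, res_idx[1:])]


# The two example tensors from the comment above.
author_idx = [15, 16, 17, 18, 101, 102, 103, 31, 32]
seq_idx = [15, 16, 17, 18, 23, 24, 25, 29, 30]

author_renum = renumber_from_one(author_idx)
seq_renum = renumber_from_one(seq_idx)

# A shift does not change the distances between residues,
# so the mismatch survives re-numbering.
author_gaps = neighbor_offsets(author_renum)
seq_gaps = neighbor_offsets(seq_renum)

print(author_renum, author_gaps)
print(seq_renum, seq_gaps)
```

Under that assumption, both the absolute indices and the neighbor offsets still differ between the two numbering schemes after the shift, which is why both absolute and relative position encodings would be affected.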
Hi.
I noticed that in your provided processed dataset, every protein's residue indices are labeled with "author_res_num", which may give wrong sequence connectivity when the index embedding is applied. For example, for PDB entry 5yc8, the real sequence at _entity_poly_seq.num 215-221 is ...-SER-LYS-SER-ARG-ILE-ALA-ASP-..., where SER-ARG-ILE are missing. The author labeled SER-LYS as 213-214 but ALA-ASP as 1001-1002, so there is a huge gap between 214 and 1001 even though only three residues are missing between them. Moreover, for this protein, labeled_res_num 329-330-... are labeled by the author as 380-381-.... These residues are much further from the fragment -SER-LYS-SER-ARG-ILE- than ALA-ASP (1001-1002) is, yet their residue indices are closer. I wonder whether this embedding encodes wrong positional information and hurts the model's performance.
Thanks.
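A minimal sketch of how such inconsistencies could be flagged, assuming the sequence-based and author-assigned numbers are available as two aligned lists for the observed residues. `find_numbering_mismatches` is a hypothetical helper, and the lists contain only the 5yc8 residues quoted above, not the full chain.

```python
def find_numbering_mismatches(seq_ids, author_ids):
    """Return (seq_prev, seq_next, seq_step, author_step) for every pair of
    consecutive residues whose author step differs from the sequence step."""
    mismatches = []
    pairs = list(zip(seq_ids, author_ids))
    for (s0, a0), (s1, a1) in zip(pairs, pairs[1:]):
        seq_step, author_step = s1 - s0, a1 - a0
        if seq_step != author_step:
            mismatches.append((s0, s1, seq_step, author_step))
    return mismatches


# Only the residues mentioned in this post (5yc8); the rest of the chain is omitted.
seq_ids = [215, 216, 220, 221, 329, 330]         # sequence-based numbering
author_ids = [213, 214, 1001, 1002, 380, 381]    # author numbering

mismatches = find_numbering_mismatches(seq_ids, author_ids)
for seq_prev, seq_next, seq_step, author_step in mismatches:
    print(f"{seq_prev}->{seq_next}: sequence step {seq_step}, author step {author_step}")
```

Each reported pair is a place where an index embedding built from author numbering would see a different distance than the one implied by the sequence.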