Question about dataset curation relating to res_idx #13

Open
wanghch123 opened this issue Jan 10, 2025 · 2 comments


wanghch123 commented Jan 10, 2025

Hi.
I noticed that in the processed dataset you provide, every protein's residue indices are taken from "author_res_num", which can give the wrong sequence connectivity when the index embedding is applied. For example, in PDB entry 5yc8 the real sequence at _entity_poly_seq.num 215-221 is ...-SER-LYS-SER-ARG-ILE-ALA-ASP-..., where SER-ARG-ILE are missing. The author numbered SER-LYS as 213-214 but ALA-ASP as 1001-1002, so there is a huge gap between 214 and 1001 even though only three residues separate them in the actual sequence. Moreover, in the same protein, labeled_res_num 329-330-... are numbered by the author as 380-381-.... These residues are much further from the fragment -SER-LYS-SER-ARG-ILE- than ALA-ASP (1001-1002) is, yet their residue indices are closer. I wonder whether this embedding encodes wrong positional information and hurts the model's performance.
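To make the mismatch concrete, here is a minimal sketch that compares the gaps between consecutive resolved residues under the two numberings. It takes the numbers quoted above for 5yc8 at face value and does not read the actual mmCIF file; the variable names are only illustrative.

```python
import numpy as np

# Resolved residues in the 5yc8 fragment described above:
# SER, LYS, (SER, ARG, ILE missing), ALA, ASP.
label_seq_num = np.array([215, 216, 220, 221])     # _entity_poly_seq.num
author_res_num = np.array([213, 214, 1001, 1002])  # author numbering

# Gap between each resolved residue and the next one.
label_gaps = np.diff(label_seq_num)
author_gaps = np.diff(author_res_num)

print(label_gaps.tolist())   # [1, 4, 1]
print(author_gaps.tolist())  # [1, 787, 1]
```

The sequence-based numbering gives a gap of 4 across the three missing residues, while the author numbering gives 787 for the same pair.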
Thanks.

@jasonkyuyim
Owner

Hi, is there a chain break (i.e. missing coordinates) where the sequence break is? It's been a while since I plumbed through PDB files, so I'm not sure why we should use one or the other. I couldn't tell from your example whether the relative index distances are the same in the two scenarios. We always re-number the residue indices anyway (https://github.com/jasonkyuyim/multiflow/blob/main/multiflow/data/datasets.py#L86), so this isn't an issue as long as the relative index differences are the same regardless of the numbering scheme.
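As a rough check of that condition, the sketch below assumes the re-numbering is a constant shift that makes the smallest index 1. This has not been verified against datasets.py, and `shift_to_one` and `pairwise_offsets` are hypothetical helpers, not functions from the repository. Under that assumption the shift leaves every pairwise index difference untouched, so whatever gaps the input numbering has are carried through.

```python
import numpy as np

def shift_to_one(res_idx):
    """Hypothetical re-numbering: shift so the smallest index becomes 1."""
    res_idx = np.asarray(res_idx)
    return res_idx - res_idx.min() + 1

def pairwise_offsets(res_idx):
    """Matrix of index differences, offsets[i, j] = res_idx[j] - res_idx[i]."""
    res_idx = np.asarray(res_idx)
    return res_idx[None, :] - res_idx[:, None]

# Author numbering of the resolved 5yc8 residues quoted in this thread.
author = np.array([213, 214, 1001, 1002])
renumbered = shift_to_one(author)

print(renumbered.tolist())  # [1, 2, 789, 790]
# A constant shift does not change any relative offset.
print(np.array_equal(pairwise_offsets(author), pairwise_offsets(renumbered)))  # True
```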

Author

wanghch123 commented Jan 14, 2025

Thanks for your quick reply!

Based on my understanding, this issue is typically caused by chain breaks. Let's denote each missing residue at a chain break by a dot '.'. Suppose there are two chain breaks; if we use author_res_num, the residue indices for a sequence would appear as follows:

A   A   A   A   .   .   .   .   A    A    A    .   .   .   A   A
15  16  17  18                  101  102  103              31  32

The input tensor would thus be [15, 16, 17, 18, 101, 102, 103, 31, 32]. That is what your saved .pkl looks like.

However, considering the missing residues, the true residue indices should be:

A   A   A   A   .   .   .   .   A   A   A   .   .   .   A   A
15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30

So, the input tensor should be [15, 16, 17, 18, 23, 24, 25, 29, 30].
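One way to obtain indices of this kind, sketched here under the assumption that a per-residue mask of resolved positions over the full sequence is available (the `resolved` array below is made up to match the diagram, not read from a real file):

```python
import numpy as np

first_res_num = 15
# True = residue has coordinates, False = missing (the '.' in the diagram).
resolved = np.array(
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1], dtype=bool
)

# Number every position of the full sequence consecutively,
# then keep only the positions that are actually resolved.
full_idx = np.arange(first_res_num, first_res_num + resolved.size)
res_idx = full_idx[resolved]

print(res_idx.tolist())  # [15, 16, 17, 18, 23, 24, 25, 29, 30]
```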

It seems that re-numbering only ensures that the residue index starts at 1, so the gaps between indices are left unchanged. Additionally, since we are using an absolute embedding (link to node_feature_net.py), I'm uncertain whether this type of input might mislead the model or affect its performance. I believe it affects both the absolute and the relative position encoding.
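A small sketch of the concern, using the toy tensors above. The `sinusoidal_embedding` function is a generic transformer-style embedding written for illustration; it is an assumption and not necessarily what node_feature_net.py does. The shift to 1 is likewise assumed to be a constant offset.

```python
import numpy as np

def sinusoidal_embedding(res_idx, dim=8, max_period=10000.0):
    """Generic sinusoidal position embedding (illustrative only)."""
    res_idx = np.asarray(res_idx, dtype=np.float64)
    k = np.arange(dim // 2)
    freqs = 1.0 / (max_period ** (2 * k / dim))
    angles = res_idx[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

author = np.array([15, 16, 17, 18, 101, 102, 103, 31, 32])
true = np.array([15, 16, 17, 18, 23, 24, 25, 29, 30])

# Assumed re-numbering: shift both so they start at 1.
author_shifted = author - author.min() + 1
true_shifted = true - true.min() + 1

# Absolute encoding: which residues get the same embedding in both schemes?
emb_author = sinusoidal_embedding(author_shifted)
emb_true = sinusoidal_embedding(true_shifted)
same = np.isclose(emb_author, emb_true).all(axis=1)
print(same.tolist())
# [True, True, True, True, False, False, False, False, False]

# Relative encoding: offset from the 7th to the 8th resolved residue.
rel_author = author[7] - author[6]
rel_true = true[7] - true[6]
print(rel_author, rel_true)  # -72 4
```

In this toy case, every residue after the first chain break receives a different absolute embedding, and the offset across the second break even changes sign.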

I hope this clarifies the matter.
