Question about dataset curation relating to res_idx #13

wanghch123 · 2025-01-10T10:02:38Z

Hi.
I notice that in your processed dataset, every protein's residue indices are taken from "author_res_num", which may give wrong sequence connectivity when applying the index embedding. For example, for PDB entry 5yc8, the real sequence at _entity_poly_seq.num 215-221 is ...-SER-LYS-SER-ARG-ILE-ALA-ASP-..., where SER-ARG-ILE are missing. The author labeled SER-LYS as 213-214 but ALA-ASP as 1001-1002, so there is a huge gap between 214 and 1001 even though only 3 residues separate them in the actual sequence. Moreover, for this protein, labeled_res_num 329-330-... are labeled by the author as 380-381-...; these residues are much further from the fragment -SER-LYS-SER-ARG-ILE- than ALA-ASP (1001-1002) is, yet their residue indices are closer. I wonder whether this embedding encodes wrong positional information and affects the model's performance.
Thanks.

jasonkyuyim · 2025-01-13T19:59:45Z

Hi, is there a chain break, i.e. missing coordinates, where the sequence break is? It's been a while since I plumbed through PDB files, so I'm not sure why we should use one or the other... I couldn't tell from your example whether the relative index distances are the same in the two scenarios. We always re-number the residue indices anyway (https://github.com/jasonkyuyim/multiflow/blob/main/multiflow/data/datasets.py#L86), so this isn't an issue as long as the relative index differences are the same regardless of ordering.

wanghch123 · 2025-01-14T05:41:12Z

Thanks for your quick reply!

Based on my understanding, this issue is typically caused by chain breaks. When there is a chain break, let's denote each missing residue with a dot '.'. Suppose there are two chain breaks; if we use author_res_num, the residue indices for a sequence would appear as follows:

A   A   A   A   .   .   .   .   A    A    A    .   .   .   A   A
15  16  17  18                  101  102  103              31  32

The input tensor would thus be [15, 16, 17, 18, 101, 102, 103, 31, 32]. That is what your saved .pkl looks like.

However, considering the missing residues, the true residue indices should be:

A   A   A   A   .   .   .   .   A   A   A   .   .   .   A   A
15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30

So, the input tensor should be [15, 16, 17, 18, 23, 24, 25, 29, 30].
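To make the difference concrete, here is a minimal sketch that compares the index offsets between consecutive resolved residues under the two numberings. The lists are the toy values from the diagrams above, not values read from a real .pkl, and `neighbor_offsets` is a hypothetical helper written only for this illustration.

```python
# Toy indices from the diagrams above (illustrative only, not from a real .pkl).
author_idx = [15, 16, 17, 18, 101, 102, 103, 31, 32]  # author_res_num style
seq_idx = [15, 16, 17, 18, 23, 24, 25, 29, 30]        # sequence-position style


def neighbor_offsets(idx):
    """Index difference between each pair of consecutive resolved residues."""
    return [b - a for a, b in zip(idx, idx[1:])]


author_offsets = neighbor_offsets(author_idx)
seq_offsets = neighbor_offsets(seq_idx)

print(author_offsets)  # [1, 1, 1, 83, 1, 1, -72, 1]
print(seq_offsets)     # [1, 1, 1, 5, 1, 1, 4, 1]
```

With the author numbering, the two chain breaks show up as offsets of 83 and -72, while the sequence-based numbering gives 5 and 4, which match the number of missing residues plus one.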

It seems that re-numbering only ensures that the residue index starts at 1. Additionally, since we are using absolute embedding (link to node_feature_net.py), I’m uncertain whether this type of input might mislead the model or affect its performance. I believe it affects both absolute position encoding and relative position encoding.
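As a rough sketch of why I think re-numbering alone does not help: below I assume the re-numbering simply shifts all indices so that the smallest becomes 1. `shift_to_one` is a hypothetical stand-in for that step, not the actual code in datasets.py, which may behave differently.

```python
def shift_to_one(idx):
    """Hypothetical re-numbering: shift indices so the smallest one becomes 1.

    This is an assumption about what the re-numbering does, not the
    repository's actual implementation.
    """
    offset = min(idx) - 1
    return [i - offset for i in idx]


# Same toy indices as in the diagrams above (illustrative only).
author_idx = [15, 16, 17, 18, 101, 102, 103, 31, 32]
seq_idx = [15, 16, 17, 18, 23, 24, 25, 29, 30]

shifted_author = shift_to_one(author_idx)
shifted_seq = shift_to_one(seq_idx)

print(shifted_author)  # [1, 2, 3, 4, 87, 88, 89, 17, 18]
print(shifted_seq)     # [1, 2, 3, 4, 9, 10, 11, 15, 16]
```

Under this assumption, a constant shift leaves every pairwise difference unchanged, so the large jumps from the author numbering are still present after re-numbering.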

I hope this clarifies the matter.
