foldseek result2msa a3m database contains unusable, non-significant alignments due to poor choice of cluster representative #402

Open
shiraz-shah opened this issue Dec 17, 2024 · 0 comments

shiraz-shah commented Dec 17, 2024

The fact that foldseek is even able to make MSAs based on the clustering result itself is extremely impressive. However, the following issue makes the alignments unusable:

Expected Behavior

Sequences within result2msa a3m files should align well to each other, since they belong to the same cluster and have been structurally aligned with conservative clustering criteria (e.g. -c 0.8).

Current Behavior

Currently, most alignments in MSAs for some clusters are unusable because they look like this:

>OTU_463_48
MADTYTIQFRRGMYSDFDTSKIRPGEPVAILGNDPSVPSGKALYIAFAANDVRRLCSIEDISEMVNAGEFVGPQGPKGEKGEKGDKGAAGPTGPQGPKG
>CP103806_45    48      0.415   8.734E-04       67      131     354     283     369     393
-------------------------------------------------------------------GEK-GDPGAKGEKgdpGEpgqpgepgtKGEKG
>OTU_624_60     38      0.370   9.605E-01       86      164     354     207     264     356
--------------------------------------------------------------------------------------TAAGVPGERGAPG
>OTU_471_66     38      0.370   9.605E-01       86      164     354     207     264     356
--------------------------------------------------------------------------------------TAAGVPGERGAPG
>OTU_4349_27     22      0.666   4.795E+04       45      53      354     283     291     418
---------------------------------------------AWAANRLRR---------------------------------------------
>OTU_4628_19     22      0.666   4.795E+04       45      53      354     283     291     435
---------------------------------------------AWAANRLRR---------------------------------------------

with very few aligned residues (coverage far below -c 0.8) and e-values well above the specified cutoff (-e 0.001), reaching 4.795E+04 in this example.
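
To put numbers on this, the coverage implied by each header can be computed directly. The sketch below is only an illustration under assumptions: it takes the fields after the sequence name to be score, sequence identity, e-value, query start, query end, query length, target start, target end and target length, with 0-based inclusive coordinates (inferred from the alignment above, where the hit with query start 45 and query end 53 spans 9 residues). The function name `header_coverage` is hypothetical and not part of foldseek.

```python
def header_coverage(header: str) -> dict:
    """Parse one a3m member header; return its e-value and coverages.

    Assumed field order (inferred from the example, not from foldseek docs):
    name, score, seq. identity, e-value, qstart, qend, qlen, tstart, tend, tlen
    """
    fields = header.lstrip(">").split()
    evalue = float(fields[3])
    qstart, qend, qlen, tstart, tend, tlen = map(int, fields[4:10])
    return {
        "name": fields[0],
        "evalue": evalue,
        # Coordinates are treated as inclusive, hence the +1.
        "qcov": (qend - qstart + 1) / qlen,
        "tcov": (tend - tstart + 1) / tlen,
    }


# Three of the member headers from the example above.
headers = [
    ">CP103806_45    48      0.415   8.734E-04       67      131     354     283     369     393",
    ">OTU_624_60     38      0.370   9.605E-01       86      164     354     207     264     356",
    ">OTU_4349_27     22      0.666   4.795E+04       45      53      354     283     291     418",
]

for h in headers:
    r = header_coverage(h)
    print(f"{r['name']}: evalue={r['evalue']:.3E} "
          f"qcov={r['qcov']:.3f} tcov={r['tcov']:.3f}")
```

Under these assumptions, none of the three members covers even a quarter of either the representative or itself, against a requested -c 0.8.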

Steps to Reproduce (for bugs)

  • Perform foldseek clustering with foldseek cluster
  • Create clustering tsv file with foldseek createtsv on clustering result db
  • Create a3m db with foldseek result2msa DB DB DB_C a3m --msa-format-mode 6 on clustering result db
  • Browse a3m db ffdata file with less until one encounters e-values above 1
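
The last step can also be scripted instead of browsing by eye. The following is a minimal sketch, assuming that the a3m data file is plain text, that a NUL byte separates the MSA entries, that the first header of each entry is the cluster representative, and that the e-value is the fourth whitespace-separated field of each member header. The names `flag_bad_members` and `scan_file` are hypothetical helpers, not foldseek functions.

```python
from typing import Iterable, Iterator, Tuple


def flag_bad_members(
    lines: Iterable[str], max_evalue: float = 1e-3
) -> Iterator[Tuple[str, str, float]]:
    """Yield (representative, member, evalue) for members above max_evalue."""
    representative = None
    for line in lines:
        # Assumption: a NUL byte marks the end of one MSA entry.
        if "\0" in line:
            representative = None
            line = line.rsplit("\0", 1)[-1]
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue
        if representative is None:
            # Assumption: the first header of an entry is the representative.
            representative = fields[0]
            continue
        if len(fields) < 4:
            continue
        try:
            evalue = float(fields[3])
        except ValueError:
            continue
        if evalue > max_evalue:
            yield representative, fields[0], evalue


def scan_file(path: str, max_evalue: float = 1e-3) -> None:
    """Print every flagged member of the a3m data file at `path`."""
    with open(path, "r", errors="replace") as handle:
        for rep, member, evalue in flag_bad_members(handle, max_evalue):
            print(f"{rep}\t{member}\t{evalue:.3E}")


# Headers taken from the example in this issue; sequence lines are omitted.
example = [
    ">OTU_463_48\n",
    ">CP103806_45\t48\t0.415\t8.734E-04\t67\t131\t354\t283\t369\t393\n",
    ">OTU_624_60\t38\t0.370\t9.605E-01\t86\t164\t354\t207\t264\t356\n",
    ">OTU_4349_27\t22\t0.666\t4.795E+04\t45\t53\t354\t283\t291\t418\n",
]
flagged = list(flag_bad_members(example))
```

Calling `scan_file` on the a3m data file with `max_evalue=1e-3` would list every member that violates the -e cutoff together with its representative, which makes it easier to see how many clusters are affected.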

Context

One would expect cluster members to match each other better than in the above example, as the e-value cutoff was set to 0.001 and -c was set to 0.8. While it is understandable that not all members of a cluster can match each other equally well, it is still hard to explain why the cluster representative itself matches no other cluster member within the specified thresholds. Selecting a better cluster representative would be a low-effort way to vastly improve alignment quality.

Your Environment

foldseek Version: 9.427df8a
