Highly similar structures are clustered into separate groups or result in error #371

Open
NatureGeorge opened this issue Oct 22, 2024 · 2 comments

@NatureGeorge
Contributor

Expected Behavior

Given a directory containing the PDB files with the following PDB IDs:

8G2V,7UZI

Among them, the chain instances (A, B, C, D, E, F, G, H, I, J) of 8G2V share a nearly identical structure and should therefore be clustered into the same group.

Current Behavior

Each chain instance of 8G2V ends up in its own independent cluster.

Steps to Reproduce (for bugs)

foldseek easy-cluster 8G2V.cif.gz 7UZI.cif.gz result tmp --tmscore-threshold 0.5
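To confirm how the chains were grouped, the cluster assignments can be tallied. This is a minimal sketch, assuming the run writes a two-column, tab-separated file (representative, member) named `result_cluster.tsv` for the `result` prefix. The sample identifiers below are hypothetical; real ones depend on the input file names.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Group member identifiers by their cluster representative."""
    grouped = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        grouped[representative].append(member)
    return dict(grouped)


# Hypothetical sample content standing in for result_cluster.tsv.
sample = (
    "8G2V_A\t8G2V_A\n"
    "8G2V_B\t8G2V_B\n"
    "7UZI_A\t7UZI_A\n"
    "7UZI_A\t7UZI_B\n"
)

clusters = read_clusters(io.StringIO(sample))
for representative, members in clusters.items():
    print(representative, len(members), ",".join(members))
```

On a real run, the sample could be replaced with `open("result_cluster.tsv", newline="")`. Ten single-member clusters for the 8G2V chains would correspond to the behavior reported here.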

Context

A separate problem appears when clustering 8G2V on its own:

foldseek easy-cluster 8G2V.cif.gz result tmp --tmscore-threshold 0.5

giving:

easy-cluster 8G2V.cif.gz result tmp --tmscore-threshold 0.5

MMseqs Version:                         9.427df8a
Substitution matrix                     aa:3di.out,nucl:3di.out
Seed substitution matrix                aa:3di.out,nucl:3di.out
Sensitivity                             4
k-mer length                            0
Target search mode                      0
k-score                                 seq:2147483647,prof:2147483647
Max sequence length                     65535
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Coverage threshold                      0
Coverage mode                           0
Compositional bias                      1
Compositional bias                      1
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                1
Minimum diagonal score                  30
Selected taxa
Spaced k-mers                           1
Preload mode                            0
Spaced k-mer pattern
Local temporary path
Threads                                 20
Compressed                              0
Verbosity                               3
TMscore threshold                       0.5
LDDT threshold                          0
Sort by structure bit score             1
Alignment type                          2
Exact TMscore                           0
Add backtrace                           false
Alignment mode                          0
Alignment mode                          0
E-value threshold                       10
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Max reject                              2147483647
Max accept                              2147483647
Gap open cost                           aa:10,nucl:10
Gap extension cost                      aa:1,nucl:1
TMalign hit order                       0
TMalign fast                            1
Cluster mode                            0
Max connected component depth           1000
Similarity type                         2
Weight file name
Cluster Weight threshold                0.9
Single step clustering                  false
Cascaded clustering steps               3
Cluster reassign                        false
Remove temporary files                  true
Force restart with latest tmp           false
MPI runner
k-mers per sequence                     21
Scale k-mers per sequence               aa:0.000,nucl:0.200
Adjust k-mer length                     false
Shift hash                              67
Include only extendable                 false
Skip repeating k-mers                   false
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Path to ProstT5
Chain name mode                         0
Write mapping file                      0
Mask b-factor threshold                 0
Coord store mode                        2
Write lookup file                       1
Input format                            0
File Inclusion Regex                    .*
File Exclusion Regex                    ^$

cluster tmp/7126666531623036926/input tmp/7126666531623036926/clu tmp/7126666531623036926/clu_tmp --tmscore-threshold 0.5 --remove-tmp-files 1

Set cluster sensitivity to -s 8.000000
Set cluster mode SET COVER
Set cluster iterations to 3
tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ca exists and will be overwritten
createsubdb tmp/7126666531623036926/clu_tmp/4050237725070610072/clu_redundancy tmp/7126666531623036926/input_ca tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ca -v 3 --subdb-mode 1

Time for merging to input_step_redundancy_ca: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
prefilter tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ss tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ss tmp/7126666531623036926/clu_tmp/4050237725070610072/pref_step0 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 1 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 100 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 0 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 20 --compressed 0 -v 3

Query database size: 10 type: Aminoacid
Estimated memory consumption: 977M
Target database size: 10 type: Aminoacid
Index table k-mer threshold: 154 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 10 0s 0ms
Index table: Masked residues: 0
No k-mer could be extracted for the database tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ss.
Maybe the sequences length is less than 14 residues.
Error: Prefilter step 0 died
Error: Search died
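The message about sequences shorter than 14 residues suggests checking the chain lengths in the input. The following is a rough sketch, assuming the mmCIF file has a standard `_atom_site` loop with `group_PDB`, `label_atom_id`, and `auth_asym_id` columns and whitespace-separated rows. It ignores multiple models and alternate locations, so counts may be inflated for such entries. The sample text is made up for illustration, and whether chain length is actually the cause of this error is not confirmed.

```python
from collections import defaultdict


def ca_counts_per_chain(cif_text):
    """Count C-alpha atoms per chain in the _atom_site loop of an mmCIF string."""
    columns = []
    counts = defaultdict(int)
    in_atom_site = False
    for line in cif_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("_atom_site."):
            # Header line: remember the column name after the category prefix.
            columns.append(stripped.split(".", 1)[1])
            in_atom_site = True
            continue
        if not in_atom_site:
            continue
        if not stripped or stripped.startswith(("#", "loop_", "_")):
            break  # the atom_site loop has ended
        fields = stripped.split()
        if len(fields) != len(columns):
            continue  # row does not match the header, skip it
        record = dict(zip(columns, fields))
        # Only polymer ATOM records; this excludes e.g. calcium ions named CA.
        if record.get("group_PDB") == "ATOM" and record.get("label_atom_id") == "CA":
            counts[record.get("auth_asym_id", "?")] += 1
    return dict(counts)


# Made-up miniature mmCIF fragment for illustration only.
sample_cif = """
data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.auth_asym_id
_atom_site.label_seq_id
ATOM 1 N GLY A 1
ATOM 2 CA GLY A 1
ATOM 3 CA ALA A 2
ATOM 4 CA SER B 1
HETATM 5 CA CA C .
#
"""

counts = ca_counts_per_chain(sample_cif)
# 14 is the length mentioned in the error message above.
short_chains = {chain: n for chain, n in counts.items() if n < 14}
print(counts)
print(short_chains)
```

For a compressed input such as `8G2V.cif.gz`, the text could be read with `gzip.open(path, "rt").read()` before calling the function.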

Your Environment

  • Which foldseek version was used (Statically-compiled, self-compiled, Conda, etc.): conda 9.427df8a
@hughhigin

The two things I suspect (I'm not on the team but have had similar issues) are the prefiltering step and the very low TM-score threshold you are using.

The prefiltering step shortlists similar proteins before the full alignment to save time, and its criteria can differ from the threshold you want to cluster at. Since you're still working with a small number of proteins, you can just add '--exhaustive-search' to skip this step entirely, though it may become quite slow if you move to larger datasets.

In this case you should probably also use a much stricter TM-score cutoff. I would start with something like 0.8, which I think is a reasonable threshold for homologous proteins, but you'll probably need some trial and error.

One last thought: check the qtmscore and ttmscore output. The proteins in the first PDB are quite short, so you might get the signal you want by normalizing the TM-score by one protein rather than the other.
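The effect of the normalization choice can be inspected per pair. This is a minimal sketch, assuming a tab-separated alignment table whose four columns are query, target, qtmscore, and ttmscore (for example, one requested through a custom output format in a search run; that setup is an assumption and not shown in this thread). The identifiers and scores below are hypothetical.

```python
import csv
import io


def compare_normalizations(handle, threshold=0.5):
    """Report whether each pair passes the threshold under either normalization."""
    rows = []
    for query, target, qtm, ttm in csv.reader(handle, delimiter="\t"):
        qtm, ttm = float(qtm), float(ttm)
        rows.append({
            "pair": (query, target),
            "qtmscore": qtm,  # TM-score normalized by the query length
            "ttmscore": ttm,  # TM-score normalized by the target length
            "passes_min": min(qtm, ttm) >= threshold,
            "passes_max": max(qtm, ttm) >= threshold,
        })
    return rows


# Hypothetical scores: a short chain aligned to a much longer one, then two similar chains.
sample = (
    "8G2V_A\t7UZI_A\t0.82\t0.31\n"
    "8G2V_A\t8G2V_B\t0.95\t0.94\n"
)

rows = compare_normalizations(io.StringIO(sample), threshold=0.5)
for row in rows:
    print(row["pair"], row["qtmscore"], row["ttmscore"],
          row["passes_min"], row["passes_max"])
```

A pair that passes only under one normalization, like the first hypothetical row, is the case where the choice of reference protein changes the outcome.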

@NatureGeorge
Contributor Author

Hi @hughhigin. Thank you so much for your detailed explanation and helpful tips.
