Description
Describe the bug
When template matching returns templates with very short sequences, realigning them against the query with kalign2 fails with an error.
To Reproduce
Example query input
{
  "queries": {
    "query1": {
      "chains": [
        {
          "molecule_type": "protein",
          "chain_ids": ["A"],
          "sequence": "AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR"
        },
        {
          "molecule_type": "protein",
          "chain_ids": ["B"],
          "sequence": "KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAEGADITIILS"
        }
      ]
    }
  }
}
Expected behavior
Sequence alignment against short template sequences should be handled without raising an error.
Stack trace
IndexError Traceback (most recent call last)
Cell In[26], line 9
6 template_sequence = "GC"
8 parser = A3mParser(max_sequences=None)
----> 9 parser(
10 f">query_X/1-{len(query_seq_str)}\n{query_seq_str}\n>{template_entry_id}/{1}-{len(template_sequence)}\n{template_sequence}\n",
11 query_seq_str,
12 realign=True,
13 )
File ~/github/openfold3/openfold3/core/data/io/sequence/template.py:739, in A3mParser.__call__(self, alignment_source, query_seq_str, realign)
734 realigned_str = run_kalign(all_sequences)
735 realigned_alignments, _ = parse_fasta(realigned_str)
737 return self._process_alignment_hits(
738 query_seq_str=query_seq_str,
--> 739 query_aln_str=realigned_alignments[0],
740 template_alignments=realigned_alignments[1:],
741 headers=headers, # Use original headers for metadata
742 )
IndexError: list index out of range
Configuration (please complete the following information):
kalign2 installed with mamba, e.g. mamba install kalign2 -c bioconda
Additional context
The requested templates include /tmp/of3_template_data/template_structures/1rnb.cif, which has two sequences:
chain A: 'GC'
chain B: AQVINTFDGVADYLQTYHKLPNDYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
I'm not sure why the template alignment is for chain A rather than the closer chain B. Perhaps there is a mismatch of sequences in the colabfold database for PDB:1RNB, but I haven't examined this further.
The issue can be reproduced in isolation at the step where the template preprocessor realigns the query sequence against the sequence parsed from the matched template structure.
Minimal code to reproduce template alignment error
from openfold3.core.data.primitives.structure.metadata import get_asym_id_to_canonical_seq_dict
from openfold3.core.data.io.sequence.template import A3mParser

# Query chain A from the example input above
query_seq_str = "AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR"
template_entry_id = "dummy_A"
# Two-residue template sequence (chain A of 1rnb.cif) that triggers the failure
template_sequence = "GC"

parser = A3mParser(max_sequences=None)
# Realign the query against the short template; raises the IndexError shown in the stack trace
parser(
    f">query_X/1-{len(query_seq_str)}\n{query_seq_str}\n>{template_entry_id}/{1}-{len(template_sequence)}\n{template_sequence}\n",
    query_seq_str,
    realign=True,
)
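Whichever aligner is used, the bare IndexError in the stack trace could be replaced by a clearer failure if the realignment result were validated before it is indexed. The following is only a minimal sketch, assuming the parsed realignment is a list of aligned strings with the query first. `split_realigned_alignments` is a hypothetical helper and not part of openfold3.

```python
from __future__ import annotations


def split_realigned_alignments(
    realigned: list[str], n_expected: int
) -> tuple[str, list[str]]:
    """Split parsed realignment output into (query, templates).

    Hypothetical helper: raises a descriptive error when the aligner
    returned fewer or more sequences than were submitted, instead of
    failing later with an IndexError.
    """
    if len(realigned) != n_expected:
        raise ValueError(
            f"Realignment returned {len(realigned)} sequences, "
            f"expected {n_expected}; the aligner output may be empty "
            "or in an unexpected format."
        )
    # Assumption: the query is always the first aligned sequence.
    return realigned[0], realigned[1:]
```

A caller could catch the ValueError and skip the offending template rather than aborting the whole query; whether skipping is the desired behavior is a design decision for the maintainers.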
Possible workaround
Upgrade the kalign dependency to kalign3, which has better support for alignment with short sequences.
A naive upgrade (uninstalling kalign2 and installing kalign3) suggests that core.data.io.sequence.fasta.parse_fasta will need an update to handle the new header that kalign3 writes ahead of the alignment.
Example of new kalign3 header
Kalign (3.4.0)
Copyright (C) 2006,2019,2020,2021,2023 Timo Lassmann
This program comes with ABSOLUTELY NO WARRANTY; for details type:
`kalign -showw'.
This is free software, and you are welcome to redistribute it
under certain conditions; consult the COPYING file for details.
Please cite:
Lassmann, Timo.
"Kalign 3: multiple sequence alignment of large data sets."
Bioinformatics (2019)
https://doi.org/10.1093/bioinformatics/btz795
[2026-01-27 17:07:58] : LOG : Detected protein sequences.
[2026-01-27 17:07:58] : WARNING : -------------------------------------------- (/opt/mambaforge/envs/bioconda/conda-bld/kalign3_1733863649254/work/lib/src/msa_op.c line 522)
[2026-01-27 17:07:58] : WARNING : All input sequences have the same length. (/opt/mambaforge/envs/bioconda/conda-bld/kalign3_1733863649254/work/lib/src/msa_op.c line 523)
[2026-01-27 17:07:58] : WARNING : BUT there are no gap characters. (/opt/mambaforge/envs/bioconda/conda-bld/kalign3_1733863649254/work/lib/src/msa_op.c line 524)
[2026-01-27 17:07:58] : WARNING : (/opt/mambaforge/envs/bioconda/conda-bld/kalign3_1733863649254/work/lib/src/msa_op.c line 525)
[2026-01-27 17:07:58] : WARNING : Unable to determine whether the sequences (/opt/mambaforge/envs/bioconda/conda-bld/kalign3_1733863649254/work/lib/src/msa_op.c line 526)
[2026-01-27 17:07:58] : WARNING : are already aligned. (/opt/mambaforge/envs/bioconda/conda-bld/kalign3_1733863649254/work/lib/src/msa_op.c line 527)
[2026-01-27 17:07:58] : WARNING : Kalign will align the sequences. (/opt/mambaforge/envs/bioconda/conda-bld/kalign3_1733863649254/work/lib/src/msa_op.c line 528)
[2026-01-27 17:07:58] : WARNING : -------------------------------------------- (/opt/mambaforge/envs/bioconda/conda-bld/kalign3_1733863649254/work/lib/src/msa_op.c line 529)
[2026-01-27 17:07:58] : LOG : Read 2 sequences from standard input.
[2026-01-27 17:07:58] : LOG : CPU Time: 0.00u 00:00:00.00 Elapsed: 00:00:00.00
[2026-01-27 17:07:58] : LOG : Calculating pairwise distances
[2026-01-27 17:07:58] : LOG : CPU Time: 0.00u 00:00:00.00 Elapsed: 00:00:00.00
[2026-01-27 17:07:58] : LOG : Building guide tree.
[2026-01-27 17:07:58] : LOG : CPU Time: 0.00u 00:00:00.00 Elapsed: 00:00:00.00
[2026-01-27 17:07:58] : LOG : Aligning
[2026-01-27 17:07:58] : LOG : CPU Time: 0.00u 00:00:00.00 Elapsed: 00:00:00.00
>query
AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNRE
GKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
>1b27_B
AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNRE
GKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
Same alignment in kalign2
>query
AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNRE
GKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
>1b27_B
AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNRE
GKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR