Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Covalent ligands: (non)leaving atoms and unrealistic bond lengths #159

Closed
alephreish opened this issue Dec 2, 2024 · 23 comments
Closed
Labels
question Further information is requested

Comments

@alephreish
Copy link

alephreish commented Dec 2, 2024

I'm trying to fold proteins with covalent ligands and face the following problems:

  • _chem_comp_atom.pdbx_leaving_atom_flag is ignored, such that the resulting structures contain all atoms of the ligand including those should have been removed - this can be remedied by removing them beforehand
  • the resulting bond lengths are incorrect (see below) - I haven't found a remedy for this

A patched version of helixfold3 also supports covalent bonds: it suffers from the first, less severe problem but it does produce realistic bond lengths.

I'm playing with bacteriorhodopsin as a test case. The leaving atom of the RET ligand is O1 and the linking atom is C15 which covalently binds to the NZ atom of a specific lysine residue. The affected bonds are: "C15 - NZ" which is the bond between the ligand and the linking atom of the amino acid and "NZ - CE" which is the bond between the linking atom of the amino acid and the next carbon of the side chain (see the snapshots below). The resulting bond lengths are as follows:

C15 - NZ NZ - CE structure
5ZIM 1.350 1.482 ground truth
HF3 1.228 1.497 helixfold3 with the O1 atom removed
AF3 1.110 0.960 alphafold3 with the O1 atom removed
AF3 (with O1) 1.143 0.968 alphafold with the O1 in place (the default behavior of AF3)

As can be seen, both bonds are affected but the bond within the side chain of the amino acid ("NZ - CE") is affected even more which is really weird.

Below are snapshots from pymol. The ligand is in red, the protein in orange and the linking atom of the lysine residue is in blue. Notice that , the structure of ligand+lysine is much closer to reality in the case of helixfold3. (Notice that pymol automatically connects the atoms C15 and CE in the AF3 structures because they are way too close to each either. Weird things happen with other software as well: e.g. if you try to convert the structures with Open Babel, it incorrectly interprets the "NZ - CE" bond in the AF3 structures as double):

5ZIM HF3
5zim hf3
AF3 AF3 with O1
af3 af3_O1

Attached is AF3.json which has userCCD containing the ligand with the leaving atoms (O1) and hydrogens removed. For the default behavior of AF3, remove userCCD.

@Augustin-Zidek Augustin-Zidek added the question Further information is requested label Dec 2, 2024
@alephreish
Copy link
Author

alephreish commented Dec 2, 2024

Here is a different test case: 5AQD, PEB bound to cystein (no atoms leaving).

CAA - SG SG - CB
5AQD 1.801 1.819
HF3 1.775 1.827
AF3 1.773 1.345

"CAA - SG" is the bond between the ligand and Cys82 and "SG - CB" is the bond between the linking sulphur atom and the next carbon atom in the side chain. The "CAA - SG" bond is reasonably close to the expectation but the "SG - CB" is missing ~0.5A - roughly the same difference as in the case of the "NZ - CE" bond above.

@sunshine20241203
Copy link

Hi, I am recently performing AF3 for covalent interaction prediction, could you share that how you build your CCD file for you ligand? Mannually or by some tool?

@alephreish
Copy link
Author

All of the ligands I'm interested in are either available in ccd or can be derived from those. For the task of removing the leaving atoms, I use pdbeccdutils.

@sunshine20241203
Copy link

sunshine20241203 commented Dec 3, 2024 via email

@alephreish

This comment was marked as off-topic.

@sunshine20241203
Copy link

sunshine20241203 commented Dec 3, 2024 via email

@YoavShamir5
Copy link

I am getting similar results in terms of unrealistic bond lengths. In the following example, I am predicting a covalent complex (PDB id: 5P9J). In my case, I downloaded the cif of the covalent ligand, and then provided it as a user-provided CCD. The issue is similar to what @alephreish described - the length of the bond within the cysteine sidechain (CB-SG bond) is predicted with an unrealistic length of 1.1A. The covalent attachment bond (connecting cysteine SG with atom CAA of the ligand) has a reasonable length. I should note that the prediction complex and the ground truth overlay very well in terms of RMSD (see figure), so it is possible that this is simply the performance of AF3 on covalent complexes - good RMSD, with a single unrealistic bond length:

image

@joshabramson
Copy link
Collaborator

Thanks for these detailed reports.

AF3 only handles leaving atoms properly for Glycans - for general polymer-ligand covalent bonds we did not drop the leaving atoms from the predictions during training or inference. Because during training no position losses were applied for the excess atom, and in some cases the leaving atom is not inlcuded in the molecule definition, it is not too surprising that the positions of the other non-leaving atoms are not greatly affected by inclusion of exclusion of the leaving atom. Overall, manually removing the atom after running the prediction is probably fine.

Regarding the bad bond lengths around the covalent bond, this might be resolvable by running a short relaxation or idealisation procedure on the output. We tried this by manually removing the O1 atom and idealising in REFMAC5 and it resolved the issue (note that we had to convert the output from cif to pdb format for it to work in REFMAC5).

@alephreish
Copy link
Author

@joshabramson Thanks for the response. Sure, minimization does fix the bonds (we've doing it with gromacs). A fix on the side of alphafold3 itself would be cleaner though.

@joshabramson
Copy link
Collaborator

Apologies, we do not currently plan to make model upgrades for this repository.

@joshabramson
Copy link
Collaborator

However, one can try the following to resolve things without changing the model: define a custom ccd code in the input and use that as a modified residue ptm. Two things to try:

  1. Replace the bonded LYS residue by a custom version that is actually identical to the original (must use a different non-existing code, e.g. LYS-mod) - this will force the model to view the residue as a modified residue and thus use one model 'token' per atom, rather than putting the whole residue into one model token. This might affect the quality of the bond.

  2. Replace the whole LYS+RET combination by a single large molecule and set that as the modified residue (replacing the original LYS). The covalent bond will need to be internal to the molecule definition, and the O1 atom should not be present.

@joshabramson joshabramson reopened this Dec 7, 2024
@alephreish
Copy link
Author

alephreish commented Dec 7, 2024

@joshabramson The second option (to represent the entire part of residue+ligand as a single modified residue using modifications) works correctly: the bond lengths are fairly close to the ground truth. Thanks a lot for the idea! This is a very straightforward approach for ligands covalently bound to a single residue. For ligand moieties attached to multiple residues simultaneously (e.g. /5aqd/KC/R/PEB`188/ in 5AQD), one has to explore how modifications plays out with bondedAtomPairs.

@YoavShamir5 and anyone else interested in implementing this:

Some residue+ligand combinations are directly available in ccd (this is the case e.g. with LYS+RET=LYR). alphafold3 does remove the leaving atom (OXT) in this case. In other cases: If some existing structures have the ligand of interest, the easiest is probably to take the coordinates and the bonds for the ligand and the residue from the real structures. Otherwise, one has to connect the ligand with its residue manually or programmatically.

@YoavShamir5
Copy link

Thanks a lot, @alephreish and @joshabramson! Really helpful.
Regarding the second method suggested - I wonder if this enables one to run MSA and template search (run_data_pipeline) just once for a protein, and then predict (run_inference) for multiple covalent ligands using those results. I guess one can use the unmodified protein sequence for MSA, and then for each ligand of interest, one can modify the json generated by run_data_pipeline (manually modifying the "modifications" field), as input to the inference step. Does this sound reasonable? Or should residue modifications be included in the data pipeline (in which case - the data step should be run for each covalent ligand of choice).

@joshabramson
Copy link
Collaborator

joshabramson commented Dec 8, 2024

Note that msa and templates can be given directly as input via the input json, which stops automatic computation of the msa/templates, so one only needs to compute them once then can reuse.

However there are currently issues with msa for modified residues that may get in the way, see #54

@YoavShamir5
Copy link

@joshabramson thanks a lot for your input, I am not sure I am totally clear about your point - I suggested running run_inference on the unmodified sequence, and then modifying the generated output json by mentioning a different modification each time we run inference on a new ligand. Is this what you suggest to do in cases where multiple covalent ligands are to be predicted with the same protein?

@joshabramson
Copy link
Collaborator

Yes, you can use the output from the first inference to give you msa and templates for future runs where you are only switching one modified residue in the input.

@alephreish
Copy link
Author

I've been checking out the first option (to force the ligand-binding residue to be represented as atoms by renaming it with modifications and to connect it to the ligand with bondedAtomPairs) as it would be more flexible. From my very preliminary experiments, it seems to work to some degree: The bond lengths are only a bit shorter but the angles between the bonds around the residue-ligand connection are off. I'm planning to stick with the second option, but if someone wants to continue exploring this: to avoid #54 (before it is properly fixed), use a residue name in "ptmType" which maps to the original amino acid in CCD_NAME_TO_ONE_LETTER (e.g. for arginine, pick e.g. "modifications": [ { "ptmType": "AR7" } ] and re-define AR7 in userCCD).

@Augustin-Zidek
Copy link
Collaborator

@alephreish I fixed #54 today.

@YoavShamir5
Copy link

Regarding the second option (user-provided CCD of the ligand + residue) - is there any documentation regarding preparation of this CCD? I guess it is the same as the general documentation for user-provided CCDs, but for example, is it important for atom names of the residues to be preserved (residue's CB must be named CB etc.)?

@alephreish
Copy link
Author

I personally use pymol API (e.g. to fuse the ligand with the amino acid or to remove leaving atoms) but the cif it outputs is oversimplified. To make to it work, I have to combine it with gemmi to update the cif with new coordinates, atoms and bonds. There are probably better ways of doing it.

I do not think atom names matter at all as long as the elements are correct. They might be important for downstream analyses though. E.g. in my pipeline, I have a post-processing step which separates the ligand from its amino acid into a residue of its own, at which point I also rename the atoms of the ligand if needed.

@YoavShamir5
Copy link

Thanks a lot for the advice, @alephreish! Finally, do you know if the user-provided PTM-residue should be in its original form prior to formation of the polypeptide (e.g. it should include the OH group) or not? I guess that it should, because CCD residue entries do have this form. In any case, I'd recommend to the AF3 team to add a few lines in the documentation for users interested in user-provided PTMs.

@alephreish
Copy link
Author

@YoavShamir5 The modified residue should have the N, CA, C, O and OXT atoms, the rest is up to you. The bond with the ligand should be as you want it to be. If it might be interesting to you or anyone else, I can release the code I use to fuse the ligands with the amino acids and convert it to ccd cif.

@YoavShamir5
Copy link

Thanks a lot @alephreish, I can generate the fused residue-ligand on my side. Regarding minimization of the original covalent option via GROMACS or REFMAC5 - did you enforce distance restraints between the covalently bound atom of the ligand and the covalent residue of the protein?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants