Commit 5797f87

update readme, add disjoint axioms, make confidence optional
1 parent 8f61ff7 commit 5797f87

File tree

6 files changed: +549 −31 lines changed

README.md

Lines changed: 58 additions & 30 deletions
@@ -1,5 +1,5 @@
 # python-chebifier
-An AI ensemble model for predicting chemical classes.
+An AI ensemble model for predicting chemical classes in the ChEBI ontology.
 
 ## Installation
 
@@ -23,39 +23,18 @@ The package provides a command-line interface (CLI) for making predictions using
 python -m chebifier.cli --help
 
 # Make predictions using a configuration file
-python -m chebifier.cli predict example_config.yml --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" "C1=CC=C(C=C1)C(=O)O"
+python -m chebifier.cli predict configs/example_config.yml --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" "C1=CC=C(C=C1)C(=O)O"
 
 # Make predictions using SMILES from a file
-python -m chebifier.cli predict example_config.yml --smiles-file smiles.txt
+python -m chebifier.cli predict configs/example_config.yml --smiles-file smiles.txt
 ```
 
 ### Configuration File
 
-The CLI requires a YAML configuration file that defines the ensemble model. Here's an example:
-
-```yaml
-# Example configuration file for Chebifier ensemble model
-
-# Each key in the top-level dictionary is a model name
-model1:
-  # Required: type of model (must be one of the keys in MODEL_TYPES)
-  type: electra
-  # Required: name of the model
-  model_name: electra_model1
-  # Required: path to the checkpoint file
-  ckpt_path: /path/to/checkpoint1.ckpt
-  # Required: path to the target labels file
-  target_labels_path: /path/to/target_labels1.txt
-  # Optional: batch size for predictions (default is likely defined in the model)
-  batch_size: 32
-
-model2:
-  type: electra
-  model_name: electra_model2
-  ckpt_path: /path/to/checkpoint2.ckpt
-  target_labels_path: /path/to/target_labels2.txt
-  batch_size: 64
-```
+The CLI requires a YAML configuration file that defines the ensemble model. An example can be found in `configs/example_config.yml`.
+
+The models and other required files are trained / generated by our [chebai](https://github.com/ChEB-AI/python-chebai) package.
+Examples for models can be found on [kaggle](https://www.kaggle.com/datasets/sfluegel/chebai).
 
 ### Python API
 
@@ -77,10 +56,59 @@ smiles_list = ["CC(=O)OC1=CC=CC=C1C(=O)O", "C1=CC=C(C=C1)C(=O)O"]
 predictions = ensemble.predict_smiles_list(smiles_list)
 
 # Print results
-for smile, prediction in zip(smiles_list, predictions):
-    print(f"SMILES: {smile}")
+for smiles, prediction in zip(smiles_list, predictions):
+    print(f"SMILES: {smiles}")
     if prediction:
         print(f"Predicted classes: {prediction}")
     else:
         print("No predictions")
 ```
+
+### The ensemble
+
+Given a sample (i.e., a SMILES string) and models $m_1, m_2, \ldots, m_n$, the ensemble works as follows:
+1. Get predictions from each model $m_i$ for the sample.
+2. For each class $c$, aggregate the predictions $p_c^{m_i}$ from all models that made a prediction for that class.
+   The aggregation happens separately for all positive predictions (i.e., $p_c^{m_i} \geq 0.5$) and all negative predictions
+   ($p_c^{m_i} < 0.5$). If the aggregated value is larger for the positive predictions than for the negative predictions,
+   the ensemble makes a positive prediction for class $c$:
+
+$$
+\text{ensemble}(c) = \begin{cases}
+1 & \text{if } \sum_{i: p_c^{m_i} \geq 0.5} [\text{confidence}_c^{m_i} \cdot \text{model\_weight}_{m_i} \cdot \text{trust}_c^{m_i}] > \sum_{i: p_c^{m_i} < 0.5} [\text{confidence}_c^{m_i} \cdot \text{model\_weight}_{m_i} \cdot \text{trust}_c^{m_i}] \\
+0 & \text{otherwise}
+\end{cases}
+$$
+
+Here, confidence is the model's (self-reported) confidence in its prediction, calculated as
+$$
+\text{confidence}_c^{m_i} = 2|p_c^{m_i} - 0.5|
+$$
+For example, if a model makes a positive prediction with $p_c^{m_i} = 0.55$, the confidence is $2|0.55 - 0.5| = 0.1$:
+the model is not very confident in its prediction and is very close to switching to a negative one.
+If another model is very sure about its negative prediction with $p_c^{m_j} = 0.1$, the confidence is $2|0.1 - 0.5| = 0.8$.
+Therefore, if in doubt, we put more weight on the confident negative prediction.
+
+Confidence can be disabled via the `use_confidence` parameter of the predict method (default: True).
+
+The model weight can be set for each model in the configuration file (default: 1). It is used to favor a certain
+model independently of any given class.
+Trust is based on a model's performance on a validation set: after training, we evaluate the machine learning models
+on a validation set for each class. If the `ensemble_type` is set to `wmv-f1`, the trust is calculated as 1 + the F1 score.
+If it is set to `mv` (the default), the trust is set to 1 for all models.
+
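To make the voting rule concrete before moving on to step 3, here is a minimal Python sketch of the decision for a single class. It is an illustration, not the package's implementation (the real code in `consolidate_predictions` operates on torch tensors over all classes at once); probabilities, model weights, and trusts are assumed to be given as plain lists.

```python
def ensemble_decision(probs, model_weights, trusts, use_confidence=True):
    """Weighted vote for a single class c; probs[i] is model i's p_c."""
    pos, neg = 0.0, 0.0
    for p, weight, trust in zip(probs, model_weights, trusts):
        confidence = 2 * abs(p - 0.5) if use_confidence else 1.0
        vote = confidence * weight * trust
        if p >= 0.5:
            pos += vote  # positive prediction (p >= 0.5)
        else:
            neg += vote  # negative prediction (p < 0.5)
    return 1 if pos > neg else 0

# The worked example from above: a weakly positive model (0.55) is
# outvoted by a confidently negative one (0.1) at equal weight and trust.
print(ensemble_decision([0.55, 0.1], [1, 1], [1, 1]))  # -> 0
```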
+3. After a decision has been made for each class independently, the consistency of the predictions with regard to the
+   ChEBI hierarchy and disjointness axioms is checked (see the sketch after this list). This is done in 3 steps:
+   - (1) First, the hierarchy is corrected. For each pair of classes $A$ and $B$ where $A$ is a subclass of $B$ (following
+     the is-a relation in ChEBI), we set the ensemble prediction of $B$ to 1 if the prediction of $A$ is 1. Intuitively
+     speaking, if we have determined that a molecule belongs to a specific class (e.g., aromatic primary alcohol), it also
+     belongs to the direct and indirect superclasses (e.g., primary alcohol, aromatic alcohol, alcohol).
+   - (2) Next, we check for disjointness. This is not specified directly in ChEBI, but in an additional ChEBI module ([chebi-disjoints.owl](https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/)).
+     We have extracted these disjointness axioms into a CSV file and added some more disjointness axioms ourselves (see
+     `data/disjoint_chebi.csv` and `data/disjoint_additional.csv`). If two classes $A$ and $B$ are disjoint and we predict
+     both, we select one of them randomly and set the other to 0.
+   - (3) Since the second step might have introduced new inconsistencies into the hierarchy, we repeat the first step, but
+     with a small change: for a pair of classes $A \subseteq B$ with predictions $1$ and $0$, instead of setting $B$ to $1$,
+     we now set $A$ to $0$. This has the advantage that we cannot introduce new disjointness inconsistencies and don't have
+     to repeat step 2.
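The three correction steps can be illustrated with a short Python sketch. This is a simplification, not the package's implementation: the subclass relation is assumed to be given as a child-to-parents mapping and the disjointness axioms as a list of ID pairs, which the real code derives from ChEBI and the CSV files named above.

```python
import random

def check_consistency(pred, parents, disjoint_pairs):
    """Make a {class_id: 0 or 1} prediction dict consistent (sketch)."""
    def superclasses(c):
        # All direct and indirect superclasses via the is-a relation.
        stack, seen = list(parents.get(c, [])), set()
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(parents.get(p, []))
        return seen

    # (1) Upward correction: a positive class makes its superclasses positive.
    for c in [c for c, v in pred.items() if v == 1]:
        for p in superclasses(c):
            pred[p] = 1
    # (2) Disjointness: if both classes of a disjoint pair are positive,
    #     keep one at random and set the other to 0.
    for a, b in disjoint_pairs:
        if pred.get(a) == 1 and pred.get(b) == 1:
            pred[random.choice([a, b])] = 0
    # (3) Downward correction: a class whose superclass is now negative is
    #     set to negative itself, so step 2 cannot be violated again.
    for c in list(pred):
        if pred[c] == 1 and any(pred.get(p) == 0 for p in superclasses(c)):
            pred[c] = 0
    return pred
```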

chebifier/cli.py

Lines changed: 2 additions & 1 deletion

@@ -26,6 +26,7 @@ def cli():
 @click.option('--output', '-o', type=click.Path(), help='Output file to save predictions (optional)')
 @click.option('--ensemble-type', '-e', type=click.Choice(ENSEMBLES.keys()), default='mv', help='Type of ensemble to use (default: Majority Voting)')
 @click.option("--chebi-version", "-v", type=int, default=241, help="ChEBI version to use for checking consistency (default: 241)")
-def predict(config_file, smiles, smiles_file, output, ensemble_type, chebi_version):
+@click.option("--use-confidence/--no-use-confidence", "-c", default=True, help="Weight predictions based on how 'confident' a model is in its prediction (default: True)")
+def predict(config_file, smiles, smiles_file, output, ensemble_type, chebi_version, use_confidence):
     """Predict ChEBI classes for SMILES strings using an ensemble model.
 
chebifier/ensemble/base_ensemble.py

Lines changed: 4 additions & 1 deletion
@@ -78,7 +78,10 @@ def consolidate_predictions(self, predictions, classwise_weights, **kwargs):
         positive_mask = (predictions > self.positive_prediction_threshold) & valid_predictions
         negative_mask = (predictions < self.positive_prediction_threshold) & valid_predictions
 
-        confidence = 2 * torch.abs(predictions.nan_to_num() - self.positive_prediction_threshold)
+        if kwargs.get("use_confidence", True):
+            confidence = 2 * torch.abs(predictions.nan_to_num() - self.positive_prediction_threshold)
+        else:
+            confidence = torch.ones_like(predictions)
 
         # Extract positive and negative weights
         pos_weights = classwise_weights[0]  # Shape: (num_classes, num_models)
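With `use_confidence=False`, every vote collapses to `model_weight * trust` alone, i.e., plain (weighted) majority voting. A hedged usage sketch, assuming the keyword is forwarded from the predict method down to `consolidate_predictions` (only `predict_smiles_list` itself is confirmed by the README):

```python
# Confidence-weighted voting (the default):
predictions = ensemble.predict_smiles_list(smiles_list)

# Plain weighted majority voting, ignoring how close each p_c is to 0.5
# (assumes use_confidence is passed through to consolidate_predictions):
predictions = ensemble.predict_smiles_list(smiles_list, use_confidence=False)
```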

configs/example_config.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+
+chemlog_peptides:
+  type: chemlog
+  model_weight: 100  # if chemlog is available, it always gets chosen
+my_resgated:
+  type: resgated
+  ckpt_path: my_resgated.ckpt  # checkpoint trained with chebai
+  target_labels_path: ../python-chebai/data/chebi_v241/ChEBI50/processed/classes.txt  # from the chebai dataset
+  molecular_properties:  # list of properties used during training
+    - chebai_graph.preprocessing.properties.AtomType
+    - chebai_graph.preprocessing.properties.NumAtomBonds
+    - chebai_graph.preprocessing.properties.AtomCharge
+    - chebai_graph.preprocessing.properties.AtomAromaticity
+    - chebai_graph.preprocessing.properties.AtomHybridization
+    - chebai_graph.preprocessing.properties.AtomNumHs
+    - chebai_graph.preprocessing.properties.BondType
+    - chebai_graph.preprocessing.properties.BondInRing
+    - chebai_graph.preprocessing.properties.BondAromaticity
+    - chebai_graph.preprocessing.properties.RDKit2DNormalized
+  #classwise_weights_path: my_resgated_metrics.json  # can be calculated with chebai.results.generate_class_properties
+
+my_electra:
+  type: electra
+  ckpt_path: my_electra.ckpt
+  target_labels_path: ../python-chebai/data/chebi_v241/ChEBI50/processed/classes.txt
+  #classwise_weights_path: my_electra_metrics.json  # can be calculated with chebai.results.generate_class_properties
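Each top-level key in this file names one ensemble member, `type` selects the model class, and the remaining keys are passed to that model. As a rough sketch of how such a file could be inspected (hypothetical code, not the package's actual loader; only `yaml` is assumed):

```python
import yaml

# Hypothetical sketch: list the ensemble members defined in the config.
with open("configs/example_config.yml") as f:
    config = yaml.safe_load(f)

for name, spec in config.items():
    model_type = spec.pop("type")         # e.g. "chemlog", "resgated", "electra"
    weight = spec.pop("model_weight", 1)  # default model_weight is 1
    print(f"{name}: type={model_type}, weight={weight}, options={sorted(spec)}")
```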

data/disjoint_additional.csv

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+16670,60466
+16670,60194
+16670,60334
+60194,60466
+60334,60466
+60194,60334
+15841,25676
+46761,47923
+46761,48030
+46761,48545
+47923,48030
+47923,48545
+48030,48545
+90799,155837
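Each row pairs two ChEBI class IDs that are declared disjoint (e.g., 16670 and 60466), so a molecule must never be predicted as an instance of both. A minimal reader for this format (plain two-column CSV, no header row):

```python
import csv

# Load disjointness axioms as (id, id) pairs of ChEBI classes.
with open("data/disjoint_additional.csv") as f:
    disjoint_pairs = [(int(a), int(b)) for a, b in csv.reader(f)]

print(disjoint_pairs[0])  # -> (16670, 60466)
```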
