LoGoBERT-PPI is a protein–protein interaction (PPI) prediction framework based on protein language model embeddings and late-interaction scoring.
The model enables scalable all-by-all inference across large proteomes by decoupling protein embedding from interaction scoring.
This repository contains the training and inference code used in the LoGoBERT-PPI study.
LoGoBERT-PPI follows a two-stage pipeline designed for scalable interaction inference:
1. **Protein embedding.** Protein sequences are encoded using a pretrained protein language model (ESM2). Embeddings are computed once and cached, allowing reuse across multiple inference tasks.
2. **Pairwise interaction inference.** Interaction scores are computed with late-interaction (MaxSim-style) scoring between token-level embeddings. Because embeddings are precomputed, interaction scoring can be performed efficiently in large batches.
By separating embedding from interaction scoring, LoGoBERT-PPI enables more scalable inference than cross-encoder architectures that jointly encode each protein pair.
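The late-interaction scoring step can be illustrated with a minimal NumPy sketch. The function name `maxsim_score`, the cosine normalization, and the mean pooling are illustrative assumptions, not the exact implementation in this repository:

```python
import numpy as np

def maxsim_score(E_a: np.ndarray, E_b: np.ndarray) -> float:
    """MaxSim-style late-interaction score between two proteins.

    E_a: (La, d) token-level embeddings of protein A
    E_b: (Lb, d) token-level embeddings of protein B

    Each token in A is matched to its most similar token in B,
    and the matches are averaged into a single interaction score.
    """
    # Normalize rows so dot products become cosine similarities.
    E_a = E_a / np.linalg.norm(E_a, axis=1, keepdims=True)
    E_b = E_b / np.linalg.norm(E_b, axis=1, keepdims=True)
    sim = E_a @ E_b.T               # (La, Lb) token-token similarities
    return float(sim.max(axis=1).mean())  # MaxSim over B, mean over A
```

Because the score depends only on the two embedding matrices, proteins can be encoded once and scored against many partners without re-running the encoder.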
Clone the repository:

```bash
git clone https://github.com/netbiolab/LoGoBERT-PPI.git
cd LoGoBERT-PPI
```

### Conda Environment (recommended)

Create and activate the conda environment:

```bash
conda env create -f environment.yml
conda activate logobert
```

Install this repository as an editable package:

```bash
pip install -e .
```

### Requirements

- Python ≥ 3.9
- PyTorch ≥ 2.0
- transformers
- huggingface_hub
- numpy
- pandas
- tqdm
- biopython
- scikit-learn
- wandb (optional, for training)

Exact versions are pinned in environment.yml for reproducibility.
Pretrained models are available on Hugging Face:
👉 https://huggingface.co/hbeen/LoGoBERT-PPI-Eukaryote
Models can be loaded directly:

```python
from logobert.model.LoGo_BERT import LoGo_BERT

model = LoGo_BERT.from_pretrained(
    "netbiolab/LoGoBERT-PPI-Eukaryote"
)
```

The tokenizer is loaded from the base protein language model (ESM2).
Example multi-GPU training:

```bash
CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node=3 \
    script/train_logobert.py \
    --train_path test_train_data/train/test_train.tsv \
    --val_path test_train_data/train/test_val.tsv \
    --model_name facebook/esm2_t33_650M_UR50D \
    --embedding_dim 512 \
    --max_length 512 \
    --batch_size 4 \
    --grad_accum_steps 8 \
    --epochs 20 \
    --save_path checkpoints/run1 \
    --seed 141 \
    --use_maxsim \
    --use_ln_g1
```

Input files follow the same pair-list and FASTA layout as D-SCRIPT (https://d-script.readthedocs.io/en/stable/data.html).
**Pair file (CSV).** A CSV file containing protein identifier pairs:

```csv
query,text
P12345,Q99999
P12345,Q88888
```

**FASTA file.** Protein identifiers must match the IDs used in the pair file.
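Because a mismatched identifier silently leaves a protein without a sequence, a small pre-flight check can verify that every ID in the pair file appears in the FASTA file. This is a hypothetical helper using only the standard library, not part of this repository:

```python
import csv

def read_fasta_ids(handle):
    """Collect sequence identifiers (first token after '>') from a FASTA stream."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            ids.add(line[1:].split()[0])
    return ids

def missing_pair_ids(pair_handle, fasta_ids):
    """Return identifiers from the pair file that have no FASTA sequence."""
    reader = csv.DictReader(pair_handle)  # expects 'query' and 'text' columns
    missing = set()
    for row in reader:
        for pid in (row["query"], row["text"]):
            if pid not in fasta_ids:
                missing.add(pid)
    return missing
```

Running this before inference turns a confusing downstream failure into an explicit list of unmatched identifiers.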
Compute embeddings and infer interactions:

```bash
CUDA_VISIBLE_DEVICES=1,2 torchrun --nproc_per_node=2 \
    script/inference_HF.py \
    --pair_csv xspecies_ID_test/yeast_test.csv \
    --fasta_path xspecies_ID_test/xspecies_fasta/yeast_dedup.fasta \
    --hf_repo netbiolab/LoGoBERT-PPI-Eukaryote \
    --model_name facebook/esm2_t33_650M_UR50D \
    --max_length 800 \
    --embedding_save_path embeddings.pt \
    --output_path pair_scores.csv
```

### Use cached embeddings
```bash
CUDA_VISIBLE_DEVICES=1,2 torchrun --nproc_per_node=2 \
    script/inference_HF.py \
    --pair_csv xspecies_ID_test/yeast_test.csv \
    --embeddings_path embeddings.pt \
    --hf_repo netbiolab/LoGoBERT-PPI-Eukaryote \
    --max_length 800 \
    --output_path pair_scores.csv
```

Embeddings are computed once and reused during interaction scoring to enable efficient large-scale inference.
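The compute-once, reuse-many-times pattern behind `--embedding_save_path` and `--embeddings_path` can be sketched as follows. This is illustrative only: the repository's scripts store a `.pt` file via `torch.save`, whereas this sketch uses `pickle` and a placeholder `embed_fn` to stay dependency-free:

```python
import os
import pickle

def get_embeddings(cache_path, sequences, embed_fn):
    """Load cached per-protein embeddings, computing and saving them on a miss.

    sequences: dict mapping protein ID -> sequence string
    embed_fn:  callable that encodes one sequence (the expensive step)
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)   # cache hit: skip the encoder entirely
    embeddings = {pid: embed_fn(seq) for pid, seq in sequences.items()}
    with open(cache_path, "wb") as f:
        pickle.dump(embeddings, f)  # persist for later scoring runs
    return embeddings
```

Any number of pair files can then be scored against the same cache, so the encoder cost is paid once per proteome rather than once per pair.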
Parts of the batching utilities were adapted from PLM-interact (MIT License, Dan Liu, 2024). See the NOTICE file for details.
This project is released under the Apache License 2.0.
Some components were adapted from PLM-interact (MIT License, Dan Liu, 2024).