BAIO (Bioinformatics AI for Open-set detection) is a metagenomic analysis platform that uses foundation models for DNA sequence analysis. Unlike traditional reference-based pipelines (e.g., Kraken2, MetaPhlAn), BAIO applies open-set recognition to flag novel and divergent pathogens that fall outside its known classes, which makes it relevant to pandemic preparedness and surveillance.
- Taxonomy Profiling: Classifies sequencing reads or contigs into known classes (viral families, bacterial groups, host)
- Open-Set Detection: Flags sequences that don't fit known classes as "novel/unknown"
- Sample-Level Reporting: Aggregates read-level predictions into intuitive reports for surveillance
- Interpretability: Visualizes embeddings, attention patterns, and clusters of novel reads
- End-to-End Pipeline: FASTQ → Evo2 embeddings → classifier + OOD detection → GUI report (JSON + PDF)
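As a rough illustration of sample-level reporting, the sketch below turns read-level predictions into per-class fractions for one sample. The function name `aggregate_reads`, the `"novel"` label, and the 0.5 OOD threshold are hypothetical choices for this example and are not taken from BAIO's `agg.py`.

```python
from collections import Counter


def aggregate_reads(predictions, ood_threshold=0.5):
    """Summarize read-level calls as per-class fractions of a sample.

    predictions: iterable of (label, ood_score) pairs, one per read.
    Assumed convention: a higher ood_score means "more out-of-distribution",
    and reads above ood_threshold are counted as "novel".
    """
    counts = Counter()
    for label, ood_score in predictions:
        counts["novel" if ood_score > ood_threshold else label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Made-up reads for illustration only.
reads = [
    ("Coronaviridae", 0.1),
    ("Coronaviridae", 0.2),
    ("host", 0.05),
    ("Coronaviridae", 0.9),  # high OOD score, so counted as novel
]
report = aggregate_reads(reads)
```

A real report would likely add read counts and confidence summaries next to these fractions.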
- Streamlit: Simple, fast GUI with support for file upload, plots, and dashboards
- PyTorch + Hugging Face Transformers: Core ML framework
- Evo2: Nucleotide model for embeddings
- Custom Heads: MLP for taxonomy classification; OOD scores (Max Softmax, Energy, Mahalanobis)
- Biopython: FASTA/FASTQ parsing
- HDBSCAN: Clustering novel reads
- NumPy/Pandas: Embedding manipulation
- Scikit-learn: Model calibration and evaluation
- GitHub Actions: CI/CD pipeline
- pytest: Testing framework
- Conda/Poetry: Environment management
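To make the OOD scores named above more concrete, here is a minimal, dependency-free sketch of two of them computed from classifier logits: Max Softmax Probability and the energy score. The function names and the "higher means more novel" sign convention are assumptions for this example, not BAIO's `ood.py` API. Mahalanobis is omitted because it also needs per-class feature statistics.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(x))) over a list of logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def max_softmax_score(logits):
    """OOD score = 1 - max softmax probability (higher = more novel)."""
    lse = log_sum_exp(logits)
    return 1.0 - max(math.exp(x - lse) for x in logits)


def energy_score(logits, temperature=1.0):
    """Energy E = -T * logsumexp(logits / T) (higher = more novel)."""
    return -temperature * log_sum_exp([x / temperature for x in logits])


confident = [10.0, 0.0, 0.0]  # one class dominates
uncertain = [1.0, 1.0, 1.0]   # flat logits
```

Under both scores the flat logits rate as more novel than the confident ones. In practice the decision threshold would be calibrated on held-out data.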
metaseq-detector/
├── app/                  # Streamlit GUI
│   └── streamlit_app.py
├── metaseq/              # Core library
│   ├── dataio.py         # FASTA/FASTQ loaders, filters
│   ├── evo2_embed.py     # Evo2 embedding wrapper
│   ├── models.py         # Classifier heads
│   ├── ood.py            # MSP/Energy/Mahalanobis
│   ├── agg.py            # Sample-level aggregation
│   ├── cluster.py        # HDBSCAN for OOD reads
│   └── viz.py            # Plots: ROC, UMAP, attention
├── configs/              # YAMLs for experiments
├── notebooks/            # Exploratory notebooks
├── tests/                # Pytest unit tests
├── runs/                 # Saved reports/metrics
├── weights/              # Trained classifier heads
├── examples/             # Demo FASTQ/FASTA
├── environment.yml
├── pyproject.toml
└── docs/
    ├── weekly_report.md
    ├── design.md         # System architecture
    └── dataset_card.md   # Data sources and splits
- Python 3.8+
- CUDA-compatible GPU (recommended for Evo2 model)
# Clone the repository
git clone https://github.com/your-org/baio.git
cd baio
# Create and activate conda environment
conda env create -f environment.yml
conda activate baio
If you prefer using Python's built-in virtual environment instead of conda:
Windows:
# Navigate to project directory
cd baio
# Create virtual environment
python -m venv baio-env
# Alternative if python3 is your command
python3 -m venv baio-env
macOS/Linux:
# Navigate to project directory
cd baio
# Create virtual environment
python3 -m venv baio-env
Windows (Command Prompt):
baio-env\Scripts\activate
Windows (PowerShell):
baio-env\Scripts\Activate.ps1
macOS/Linux:
source baio-env/bin/activate
# Upgrade pip first
pip install --upgrade pip
# Install project dependencies
pip install -r requirements.txt
pip install streamlit torch transformers biopython numpy pandas scikit-learn hdbscan plotly
# Check Python version
python --version
# Check installed packages
pip list
# Test Streamlit installation
streamlit hello
# Clone and install with Poetry
git clone https://github.com/your-org/baio.git
cd baio
poetry install
- Open Command Palette (Ctrl+Shift+P or Cmd+Shift+P)
- Select "Python: Select Interpreter"
- Choose the Python executable from your virtual environment:
  - Conda: Usually in ~/miniconda3/envs/baio/bin/python
  - venv: baio-env/Scripts/python.exe (Windows) or baio-env/bin/python (macOS/Linux)
- Go to File → Settings → Project → Python Interpreter
- Click gear icon → Add
- Select "Existing Environment"
- Browse to your environment's Python executable
# For conda environment
conda activate baio
conda install ipykernel
python -m ipykernel install --user --name=baio --display-name="BAIO Project"
# For venv environment
source baio-env/bin/activate # or activate script for Windows
pip install ipykernel
python -m ipykernel install --user --name=baio-env --display-name="BAIO Project"
Create a .env file in your project root for API keys and configuration:
# .env file
OPENAI_API_KEY=your_api_key_here
HUGGINGFACE_TOKEN=your_hf_token_here
DEBUG=True
STREAMLIT_SERVER_PORT=8501
CUDA_VISIBLE_DEVICES=0
Load environment variables in your Python code:
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')
Solution: Use python3 instead of python, or ensure Python is in your system PATH.
Solution: Run PowerShell as administrator and execute:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Solution:
# Use python -m pip instead
python -m pip install package-name
Solution:
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
# Install CPU-only version if needed
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Solution:
- Ensure stable internet connection
- Check Hugging Face token permissions
- Use huggingface-cli login for authentication
1. Activate Environment:
   # Conda
   conda activate baio
   # venv (Linux/Mac)
   source baio-env/bin/activate
   # venv (Windows)
   baio-env\Scripts\activate
2. Update Dependencies:
   # Conda
   conda env update -f environment.yml
   # venv
   pip install -r requirements.txt
3. Run Development Server:
   streamlit run app/streamlit_app.py
4. Deactivate When Done:
   # Conda
   conda deactivate
   # venv
   deactivate
For Conda:
conda activate baio
conda install new-package
conda env export > environment.yml
For venv:
source baio-env/bin/activate # Windows: baio-env\Scripts\activate
pip install new-package
pip freeze > requirements.txt
For Poetry:
poetry add new-package
# Using conda
conda activate baio
streamlit run app/streamlit_app.py
# Using venv
source baio-env/bin/activate # Windows: baio-env\Scripts\activate
streamlit run app/streamlit_app.py
# Using Poetry
poetry run streamlit run app/streamlit_app.py
You can run the full BAIO stack (FastAPI backend and Streamlit UI) without installing Python or dependencies locally.
- Docker Desktop or Docker Engine with Compose v2.
- Git to clone the repository.
Copy the template and adjust if needed:
docker compose build
docker compose up
- Upload your FASTQ/FASTA files through the Streamlit interface
- Configure analysis parameters (sequence length filters, confidence thresholds)
- Run the analysis pipeline
- View results including:
- Taxonomy classifications
- Novel sequence detection
- Embedding visualizations
- Sample-level reports
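The analysis parameters mentioned above (sequence length filters and confidence thresholds) could be applied along these lines. This is a sketch under stated assumptions: the function names, the default bounds, and the `"unclassified"` label are hypothetical and do not reflect BAIO's actual configuration keys.

```python
def filter_by_length(reads, min_len=50, max_len=10_000):
    """Keep (read_id, sequence) pairs whose length lies in [min_len, max_len]."""
    return [(rid, seq) for rid, seq in reads if min_len <= len(seq) <= max_len]


def apply_confidence_threshold(calls, min_confidence=0.8):
    """Relabel calls below min_confidence as 'unclassified'.

    calls: list of (read_id, label, confidence) tuples.
    """
    return [
        (rid, label if conf >= min_confidence else "unclassified", conf)
        for rid, label, conf in calls
    ]


# Made-up inputs for illustration only.
reads = [("r1", "ACGT" * 20), ("r2", "ACGT")]  # 80 bp and 4 bp
kept = filter_by_length(reads)

calls = [("r1", "Coronaviridae", 0.95), ("r3", "host", 0.40)]
thresholded = apply_confidence_threshold(calls)
```

Filtering before embedding avoids spending GPU time on reads that are too short to classify reliably.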
- Accuracy, macro-F1
- Per-class Precision/Recall
- AUROC, AUPR-Out
- FPR@95%TPR
- OSCR matrix
- Confusion across {Known-Correct, Known-Wrong, Unknown-Correct, Unknown-Wrong}
- Reads/second processing rate
- Memory footprint
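For readers unfamiliar with the open-set metrics listed above, the following dependency-free sketch computes AUROC and FPR@95%TPR from two lists of OOD scores. It assumes novel reads are the positive class and that higher scores mean "more novel"; conventions vary, so check which class is treated as positive before comparing numbers. The function names are illustrative, and in practice a library such as scikit-learn would normally be used instead.

```python
import math


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. O(n*m), which is fine for small examples.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fpr_at_tpr(pos_scores, neg_scores, target_tpr=0.95):
    """False-positive rate at the strictest threshold reaching target_tpr."""
    ranked = sorted(pos_scores, reverse=True)
    # Number of positives that must score at or above the threshold.
    k = max(1, math.ceil(target_tpr * len(ranked)))
    threshold = ranked[k - 1]
    false_pos = sum(1 for n in neg_scores if n >= threshold)
    return false_pos / len(neg_scores)


# Made-up OOD scores: novel reads (positives) vs. known reads (negatives).
novel = [0.9, 0.8, 0.7, 0.6]
known = [0.1, 0.2, 0.65, 0.3]
area = auroc(novel, known)
fpr95 = fpr_at_tpr(novel, known)
```

With these toy scores, one known read (0.65) outranks the lowest-scoring novel read, so neither metric is perfect.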
# Activate your environment first
conda activate baio # or source baio-env/bin/activate
# Run tests
pytest tests/
# Format code
black metaseq/ app/
# Lint code
flake8 metaseq/ app/
# Type checking
mypy metaseq/ app/
We welcome contributions! Please see our contributing guidelines and code of conduct.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Set up your development environment (see Installation section)
- Make your changes and test thoroughly
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
BAIO is designed for:
- Pandemic Preparedness: Early detection of novel pathogens
- Metagenomic Surveillance: Monitoring environmental and clinical samples
- Research: Comparative analysis with traditional methods like Kraken2
- Clinical Diagnostics: Supporting pathogen identification in complex samples
If you use BAIO in your research, please cite:
@software{baio2024,
title={BAIO: LLM-Based Taxonomy Profiling & Open-Set Pathogen Detection},
author={Farhan, Tanzim and Hashami, Mustafa and Gujja, Sahana and Burns, Eric and Gaikwad, Manali},
year={2025},
url={https://github.com/your-org/baio}
}
This project is licensed under the MIT License - see the LICENSE file for details.
- Tanzim Farhan - Tech Lead
- Mustafa Hashami - Developer
- Sahana Gujja - Developer
- Eric Burns - Developer
- Manali Gaikwad - Developer
- Built on the Evo2 foundation model (Science, 2024)
- Inspired by the need for improved metagenomic surveillance capabilities
- Thanks to the open-source bioinformatics community
For questions, issues, or contributions, please:
- Open an issue on GitHub
- Contact the development team
- Check our documentation in the docs/ directory
Note: This is a research prototype. For production use in clinical settings, additional validation and regulatory approval may be required.