A machine learning model to predict which chemical compounds can fight HIV effectively, helping researchers focus on the most promising candidates.
This project uses QSAR (Quantitative Structure-Activity Relationship) modeling to predict HIV drug compound efficacy. It helps pharmaceutical researchers identify promising compounds before expensive lab testing.
Key Benefits:
- Reduce screening time from months to hours
 - Focus resources on high-probability compounds
 - Improve success rates in drug discovery
 
Features:
- Complete ML pipeline from data preprocessing to deployment
 - REST API for real-time predictions
 - Model monitoring and performance tracking
 - Batch processing for large compound libraries
 - Docker containerization for easy deployment
 
Source: NCI AIDS Antiviral Screen Data
- 40,000+ HIV-tested compounds
 - Activity classes: CA (Active), CM (Moderately Active), CI (Inactive)
 - EC50/IC50 measurements
 - Molecular structure data
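
To illustrate how the three activity classes could become training labels, here is a minimal sketch that maps the screening codes to integers, with an optional binary collapse (inactive vs. any activity). The names `ACTIVITY_LABELS`, `encode_activity`, and `class_distribution`, and the numeric codes themselves, are assumptions made for this example and are not taken from the project's code.

```python
from collections import Counter

# Screening outcome codes from the dataset description.
# The numeric values are an assumption chosen for illustration.
ACTIVITY_LABELS = {"CI": 0, "CM": 1, "CA": 2}


def encode_activity(code: str, binary: bool = False) -> int:
    """Map a screening code to an integer label.

    With binary=True, CM and CA are merged into one "active" class (1).
    """
    label = ACTIVITY_LABELS[code.strip().upper()]
    return int(label > 0) if binary else label


def class_distribution(codes):
    """Count how many compounds fall into each activity class."""
    return Counter(code.strip().upper() for code in codes)


codes = ["CI", "CI", "CM", "CA", "ci"]
print([encode_activity(c) for c in codes])
print([encode_activity(c, binary=True) for c in codes])
print(class_distribution(codes))
```

Checking the class distribution before training is useful because screening data of this kind is usually dominated by inactive compounds, which affects how metrics such as accuracy should be read.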
 
Tech Stack:
- Language: Python 3.8+
 - ML Libraries: scikit-learn, XGBoost, RDKit
 - API: FastAPI
 - Database: PostgreSQL
 - MLOps: MLflow
 - Deployment: Docker
 
| Metric | Value | 
|---|---|
| Accuracy | 87.3% | 
| F1-Score | 0.84 | 
| Cohen's Kappa | 0.79 | 
| AUC-ROC | 0.91 | 
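
For readers less familiar with Cohen's Kappa, the sketch below shows how accuracy and kappa relate, using only the standard library. It is not the project's evaluation code, and the labels are toy values rather than results from the dataset. In practice, `sklearn.metrics.accuracy_score` and `sklearn.metrics.cohen_kappa_score` compute the same quantities.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohens_kappa(y_true, y_pred):
    """Agreement between predictions and labels, corrected for chance.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    Undefined (division by zero) when both sequences use a single class.
    """
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that labels and predictions would
    # coincide if each were drawn independently from its own distribution.
    p_expected = sum(
        true_counts[c] * pred_counts[c] for c in true_counts
    ) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for illustration only (not the project's data).
y_true = ["CA", "CI", "CI", "CM", "CI", "CA"]
y_pred = ["CA", "CI", "CM", "CM", "CI", "CI"]
print(f"accuracy={accuracy(y_true, y_pred):.3f}")
print(f"kappa={cohens_kappa(y_true, y_pred):.3f}")
```

Kappa is lower than accuracy whenever some agreement is expected by chance, which is why it is a useful companion metric on imbalanced screening data.
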
# Clone repository
git clone https://github.com/pari1jay/Prediction-Model-HC.git
cd Prediction-Model-HC
# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows
# Install dependencies
pip install -r requirements.txt

# Development mode
python src/main.py
# Production with Docker
docker-compose up -d

import requests
# Predict compound activity
response = requests.post(
    "http://localhost:8000/predict",
    json={"smiles": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"}
)
print(response.json())

# Train and evaluate a model
from src.training.pipeline import TrainingPipeline
pipeline = TrainingPipeline()
pipeline.load_data("data/hiv_compounds.csv")
pipeline.preprocess()
pipeline.train()
pipeline.evaluate()

# Load a trained model and predict a single compound
from src.prediction.predictor import CompoundPredictor
predictor = CompoundPredictor.load("models/best_model.pkl")
result = predictor.predict_smiles("CCO")
print(f"Activity: {result['activity']}, Confidence: {result['confidence']:.3f}")

# Batch prediction for a CSV file of compounds
python scripts/batch_predict.py --input compounds.csv --output predictions.csv

Prediction-Model-HC/
├── src/
│   ├── data/           # Data processing
│   ├── features/       # Feature engineering
│   ├── models/         # ML models
│   ├── training/       # Training pipeline
│   ├── prediction/     # Prediction service
│   └── api/           # REST API
├── data/              # Datasets
├── models/            # Trained models
├── scripts/           # Utility scripts
├── tests/             # Test files
└── docs/              # Documentation
- Data Integration: Merge screening results, EC50/IC50 values, and molecular structures
 - Quality Control: Handle duplicates, conflicts, and missing data
 - Feature Engineering: Calculate molecular descriptors and fingerprints
 - Model Training: Train and validate multiple ML models
 - Evaluation: Assess model performance using relevant metrics
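
The quality control step above can be sketched as follows. This is a minimal illustration, assuming records arrive as `(smiles, activity)` pairs and that compounds with contradictory screening results are set aside rather than resolved. The function name `resolve_duplicates` and this conflict policy are assumptions for the example, not the project's actual implementation.

```python
from collections import defaultdict


def resolve_duplicates(records):
    """Collapse repeated compounds and set aside conflicting ones.

    `records` is an iterable of (smiles, activity) pairs. Rows with a
    missing SMILES string or activity code are skipped.
    Returns (clean, conflicts): a dict mapping smiles -> activity, and a
    sorted list of SMILES strings whose screening results disagree.
    """
    labels_by_compound = defaultdict(set)
    for smiles, activity in records:
        if not smiles or not activity:
            continue  # missing data: nothing usable in this row
        labels_by_compound[smiles].add(activity)

    clean, conflicts = {}, []
    for smiles, labels in labels_by_compound.items():
        if len(labels) == 1:
            clean[smiles] = next(iter(labels))
        else:
            conflicts.append(smiles)
    return clean, sorted(conflicts)


rows = [
    ("CCO", "CI"),
    ("CCO", "CI"),        # exact duplicate
    ("c1ccccc1", "CA"),
    ("c1ccccc1", "CI"),   # conflicting result for the same compound
    ("CC(=O)O", None),    # missing activity
]
clean, conflicts = resolve_duplicates(rows)
print(clean)
print(conflicts)
```

Grouping by the raw SMILES string is a simplification, since one molecule can be written as several different SMILES strings. Canonicalizing first (for example with RDKit's `Chem.MolToSmiles(Chem.MolFromSmiles(smiles))`) would catch duplicates that differ only in notation.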
 
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -m 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Open a Pull Request
 
This project is licensed under the MIT License - see the LICENSE file for details.
- National Cancer Institute for the AIDS Antiviral Screen Data
 - RDKit community for cheminformatics tools
 - Open source contributors
 
Pari Jay - GitHub
Project Link: https://github.com/pari1jay/Prediction-Model-HC