AI-powered content moderation with explainable decisions and continuous learning.
Most moderation tools rely on keyword lists that are easy to bypass, or black-box AI that offers no transparency. AntiBully combines hybrid ML, explainable AI (LIME), and a human-in-the-loop feedback system that improves the model over time.
Key capabilities:
- Hybrid ML: DistilBERT embeddings + XGBoost, enriched with user context (violation history, channel toxicity) from a Redis feature store
- Explainability: LIME highlights which words triggered a flag, so users see exactly why they were moderated
- Feedback loop: Users dispute false positives → admins review → corrected labels retrain the model monthly
- Drift detection: Evidently monitors for drift weekly; automated retraining triggers on significant drift or on the monthly schedule
- Multi-platform: One ML backend serving Discord, Slack, and WhatsApp
AntiBully Bot is an intelligent content moderation system that combines:
- Hybrid ML Model: DistilBERT (text embeddings) + XGBoost (user context features)
- Explainable AI: LIME-generated explanations for every decision
- Admin Feedback Loop: Human-in-the-loop corrections improve model accuracy
- Multi-Platform: Single ML backend serves Discord, Slack, WhatsApp bots
- Production MLOps: Automated drift detection, retraining, and deployment
Traditional moderation tools either:
- Use simple keyword filters (easy to bypass)
- Use black-box AI (no transparency)
- Can't learn from mistakes (frozen models)
AntiBully solves all three by combining state-of-the-art NLP, explainable AI, and continuous learning.
- Context-Aware: Uses user history (violation rate, tenure) + channel toxicity
- Multi-Level Severity: LOW/MEDIUM/HIGH classification with configurable actions
- Strike System: Graduated penalties (warn → timeout → kick → ban)
- Configurable: Per-server thresholds, actions, and message templates
- LIME Integration: Shows which words contributed to toxicity score
- User Dashboard: Users can see why they were flagged via the `!explain` command
- Admin Dashboard: Admins review uncertain cases with full context
- Dispute System: Users dispute → admin reviews → model learns
- Drift Detection: Evidently monitors data/model drift weekly
- Automated Retraining: Triggers on high drift or monthly schedule
- Feature Store: Redis caches user features for <5ms inference
- Versioning: DVC tracks data, MLflow tracks models
- CI/CD Ready: Docker containers, Railway.app deployment
- Double-Gated Feedback: User disputes require admin approval
- Spam Protection: Detects coordinated attacks, flooding, repetition
- Protected Patterns: Slurs never overridden by user feedback
- Audit Trail: Every action logged with timestamps and admin IDs
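As a rough illustration of the graduated strike system listed above, the sketch below maps a user's strike count to an action. The threshold values and the names `ServerConfig` and `action_for_strikes` are assumptions made for this example; the real per-server settings are whatever `/config` stores.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Escalation ladder from the feature list: warn -> timeout -> kick -> ban.
ACTIONS = ("warn", "timeout", "kick", "ban")


@dataclass
class ServerConfig:
    # Strike count at which each action starts to apply.
    # These defaults are illustrative assumptions, not the bot's real values.
    thresholds: Dict[str, int] = field(
        default_factory=lambda: {"warn": 1, "timeout": 3, "kick": 5, "ban": 7}
    )


def action_for_strikes(strikes: int, config: ServerConfig) -> Optional[str]:
    """Return the harshest action whose threshold has been reached, if any."""
    chosen = None
    for action in ACTIONS:
        if strikes >= config.thresholds[action]:
            chosen = action
    return chosen
```

With these assumed defaults, a user with four strikes would receive a timeout, and a server that wants stricter handling could lower the thresholds in its own `ServerConfig`.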
┌──────────────────────────────────────────────────────────────────┐
│                         USER INTERFACES                          │
│  Discord Bot  │  Slack Bot  │  WhatsApp Bot  │  Telegram (WIP)   │
└────────────┬─────────────────────────────────────────────────────┘
             │ WebSocket/REST
             ▼
┌──────────────────────────────────────────────────────────────────┐
│                        BOT SERVICE LAYER                         │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐          │
│  │  Moderation  │   │    Admin     │   │   Feedback   │          │
│  │ (on_message) │   │   Commands   │   │  (!explain)  │          │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘          │
└─────────┼──────────────────┼──────────────────┼──────────────────┘
          │ POST /predict    │ GET /config      │ POST /explain
          ▼                  ▼                  ▼
┌──────────────────────────────────────────────────────────────────┐
│                     INFERENCE API (FastAPI)                      │
│ ┌──────────────────┐  ┌──────────────────┐  ┌─────────────────┐  │
│ │ Toxicity Detector│  │ LIME Explainer   │  │ Feature Enricher│  │
│ │ DistilBERT + XGB │  │ (word importance)│  │ (Redis lookup)  │  │
│ └──────┬───────────┘  └──────────────────┘  └────────┬────────┘  │
└────────┼─────────────────────────────────────────────┼───────────┘
         │ log_event()                                 │ get_features()
         ▼                                             ▼
┌──────────────────────────────────────────────────────────────────┐
│                            DATA LAYER                            │
│ ┌────────────────────────────┐  ┌─────────────────────────────┐  │
│ │ Supabase PostgreSQL        │  │ Redis (Feature Store)       │  │
│ │ ├─ logs                    │  │ ├─ user_toxicity:prod:{id}  │  │
│ │ ├─ server_configs          │  │ ├─ channel_stats:{id}       │  │
│ │ ├─ server_user_violations  │  │ └─ (5ms lookups)            │  │
│ │ ├─ feedback (disputes)     │  └─────────────────────────────┘  │
│ │ └─ admin_review_queue      │                                   │
│ └────────────────────────────┘                                   │
└──────────────────────────┬───────────────────────────────────────┘
                           │ Nightly Sync
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                   MLOPS ORCHESTRATOR (Prefect)                   │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ STEP 1: Drift Detection                                    │  │
│  │ KS-test + PSI (statistical) + Evidently report             │  │
│  │ → severity: none / low / medium / high / critical          │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                │                                 │
│                                ▼                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ STEP 2: Data Ingestion                                     │  │
│  │ Stratified sample (5% anchors) + 100% admin feedback       │  │
│  │ Quality checks → merge → DVC push to S3 (Parquet)          │  │
│  │ → status: success / skipped / failed (aborts pipeline)     │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                │                                 │
│                                ▼                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ STEP 3: Retraining Decision                                │  │
│  │ high/critical drift → trigger main_flow() immediately      │  │
│  │ medium drift        → flag for next scheduled run          │  │
│  │ low/none            → skip, model is stable                │  │
│  │ Experiments tracked in MLflow (DagsHub)                    │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
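To make Steps 1 and 3 of the orchestrator more concrete, here is a minimal sketch of a Population Stability Index (PSI) calculation and of the severity-to-action mapping from the diagram. It assumes the monitored values are scores in [0, 1] and uses equal-width bins. The function names and the returned action strings are hypothetical, and how the pipeline turns KS-test and PSI statistics into a severity level is not specified in this README, so that step is left out.

```python
import math
from typing import List


def psi(reference: List[float], current: List[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of values in [0, 1]."""

    def bucket_shares(values: List[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = min(int(v * bins), bins - 1)  # put v == 1.0 in the last bin
            counts[idx] += 1
        # A small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref = bucket_shares(reference)
    cur = bucket_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Step 3 of the diagram, expressed as a lookup table.
# The action names on the right are placeholders for this sketch.
ACTION_BY_SEVERITY = {
    "none": "skip",
    "low": "skip",
    "medium": "flag_for_next_run",
    "high": "retrain_now",
    "critical": "retrain_now",
}


def retraining_decision(severity: str) -> str:
    """Map a drift severity to the pipeline action described in Step 3."""
    try:
        return ACTION_BY_SEVERITY[severity]
    except KeyError:
        raise ValueError(f"Unknown drift severity: {severity!r}") from None
```

Identical reference and current samples give a PSI of zero, and the value grows as the two distributions diverge.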
Input: "you are trash" + user_id + channel_id
                    ▼
┌─────────────────────────────────────────────────────────┐
│ STAGE 1: TEXT FEATURES                                  │
├─────────────────────────────────────────────────────────┤
│ DistilBERT Embeddings (768 dims)                        │
│ + Static Features:                                      │
│   ├─ msg_len, caps_ratio, slur_count                    │
│   ├─ personal_pronoun_count, question_count             │
│   └─ char_repetition, exclamation_count                 │
└───────────────────┬─────────────────────────────────────┘
                    │ (768 + 15 features)
                    ▼
┌─────────────────────────────────────────────────────────┐
│ STAGE 2: USER CONTEXT ENRICHMENT (Redis)                │
├─────────────────────────────────────────────────────────┤
│   ├─ user_bad_ratio_7d (% toxic messages)               │
│   ├─ violation_count_7d                                 │
│   ├─ channel_toxicity_ratio                             │
│   ├─ hours_since_last_msg                               │
│   └─ is_new_to_channel                                  │
└───────────────────┬─────────────────────────────────────┘
                    │ (768 + 15 + 5 = 788 features)
                    ▼
┌─────────────────────────────────────────────────────────┐
│ STAGE 3: XGBoost Classifier                             │
├─────────────────────────────────────────────────────────┤
│ Output: P(toxic) ∈ [0, 1]                               │
│   ├─ < 0.3    → SAFE                                    │
│   ├─ 0.3–0.5  → LOW                                     │
│   ├─ 0.5–0.7  → MEDIUM                                  │
│   └─ > 0.7    → HIGH                                    │
└───────────────────┬─────────────────────────────────────┘
                    ▼
{ "is_toxic": true, "confidence": 0.85, "severity": "HIGH" }
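The three stages above can be sketched in plain Python as follows. This is only an illustration of how the 788-dimensional input and the severity bands fit together: `build_feature_vector` and `severity_from_probability` are hypothetical names, the 0.0 fallback for missing context features is an assumption, and because the diagram does not say which band owns the exact boundary values, this sketch assigns 0.3 to LOW, 0.5 to MEDIUM, and 0.7 to MEDIUM.

```python
from typing import Dict, List, Sequence

EMBEDDING_DIMS = 768  # DistilBERT embedding size (Stage 1)
STATIC_DIMS = 15      # hand-crafted text features (Stage 1)
CONTEXT_KEYS = (      # user/channel context looked up in Redis (Stage 2)
    "user_bad_ratio_7d",
    "violation_count_7d",
    "channel_toxicity_ratio",
    "hours_since_last_msg",
    "is_new_to_channel",
)


def build_feature_vector(
    embedding: Sequence[float],
    static_features: Sequence[float],
    context: Dict[str, float],
) -> List[float]:
    """Concatenate the three feature groups into one 788-value vector."""
    if len(embedding) != EMBEDDING_DIMS:
        raise ValueError(f"expected {EMBEDDING_DIMS} embedding values")
    if len(static_features) != STATIC_DIMS:
        raise ValueError(f"expected {STATIC_DIMS} static features")
    # Assumption: a feature missing from the store falls back to 0.0.
    context_values = [float(context.get(key, 0.0)) for key in CONTEXT_KEYS]
    return list(embedding) + list(static_features) + context_values


def severity_from_probability(p_toxic: float) -> str:
    """Translate P(toxic) into the severity bands from Stage 3."""
    if not 0.0 <= p_toxic <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    if p_toxic < 0.3:
        return "SAFE"
    if p_toxic < 0.5:
        return "LOW"
    if p_toxic <= 0.7:
        return "MEDIUM"
    return "HIGH"
```

For the example response above, a confidence of 0.85 falls in the HIGH band.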
1. Model flags message → User gets DM with LIME explanation
2. User clicks "Wrong" → submits dispute reason
3. Admin runs /review → approves or overrides model decision
4. Month-end: corrected labels + high-confidence anchors retrain model
5. New model deployed → false positive rate drops
Admins can review disputes one-by-one or bulk-approve by user pattern, which is useful when the model is systematically misclassifying gaming slang, regional expressions, etc.
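One way to picture how reviewed disputes become training labels is the sketch below. It assumes a simple `Dispute` record and a list of protected terms; both are hypothetical stand-ins for the real `feedback` table and slur list. Only disputes that an admin has reviewed contribute a label, and a message containing a protected term stays labelled toxic regardless of the feedback.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Dispute:
    text: str
    model_label: int            # 1 = toxic, 0 = safe, as predicted by the model
    admin_label: Optional[int]  # None until an admin has reviewed the dispute


def corrected_labels(
    disputes: Sequence[Dispute], protected_terms: Sequence[str]
) -> List[Tuple[str, int]]:
    """Build (text, label) rows for retraining from admin-reviewed disputes."""
    rows = []
    for dispute in disputes:
        if dispute.admin_label is None:
            continue  # user disputed, but no admin decision yet: not usable
        lowered = dispute.text.lower()
        if any(term in lowered for term in protected_terms):
            label = 1  # protected patterns are never relabelled as safe
        else:
            label = dispute.admin_label
        rows.append((dispute.text, label))
    return rows
```

The month-end retraining run would then combine rows like these with the high-confidence anchor samples mentioned in step 4.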
- Python 3.11+
- Node.js 18+ (for some ML tooling)
- PostgreSQL 15+ or a Supabase account
- Redis 7+
- Discord bot token
- S3-compatible storage (AWS S3, MinIO, or Supabase Storage)
- MLflow tracking server (free via DagsHub)
git clone https://github.com/yourusername/antibully-bot.git
cd antibully-bot
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
Edit .env:
# Database
DATABASE_URL=postgresql://user:pass@host:5432/db
# Redis
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD= # leave blank if none
# Discord
DISCORD_TOKEN=your_bot_token
DISCORD_APPLICATION_ID=your_app_id
# API
API_BASE_URL=http://localhost:8000
# MLOps
MLFLOW_TRACKING_URI=https://dagshub.com/user/repo.mlflow
S3_BUCKET=your-bucket
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
# Model
MODEL_LOCAL_PATH=./baked_model
EXPERIMENT_NAME=toxicity-detector
STAGE=Production
python scripts/run_migrations.py
This creates the logs, server_configs, server_user_violations, and feedback tables, plus indexes.
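How the services read the `.env` values is not shown in this README. As a hedged sketch, the loader below validates a few of the variables listed in `.env` and applies defaults that match the example file. It assumes the variables are already present in the process environment (for instance exported by the shell or loaded by a tool such as python-dotenv); `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_host: str
    redis_port: int
    redis_password: Optional[str]
    discord_token: str
    api_base_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read and validate configuration from environment variables."""
    required = ("DATABASE_URL", "DISCORD_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_host=env.get("REDIS_HOST", "localhost"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        # A blank REDIS_PASSWORD means "no password", as noted in .env.
        redis_password=env.get("REDIS_PASSWORD") or None,
        discord_token=env["DISCORD_TOKEN"],
        api_base_url=env.get("API_BASE_URL", "http://localhost:8000"),
    )
```

Failing fast on a missing `DATABASE_URL` or `DISCORD_TOKEN` tends to be easier to debug than a connection error later in startup.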
Open three terminals:
# Terminal 1: Inference API
uvicorn api_service.app:app --reload --port 8000
# Terminal 2: Discord Bot
python -m bot_service.bot
# Terminal 3: Redis (if running locally)
redis-server
Invite the bot to your server with this URL, replacing YOUR_APP_ID with your application ID:
https://discord.com/api/oauth2/authorize?client_id=YOUR_APP_ID&permissions=1099780063238&scope=bot%20applications.commands
In your Discord server, run:
/config
This opens an interactive menu to set strike thresholds, timeout durations, severity actions, log channels, and more. To use safe defaults immediately:
/quickset preset:balanced
/review # Work through dispute queue
/review view:stats # See model accuracy and review counts
/review view:by_user # Group disputes by user for bulk review
/review_user user:@name # Bulk approve/override all disputes from one user
/pardon user:@name reason:"false positive"
/strikes user:@name # View violation history
!explain # Get a DM showing which words triggered the flag, with a dispute button
Estimated cost: ~$19/month. Setup time: ~30 minutes.
- Create an account at railway.app and connect your GitHub repo.
- Create four services:
| Service | Dockerfile | Start Command | Notes |
|---|---|---|---|
| Discord Bot | Dockerfile.bot | python -m bot_service.bot | 512 MB RAM |
| Inference API | Dockerfile.api | uvicorn api_service.app:app | 2 GB RAM, expose port 8000 |
| Redis | Plugin | - | 1-click provision |
| MLOps Worker | Dockerfile.mlops | - | Cron: 0 3 * * * |
- Add environment variables to each service.
- Push to `main`; Railway auto-deploys.
| Endpoint | Method | Description | Latency |
|---|---|---|---|
| /predict | POST | Classify message toxicity | 50–100ms |
| /explain | POST | Generate LIME word importance | 2–4s |
| /feedback | POST | Record user dispute | <50ms |
| /health | GET | Service status | <10ms |
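For reference, a client could call the `/predict` endpoint listed above roughly like this, using only the Python standard library. The base URL matches the `API_BASE_URL` default from `.env`; the helper names `build_predict_request` and `predict` are made up for this sketch, and the response shape is assumed to follow the example in this README.

```python
import json
import urllib.request

API_BASE_URL = "http://localhost:8000"  # same default as API_BASE_URL in .env


def build_predict_request(
    text: str, user_id: str, channel_id: str, base_url: str = API_BASE_URL
) -> urllib.request.Request:
    """Build the POST /predict request without sending it."""
    payload = json.dumps(
        {"text": text, "user_id": user_id, "channel_id": channel_id}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, user_id: str, channel_id: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response from the inference API."""
    request = build_predict_request(text, user_id, channel_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect in tests without a running API.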
POST /predict example:
// Request
{ "text": "you are trash", "user_id": "u123", "channel_id": "c456" }
// Response
{ "is_toxic": true, "confidence": 0.85, "severity": "HIGH", "features_used": { ... } }
- Discord bot with LIME explanations
- Admin feedback loop and review dashboard
- Automated MLOps pipeline with drift detection
- Slack integration
- WhatsApp integration
- Multi-language support
- Mobile admin app
DistilBERT · XGBoost · LIME · FastAPI · discord.py · Supabase · Redis · MLflow · DVC · Evidently · Prefect · Railway.app
MIT License · Built for safer online communities