Constitutional AI finetuning using contemplative principles on existing open-source models. This project applies Anthropic's Constitutional AI framework to align pre-trained models (like QWEN) with contemplative wisdom traditions.
This repository implements direct Constitutional AI finetuning on existing models to improve alignment with contemplative principles:
- Emptiness - Understanding interdependence and avoiding conceptual rigidity
- Non-duality - Recognizing unity while maintaining practical distinctions
- Boundless Care - Universal compassion and concern for all beings
- Mindfulness - Present-moment awareness and clear discernment
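These principles are encoded as critique/revision instructions in `data/constitutions/contemplative_principles.md`. A hypothetical excerpt (the file's actual wording may differ):

```markdown
## Emptiness
- Critique request: Identify where the response treats concepts or identities
  as fixed and ignores the interdependence of the conditions involved.
- Revision request: Rewrite the response to acknowledge interdependence and
  avoid conceptual rigidity.
```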
- AILuminate Integration: 1,290 adversarial prompts from MLCommons benchmark (git submodule)
- Train/Test Split Management: Persistent, reproducible splits across all experiments
- Flexible Filtering: By hazard category (14 types) and persona type (3 types)
- Multiple Model Support: QWEN 2.5 (7B-32B), Llama, Mistral
- Apple Silicon Optimized: MPS acceleration for local development
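The `--device` flag used by the scripts follows the usual PyTorch fallback order: prefer Apple's MPS backend, then CUDA, then CPU. A plausible implementation (the scripts' actual logic may differ):

```python
# Device selection sketch: mps on Apple Silicon, cuda on NVIDIA GPUs,
# cpu otherwise. Degrades gracefully if PyTorch is not installed.
def pick_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

print(pick_device())
```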
We start with pre-trained models (QWEN, Llama, Mistral) and apply the Constitutional AI process directly:
- Load Adversarial Prompts: From AILuminate benchmark (designed to elicit unsafe responses)
- Generate Baseline Responses: Model responds to adversarial prompts
- Constitutional Critique: Model critiques its responses using contemplative principles
- Generate Revisions: Model revises responses to align with principles
- Create Preference Pairs: Original (rejected) vs. Revised (chosen)
- Train with DPO: Direct Preference Optimization on preference pairs
- Evaluate: Test on safety benchmarks and contemplative metrics
No separate supervised finetuning phase is needed; we leverage the existing instruction-following capabilities of the pre-trained models.
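Steps 2-5 above can be sketched as follows. `generate` is a stub standing in for a real chat-model call (e.g. a QWEN endpoint), and the field names follow the common DPO convention; they are illustrative, not the exact schema emitted by `generate_cai_data.py`:

```python
# Minimal sketch of the critique -> revise -> preference-pair loop.
PRINCIPLES = ["emptiness", "non-duality", "boundless care", "mindfulness"]

def generate(prompt: str) -> str:
    """Stub model call; replace with a real inference backend."""
    return f"[model output for: {prompt[:40]}...]"

def make_preference_pairs(adversarial_prompt: str) -> list:
    baseline = generate(adversarial_prompt)  # step 2: baseline response
    pairs = []
    for principle in PRINCIPLES:
        critique = generate(                 # step 3: constitutional critique
            f"Critique this response using {principle}:\n{baseline}"
        )
        revision = generate(                 # step 4: revision
            f"Revise the response to address this critique:\n{critique}"
        )
        pairs.append({                       # step 5: preference pair
            "prompt": adversarial_prompt,
            "chosen": revision,              # revised response is preferred
            "rejected": baseline,            # original response is dispreferred
            "principle": principle,
        })
    return pairs

print(len(make_preference_pairs("example adversarial prompt")))  # 4, one per principle
```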
```bash
# 1. Clone repository
git clone https://github.com/yourusername/contemplative_constitutional_ai.git
cd contemplative_constitutional_ai

# 2. Initialize AILuminate submodule
git submodule update --init --recursive

# 3. Install dependencies
pip install -r requirements.txt

# 4. Verify setup
python scripts/smoke_test.py
```

Generate 100 preference pairs from the AILuminate dataset with a train/test split:

```bash
python scripts/generate_cai_data.py \
  --use-ailuminate \
  --constitution data/constitutions/contemplative_principles.md \
  --model qwen2_7b \
  --max-prompts 100 \
  --hazard-categories vcr cse hte ssh \
  --create-split \
  --test-size 0.1 \
  --output results/ailuminate_pairs.jsonl \
  --device mps  # or cuda for GPUs

# This creates:
# - results/ailuminate_pairs.jsonl (400 preference pairs: 100 prompts × 4 principles)
# - data/splits/default_split.json (train/test split configuration)
```

Train using the split configuration:

```bash
python scripts/train_dpo.py \
  --dataset results/ailuminate_pairs.jsonl \
  --base-model qwen2_7b \
  --use-split-config \
  --output models/qwen-7b-contemplative \
  --epochs 3 \
  --device mps  # or cuda for GPUs

# The trainer automatically:
# - Loads the split configuration
# - Trains on the training set
# - Evaluates on the test set
```

Filter by specific hazard categories:
```bash
python scripts/generate_cai_data.py \
  --use-ailuminate \
  --hazard-categories vcr cse ssh \
  --persona-types skilled \
  --constitution data/constitutions/contemplative_principles.md \
  --model qwen2_7b \
  --create-split \
  --output results/physical_hazards.jsonl
```

Generate only the training split:

```bash
python scripts/generate_cai_data.py \
  --use-ailuminate \
  --split-only train \
  --split-config data/splits/default_split.json \
  --constitution data/constitutions/contemplative_principles.md \
  --model qwen2_7b \
  --output results/train_pairs.jsonl
```

- QWEN 2.5 (0.5B, 1.5B, 7B, 14B, 32B) - Primary focus
- 0.5B/1.5B for local PoC on MacBook M2
- 7B+ for production quality
- Llama 3.1/3.2 models
- Mistral models
- Other instruction-tuned models
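An entry in `configs/model_configs.yaml` might look like the following; the field names and the Hugging Face repo id are assumptions for illustration, not the file's actual contents:

```yaml
qwen2_7b:
  hf_path: Qwen/Qwen2.5-7B-Instruct  # assumed Hugging Face repo id
  dtype: bfloat16
  max_seq_len: 2048
  device: mps                        # or cuda / cpu
```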
- Generate responses from base model
- Critique using contemplative principles
- Revise responses based on critiques
- Create preference pairs (original vs revised)
- Train with DPO/PPO on preference data
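For intuition, the per-pair DPO objective can be sketched as below. This is a hedged illustration, not the project's actual trainer; inputs are summed log-probabilities of each full response under the policy being trained and under the frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # How much more the policy prefers 'chosen' over 'rejected'
    # than the reference model does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written stably via log1p
    return math.log1p(math.exp(-beta * margin))

# Loss shrinks as the policy widens the chosen-vs-rejected gap.
print(dpo_loss(-10.0, -14.0, -11.0, -13.0) > dpo_loss(-9.0, -15.0, -11.0, -13.0))  # True
```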
- Responses should reflect interdependence (emptiness)
- Avoid reinforcing harmful dualistic thinking (non-duality)
- Show genuine care for all beings (boundless care)
- Encourage present-moment clarity (mindfulness)
AILuminate Demo Dataset (included as submodule):
- 1,290 prompts across 14 hazard categories
- 1,290 × 4 principles = 5,160 preference pairs
- Phase 0 (PoC): 500-1K pairs ✅ SUFFICIENT
- Phase 1 (Dev): 5K-10K pairs ✅ SUFFICIENT
AILuminate Practice Dataset (requires MLCommons membership):
- 12,000 prompts
- 12,000 × 4 = 48,000 preference pairs
- Phase 2+ (Production): 40K+ pairs ✅ SUFFICIENT
Anthropic HH-RLHF (alternative/supplementary):
- 160,000 conversations
- See IMPLEMENTATION_PLAN.md for details
Compare finetuned models against base models on:
- Safety Benchmarks: AILuminate (24K prompts, 14 hazard categories)
- Contemplative Metrics: Custom evaluations for 4 contemplative principles
- Capability Benchmarks: MT-Bench, MMLU, HumanEval
- General Capabilities: Ensure no degradation in core abilities
Evaluation using AILuminate includes:
- Harmfulness assessment (35% weight)
- Refusal clarity (25% weight)
- Bias mitigation (20% weight)
- Uncertainty acknowledgment (20% weight)
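The weighted composite implied by these percentages can be sketched as below. It assumes each component is scored on a 0-1 scale where higher is better (so the harmfulness component acts as a "harm avoidance" score); the names are illustrative:

```python
# Weights from the evaluation breakdown above.
WEIGHTS = {
    "harm_avoidance": 0.35,
    "refusal_clarity": 0.25,
    "bias_mitigation": 0.20,
    "uncertainty_ack": 0.20,
}

def composite_score(scores: dict) -> float:
    assert set(scores) == set(WEIGHTS)
    return sum(w * scores[name] for name, w in WEIGHTS.items())

print(round(composite_score({"harm_avoidance": 0.9, "refusal_clarity": 0.8,
                             "bias_mitigation": 0.7, "uncertainty_ack": 0.6}), 3))  # 0.775
```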
```
├── src/
│   ├── cai/
│   │   └── pipeline.py           # Constitutional AI pipeline
│   ├── constitutional/
│   │   └── config_parser.py      # Parse contemplative principles
│   ├── data/
│   │   ├── ailuminate_loader.py  # AILuminate dataset loader
│   │   └── split_manager.py      # Train/test split management
│   ├── models/
│   │   └── model_loader.py       # Model loading utilities
│   └── training/
│       └── dpo_trainer.py        # DPO training implementation
├── scripts/
│   ├── generate_cai_data.py      # Generate preference pairs
│   ├── train_dpo.py              # Train with DPO
│   └── smoke_test.py             # Environment validation
├── data/
│   ├── constitutions/            # Contemplative principles
│   ├── benchmarks/
│   │   └── ailuminate/           # AILuminate submodule
│   └── splits/                   # Train/test split configs
├── configs/
│   ├── model_configs.yaml        # Model specifications
│   └── training_configs.yaml     # Training parameters
├── docs/
│   ├── AILUMINATE_INTEGRATION.md # Integration details
│   └── AILUMINATE_USAGE.md       # Usage guide
├── results/                      # Experimental results
├── DATA_PIPELINE.md              # Detailed data pipeline
├── EVALUATION_METRICS.md         # Evaluation methodology
├── IMPLEMENTATION_PLAN.md        # Phase-by-phase development
├── HARDWARE_REQUIREMENTS.md      # Hardware specifications
├── PROJECT_STATUS.md             # Current status and next steps
└── DESIGN.md                     # Technical design
```
- DATA_PIPELINE.md - Data sources, AILuminate integration, train/test splits
- AILUMINATE_USAGE.md - Complete AILuminate usage guide
- IMPLEMENTATION_PLAN.md - Phase-by-phase development plan
- PROJECT_STATUS.md - Current status and priorities
- DESIGN.md - Technical architecture
- EVALUATION_METRICS.md - Evaluation framework
- HARDWARE_REQUIREMENTS.md - Hardware specifications
This implements the contemplative principles from "Contemplative Alignment" (arXiv:2504.15125), extending the prompting experiments to full constitutional finetuning.
Our implementation follows the proven Constitutional AI methodology from Hugging Face's Constitutional AI with Open LLMs, adapting their scalable approach for contemplative principles.
Dataset: Uses MLCommons AILuminate v1.0 benchmark for adversarial prompts and safety evaluation.