Problem
Running 1,200 simulations (8 reforms × 75 years × static/dynamic) takes ~200 hours sequentially. Each simulation takes ~20 minutes to calculate baseline and reform income tax totals.
Goal: Complete in 2-3 hours using Google Cloud Batch
Solution
Two-phase parallel approach:
- Phase 1: Compute 75 baseline totals (one per year) → 2 minutes
- Phase 2: Compute 1,200 reform totals using cached baselines → 2 hours
Time: ~2 hours | Cost: ~$4/run
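The job matrix behind these numbers can be enumerated directly; a small sketch, using placeholder reform IDs (option1…option8) since the real definitions live in src/reforms.py:

```python
from itertools import product

# Placeholder reform IDs for illustration; the real definitions live in src/reforms.py.
REFORM_IDS = [f"option{i}" for i in range(1, 9)]   # 8 reforms
YEARS = list(range(2026, 2101))                    # 75 years: 2026-2100
SCORING_TYPES = ["static", "dynamic"]

# Phase 1: one baseline per year, shared by every reform and scoring type.
baseline_tasks = list(YEARS)                                     # 75 tasks
# Phase 2: one reform simulation per (reform, year, scoring type).
reform_tasks = list(product(REFORM_IDS, YEARS, SCORING_TYPES))   # 8 × 75 × 2 = 1,200 tasks

print(len(baseline_tasks), len(reform_tasks))  # 75 1200
```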
Implementation
Prerequisites (Already Complete ✅)
- Google Cloud CLI installed
- Authenticated with [email protected]
- Project: policyengine-api with billing enabled
- APIs enabled: Batch, Compute, Storage, Artifact Registry
1. Create Cloud Storage Bucket
gsutil mb -l us-central1 gs://crfb-ss-analysis-results/
2. Create Files in batch/ Directory
Core files:
- compute_baseline.py - Worker: calculates baseline for one year, saves to bucket (sketched after this list)
- compute_reform.py - Worker: downloads baseline, calculates reform, saves result (sketched after the result format below)
- submit_baselines.py - Submits 75 baseline jobs (75 parallel workers)
- submit_reforms.py - Submits 1,200 reform jobs (200 parallel workers)
- download_results.py - Combines all results into data/policy_impacts_dynamic.csv
- Dockerfile - Container with Python 3.13 + PolicyEngine + dependencies
- requirements.txt - policyengine-us, policyengine-core, pandas, google-cloud-storage
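A minimal sketch of compute_baseline.py, assuming the policyengine-us Microsimulation API and the google-cloud-storage client; dataset selection, uprating, and error handling are omitted:

```python
"""Phase 1 worker (sketch): compute the baseline income tax total for one year."""
import argparse
import json

from google.cloud import storage
from policyengine_us import Microsimulation

BUCKET = "crfb-ss-analysis-results"


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--year", type=int, required=True)
    args = parser.parse_args()

    # Run the baseline microsimulation and total income tax for the year.
    sim = Microsimulation()
    baseline_total = float(sim.calculate("income_tax", period=args.year).sum())

    # Save to gs://crfb-ss-analysis-results/baselines/{year}.json
    blob = storage.Client().bucket(BUCKET).blob(f"baselines/{args.year}.json")
    blob.upload_from_string(
        json.dumps({"year": args.year, "baseline_total": baseline_total}),
        content_type="application/json",
    )


if __name__ == "__main__":
    main()
```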
Result format:
{
"reform_id": "option1",
"reform_name": "Full Repeal...",
"year": 2026,
"baseline_total": 1234567890,
"reform_total": 1111111111,
"revenue_impact": -123456779,
"scoring_type": "dynamic"
}
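A matching sketch of compute_reform.py, which reads the cached baseline, runs one reform simulation, and writes a result in the format above. The get_reform helper is hypothetical; whatever lookup src/reforms.py actually exposes would be used here:

```python
"""Phase 2 worker (sketch): compute one reform total using a cached baseline."""
import argparse
import json

from google.cloud import storage
from policyengine_us import Microsimulation

from src.reforms import get_reform  # hypothetical helper name

BUCKET = "crfb-ss-analysis-results"


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--reform-id", required=True)
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--scoring-type", choices=["static", "dynamic"], required=True)
    parser.add_argument("--job-id", required=True)
    args = parser.parse_args()

    bucket = storage.Client().bucket(BUCKET)

    # Download the cached baseline total computed in Phase 1.
    baseline = json.loads(bucket.blob(f"baselines/{args.year}.json").download_as_text())

    # Run the reform simulation and total income tax for the year.
    reform = get_reform(args.reform_id, scoring_type=args.scoring_type)
    sim = Microsimulation(reform=reform)
    reform_total = float(sim.calculate("income_tax", period=args.year).sum())

    result = {
        "reform_id": args.reform_id,
        "reform_name": args.reform_id,  # or a human-readable name from src/reforms.py
        "year": args.year,
        "baseline_total": baseline["baseline_total"],
        "reform_total": reform_total,
        # Negative when the reform raises less revenue than the baseline.
        "revenue_impact": reform_total - baseline["baseline_total"],
        "scoring_type": args.scoring_type,
    }

    # Save to gs://.../results/{job_id}/{reform_id}_{year}.json
    bucket.blob(f"results/{args.job_id}/{args.reform_id}_{args.year}.json").upload_from_string(
        json.dumps(result), content_type="application/json"
    )


if __name__ == "__main__":
    main()
```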
File Structure
crfb-tob-impacts/
├── batch/
│ ├── compute_baseline.py # Phase 1 worker
│ ├── compute_reform.py # Phase 2 worker
│ ├── submit_baselines.py # Phase 1 submission
│ ├── submit_reforms.py # Phase 2 submission
│ ├── download_results.py # Result aggregation
│ ├── Dockerfile # Container definition
│ └── requirements.txt # Python dependencies
├── src/
│ └── reforms.py # Existing reform definitions
└── data/
└── policy_impacts_dynamic.csv # Final output
Cloud Storage:
gs://crfb-ss-analysis-results/
├── baselines/
│ ├── 2026.json → {"year": 2026, "baseline_total": 1234567890}
│ ├── 2027.json
│ └── ... (75 files, ~10KB total)
└── results/
└── reform-abc123/
├── option1_2026.json
└── ... (1,200 files, ~50KB total)
3. Build Docker Container
gcloud config set project policyengine-api
gcloud builds submit --tag gcr.io/policyengine-api/ss-calculator:latest batch/
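For the submission scripts, a sketch of how submit_reforms.py could create the Phase 2 array job with the google-cloud-batch client, pointing at the image built above; machine sizing and the worker entrypoint are illustrative assumptions (each task can map the built-in BATCH_TASK_INDEX variable to a reform/year/scoring-type combination):

```python
"""Submit the Phase 2 array job to Google Cloud Batch (sketch)."""
from google.cloud import batch_v1

PROJECT = "policyengine-api"
REGION = "us-central1"
IMAGE = "gcr.io/policyengine-api/ss-calculator:latest"


def submit_reform_job(task_count: int = 1200, parallelism: int = 200) -> batch_v1.Job:
    client = batch_v1.BatchServiceClient()

    # Each task runs the compute_reform.py worker inside the container.
    runnable = batch_v1.Runnable(
        container=batch_v1.Runnable.Container(
            image_uri=IMAGE,
            commands=["python", "batch/compute_reform.py"],
        )
    )

    task_spec = batch_v1.TaskSpec(
        runnables=[runnable],
        compute_resource=batch_v1.ComputeResource(cpu_milli=2000, memory_mib=8192),
    )

    job = batch_v1.Job(
        task_groups=[
            batch_v1.TaskGroup(
                task_spec=task_spec,
                task_count=task_count,     # 1,200 reform simulations
                parallelism=parallelism,   # up to 200 running at once
            )
        ],
        # Spot VMs for cost savings, as noted in Technical Details.
        allocation_policy=batch_v1.AllocationPolicy(
            instances=[
                batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
                    policy=batch_v1.AllocationPolicy.InstancePolicy(
                        provisioning_model=batch_v1.AllocationPolicy.ProvisioningModel.SPOT
                    )
                )
            ]
        ),
        logs_policy=batch_v1.LogsPolicy(
            destination=batch_v1.LogsPolicy.Destination.CLOUD_LOGGING
        ),
    )

    return client.create_job(
        parent=f"projects/{PROJECT}/locations/{REGION}",
        job=job,
        job_id="ss-reforms",
    )
```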
4. Test (2 years, 2 reforms = 4 simulations)
python batch/submit_baselines.py --years 2026,2027
python batch/submit_reforms.py --reforms option1,option2 --years 2026,2027
python batch/download_results.py --job-id [JOB_ID]
5. Run Full Job
python batch/submit_baselines.py --years 2026-2100 # 2 mins
python batch/submit_reforms.py --years 2026-2100 # 2 hours
python batch/download_results.py --job-id [JOB_ID] # saves CSV with 1,200 rows
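And a sketch of download_results.py, assuming it just lists the result blobs for one job and concatenates them with pandas into the final CSV:

```python
"""Aggregate per-simulation result JSONs into data/policy_impacts_dynamic.csv (sketch)."""
import argparse
import json

import pandas as pd
from google.cloud import storage

BUCKET = "crfb-ss-analysis-results"
COLUMNS = [
    "reform_id", "reform_name", "year", "baseline_total",
    "reform_total", "revenue_impact", "scoring_type",
]


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-id", required=True)
    parser.add_argument("--output", default="data/policy_impacts_dynamic.csv")
    args = parser.parse_args()

    client = storage.Client()
    # Read every results/{job_id}/{reform}_{year}.json blob for this job.
    rows = [
        json.loads(blob.download_as_text())
        for blob in client.list_blobs(BUCKET, prefix=f"results/{args.job_id}/")
    ]

    df = pd.DataFrame(rows, columns=COLUMNS).sort_values(["reform_id", "year"])
    df.to_csv(args.output, index=False)
    print(f"Wrote {len(df)} rows to {args.output}")


if __name__ == "__main__":
    main()
```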
Success Criteria
- Test run (4 sims) completes and matches notebook results
- Full run (1,200 sims) completes in < 3 hours
- Final CSV has 1,200 rows with columns: reform_id, reform_name, year, baseline_total, reform_total, revenue_impact, scoring_type
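A quick sanity check against these criteria (a sketch; path and columns as described above):

```python
import pandas as pd

df = pd.read_csv("data/policy_impacts_dynamic.csv")

# 1,200 rows: 8 reforms × 75 years × static/dynamic.
assert len(df) == 1200, len(df)

# Exactly the expected columns.
assert list(df.columns) == [
    "reform_id", "reform_name", "year", "baseline_total",
    "reform_total", "revenue_impact", "scoring_type",
]

# revenue_impact should equal reform_total minus baseline_total.
assert (df["revenue_impact"] - (df["reform_total"] - df["baseline_total"])).abs().max() < 1e-6
```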
Technical Details
- Uses CBO labor supply elasticities for dynamic scoring
- Spot instances for cost savings ($0.01/hr per VM)
- Phase 1 avoids computing each baseline 8× (once per reform)
- All logic matches jupyterbook/policy-impacts-dynamic.ipynb