Problem
Running 1,200 simulations (8 reforms × 75 years × static/dynamic) takes ~200 hours sequentially. Each simulation takes ~20 minutes to calculate baseline and reform income tax totals.
Goal: Complete in 2-3 hours using Google Cloud Batch
Solution
Two-phase parallel approach:
- Phase 1: Compute 75 baseline totals (one per year) → 2 minutes
- Phase 2: Compute 1,200 reform totals using cached baselines → 2 hours
Time: ~2 hours | Cost: ~$4/run
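The job matrix behind these numbers can be enumerated directly; a small sketch, using placeholder reform IDs (option1…option8) since the real definitions live in src/reforms.py:

```python
from itertools import product

# Placeholder reform IDs for illustration; the real definitions live in src/reforms.py.
REFORM_IDS = [f"option{i}" for i in range(1, 9)]   # 8 reforms
YEARS = list(range(2026, 2101))                    # 75 years: 2026-2100
SCORING_TYPES = ["static", "dynamic"]

# Phase 1: one baseline per year, shared by every reform and scoring type.
baseline_tasks = list(YEARS)                                     # 75 tasks
# Phase 2: one reform simulation per (reform, year, scoring type).
reform_tasks = list(product(REFORM_IDS, YEARS, SCORING_TYPES))   # 8 × 75 × 2 = 1,200 tasks

print(len(baseline_tasks), len(reform_tasks))  # 75 1200
```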
Implementation
Prerequisites (Already Complete ✅)
- Google Cloud CLI installed
- Authenticated with [email protected]
- Project: policyengine-api with billing enabled
- APIs enabled: Batch, Compute, Storage, Artifact Registry
1. Create Cloud Storage Bucket
gsutil mb -l us-central1 gs://crfb-ss-analysis-results/
2. Create Files in batch/ Directory
Core files:
- compute_baseline.py - Worker: calculates baseline for one year, saves to bucket (sketched after this list)
- compute_reform.py - Worker: downloads baseline, calculates reform, saves result (sketched after the result format below)
- submit_baselines.py - Submits 75 baseline jobs (75 parallel workers)
- submit_reforms.py - Submits 1,200 reform jobs (200 parallel workers)
- download_results.py - Combines all results into data/policy_impacts_dynamic.csv
- Dockerfile - Container with Python 3.13 + PolicyEngine + dependencies
- requirements.txt - policyengine-us, policyengine-core, pandas, google-cloud-storage
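A minimal sketch of compute_baseline.py, assuming the policyengine-us Microsimulation API and the google-cloud-storage client; dataset selection, uprating, and error handling are omitted:

```python
"""Phase 1 worker (sketch): compute the baseline income tax total for one year."""
import argparse
import json

from google.cloud import storage
from policyengine_us import Microsimulation

BUCKET = "crfb-ss-analysis-results"


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--year", type=int, required=True)
    args = parser.parse_args()

    # Run the baseline microsimulation and total income tax for the year.
    sim = Microsimulation()
    baseline_total = float(sim.calculate("income_tax", period=args.year).sum())

    # Save to gs://crfb-ss-analysis-results/baselines/{year}.json
    blob = storage.Client().bucket(BUCKET).blob(f"baselines/{args.year}.json")
    blob.upload_from_string(
        json.dumps({"year": args.year, "baseline_total": baseline_total}),
        content_type="application/json",
    )


if __name__ == "__main__":
    main()
```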
Result format:
{
"reform_id": "option1",
"reform_name": "Full Repeal...",
"year": 2026,
"baseline_total": 1234567890,
"reform_total": 1111111111,
"revenue_impact": -123456779,
"scoring_type": "dynamic"
}
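A matching sketch of compute_reform.py, which reads the cached baseline, runs one reform simulation, and writes a result in the format above. The get_reform helper is hypothetical; whatever lookup src/reforms.py actually exposes would be used here:

```python
"""Phase 2 worker (sketch): compute one reform total using a cached baseline."""
import argparse
import json

from google.cloud import storage
from policyengine_us import Microsimulation

from src.reforms import get_reform  # hypothetical helper name

BUCKET = "crfb-ss-analysis-results"


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--reform-id", required=True)
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--scoring-type", choices=["static", "dynamic"], required=True)
    parser.add_argument("--job-id", required=True)
    args = parser.parse_args()

    bucket = storage.Client().bucket(BUCKET)

    # Download the cached baseline total computed in Phase 1.
    baseline = json.loads(bucket.blob(f"baselines/{args.year}.json").download_as_text())

    # Run the reform simulation and total income tax for the year.
    reform = get_reform(args.reform_id, scoring_type=args.scoring_type)
    sim = Microsimulation(reform=reform)
    reform_total = float(sim.calculate("income_tax", period=args.year).sum())

    result = {
        "reform_id": args.reform_id,
        "reform_name": args.reform_id,  # or a human-readable name from src/reforms.py
        "year": args.year,
        "baseline_total": baseline["baseline_total"],
        "reform_total": reform_total,
        # Negative when the reform raises less revenue than the baseline.
        "revenue_impact": reform_total - baseline["baseline_total"],
        "scoring_type": args.scoring_type,
    }

    # Save to gs://.../results/{job_id}/{reform_id}_{year}.json
    bucket.blob(f"results/{args.job_id}/{args.reform_id}_{args.year}.json").upload_from_string(
        json.dumps(result), content_type="application/json"
    )


if __name__ == "__main__":
    main()
```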
File Structure
crfb-tob-impacts/
├── batch/
│ ├── compute_baseline.py # Phase 1 worker
│ ├── compute_reform.py # Phase 2 worker
│ ├── submit_baselines.py # Phase 1 submission
│ ├── submit_reforms.py # Phase 2 submission
│ ├── download_results.py # Result aggregation
│ ├── Dockerfile # Container definition
│ └── requirements.txt # Python dependencies
├── src/
│ └── reforms.py # Existing reform definitions
└── data/
└── policy_impacts_dynamic.csv # Final output
Cloud Storage:
gs://crfb-ss-analysis-results/
├── baselines/
│ ├── 2026.json → {"year": 2026, "baseline_total": 1234567890}
│ ├── 2027.json
│ └── ... (75 files, ~10KB total)
└── results/
└── reform-abc123/
├── option1_2026.json
└── ... (1,200 files, ~50KB total)
3. Build Docker Container
gcloud config set project policyengine-api
gcloud builds submit --tag gcr.io/policyengine-api/ss-calculator:latest batch/
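For the submission scripts, a sketch of how submit_reforms.py could create the Phase 2 array job with the google-cloud-batch client, pointing at the image built above; machine sizing and the worker entrypoint are illustrative assumptions (each task can map the built-in BATCH_TASK_INDEX variable to a reform/year/scoring-type combination):

```python
"""Submit the Phase 2 array job to Google Cloud Batch (sketch)."""
from google.cloud import batch_v1

PROJECT = "policyengine-api"
REGION = "us-central1"
IMAGE = "gcr.io/policyengine-api/ss-calculator:latest"


def submit_reform_job(task_count: int = 1200, parallelism: int = 200) -> batch_v1.Job:
    client = batch_v1.BatchServiceClient()

    # Each task runs the compute_reform.py worker inside the container.
    runnable = batch_v1.Runnable(
        container=batch_v1.Runnable.Container(
            image_uri=IMAGE,
            commands=["python", "batch/compute_reform.py"],
        )
    )

    task_spec = batch_v1.TaskSpec(
        runnables=[runnable],
        compute_resource=batch_v1.ComputeResource(cpu_milli=2000, memory_mib=8192),
    )

    job = batch_v1.Job(
        task_groups=[
            batch_v1.TaskGroup(
                task_spec=task_spec,
                task_count=task_count,     # 1,200 reform simulations
                parallelism=parallelism,   # up to 200 running at once
            )
        ],
        # Spot VMs for cost savings, as noted in Technical Details.
        allocation_policy=batch_v1.AllocationPolicy(
            instances=[
                batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
                    policy=batch_v1.AllocationPolicy.InstancePolicy(
                        provisioning_model=batch_v1.AllocationPolicy.ProvisioningModel.SPOT
                    )
                )
            ]
        ),
        logs_policy=batch_v1.LogsPolicy(
            destination=batch_v1.LogsPolicy.Destination.CLOUD_LOGGING
        ),
    )

    return client.create_job(
        parent=f"projects/{PROJECT}/locations/{REGION}",
        job=job,
        job_id="ss-reforms",
    )
```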
4. Test (2 years, 2 reforms = 4 simulations)
python batch/submit_baselines.py --years 2026,2027
python batch/submit_reforms.py --reforms option1,option2 --years 2026,2027
python batch/download_results.py --job-id [JOB_ID]
5. Run Full Job
python batch/submit_baselines.py --years 2026-2100 # 2 mins
python batch/submit_reforms.py --years 2026-2100 # 2 hours
python batch/download_results.py --job-id [JOB_ID] # saves CSV with 1,200 rows
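And a sketch of download_results.py, assuming it just lists the result blobs for one job and concatenates them with pandas into the final CSV:

```python
"""Aggregate per-simulation result JSONs into data/policy_impacts_dynamic.csv (sketch)."""
import argparse
import json

import pandas as pd
from google.cloud import storage

BUCKET = "crfb-ss-analysis-results"
COLUMNS = [
    "reform_id", "reform_name", "year", "baseline_total",
    "reform_total", "revenue_impact", "scoring_type",
]


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-id", required=True)
    parser.add_argument("--output", default="data/policy_impacts_dynamic.csv")
    args = parser.parse_args()

    client = storage.Client()
    # Read every results/{job_id}/{reform}_{year}.json blob for this job.
    rows = [
        json.loads(blob.download_as_text())
        for blob in client.list_blobs(BUCKET, prefix=f"results/{args.job_id}/")
    ]

    df = pd.DataFrame(rows, columns=COLUMNS).sort_values(["reform_id", "year"])
    df.to_csv(args.output, index=False)
    print(f"Wrote {len(df)} rows to {args.output}")


if __name__ == "__main__":
    main()
```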
Success Criteria
- Test run (4 sims) completes and matches notebook results
- Full run (1,200 sims) completes in < 3 hours
- Final CSV has 1,200 rows with columns: reform_id, reform_name, year, baseline_total, reform_total, revenue_impact, scoring_type
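A quick sanity check against these criteria (a sketch; path and columns as described above):

```python
import pandas as pd

df = pd.read_csv("data/policy_impacts_dynamic.csv")

# 1,200 rows: 8 reforms × 75 years × static/dynamic.
assert len(df) == 1200, len(df)

# Exactly the expected columns.
assert list(df.columns) == [
    "reform_id", "reform_name", "year", "baseline_total",
    "reform_total", "revenue_impact", "scoring_type",
]

# revenue_impact should equal reform_total minus baseline_total.
assert (df["revenue_impact"] - (df["reform_total"] - df["baseline_total"])).abs().max() < 1e-6
```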
Technical Details
- Uses CBO labor supply elasticities for dynamic scoring
- Spot instances for cost savings ($0.01/hr per VM)
- Phase 1 avoids computing each baseline 8× (once per reform)
- All logic matches jupyterbook/policy-impacts-dynamic.ipynb