New use case: Database schema migration safety and lifecycle pipeline — DDL review, app-compat cross-check, runbook generation, and drift sentinel

🤖 **Kelos Strategist Agent** @gjkim42

## Summary

Propose a new high-value use case for Kelos: **a database schema migration safety and lifecycle pipeline**. Bad SQL/DDL migrations remain one of the top causes of production incidents (lock contention, accidentally non-backward-compatible changes, slow online operations, missing rollback paths) — yet review and operational planning are still done manually by senior engineers and DBAs. Kelos is uniquely positioned to automate this because (a) the safe-migration body of knowledge is a strong fit for AI synthesis, (b) the inputs are tightly bounded (DDL/migration files), and (c) the existing primitives — `githubPullRequests` with `filePatterns`, `commentPolicy`, `dependsOn`, and `cron` — already compose into the right multi-stage workflow. Critically, this is **distinct from existing proposals**: #926 covers framework/library version migrations (React 18→19 etc.) — application-code upgrades, not database DDL.

## Problem

### Database migrations are a top cause of production incidents

Industry surveys consistently rank schema changes among the leading causes of database-related outages. The failure modes are well-known but pervasive:

1. **Lock contention** — `ALTER TABLE ADD COLUMN NOT NULL` rewrites the whole table on most engines; on a 50M-row table this means minutes-to-hours of downtime. Adding indexes without `CONCURRENTLY` (Postgres) or `ALGORITHM=INPLACE` (MySQL) blocks writes.
2. **Backward incompatibility with running app code** — Renaming a column the app still reads, dropping a column before the application stops referencing it, or changing a type in a way that fails on existing rows. This breaks the rolling-deploy contract.
3. **Missing or unsafe rollback paths** — `DROP COLUMN` cannot be undone without backup. Many migration frameworks support `down()` migrations that are silently absent or wrong.
4. **Slow online operations** — A `UPDATE ... WHERE big_condition` on millions of rows in a single transaction can blow up WAL/binlog and replication lag.
5. **Multi-statement transactions that block replication** — DDL in a transaction holds locks longer than necessary; some statements (e.g., MySQL `CREATE INDEX` historically) implicitly commit.
6. **Cross-environment drift** — Staging and production schema diverge silently when migrations are skipped or applied out of order, and nobody notices until a failed deploy.
7. **Constraint additions without validation phase** — `ADD FOREIGN KEY` without `NOT VALID + VALIDATE` (Postgres) takes a `ShareRowExclusive` lock and validates the entire table inline.

### Why existing tooling falls short

- **`squawk`, `pglint`, `dba-cli`** — Static linters catch a known subset of antipatterns but cannot reason about repo state (e.g., "is this column used in `internal/api/handler.go`?"), application backward compatibility, or operational impact.
- **`atlas`, `dbmate`, `goose`, `flyway`, `alembic`, `knex`, golang-migrate** — Apply migrations but do not analyze risk or generate reverse migrations.
- **`gh-ost`, `pt-online-schema-change`** — Execute online operations safely but require human authoring of the target schema and operational plan.
- **PR review** — Senior engineers manually eyeball every migration. Quality is uneven, doesn't scale, and tribal knowledge isn't captured.

An autonomous agent can do something static tools can't: read the migration, read application code that references the affected tables/columns, read the `git log` of past migrations for conventions, read the engine version from CI config, and synthesize a structured risk assessment plus a concrete operational runbook.

## Proposed Solution

### Workflow Overview

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐    ┌──────────────┐
│ 1. Static-Risk  │    │ 2. App Compat    │    │ 3. Runbook +    │    │ 4. Sentinel: │
│    Review       │───▶│    Cross-Check   │───▶│    Reverse Gen  │───▶│ Schema-Drift │
│  (PR-trigger)   │    │ (PR-trigger)     │    │ (post-approval) │    │  (cron)      │
│                 │    │                  │    │                 │    │              │
│ • DDL safety    │    │ • Find readers   │    │ • Step-by-step  │    │ • Diff       │
│ • Lock impact   │    │   of changed     │    │   rollout       │    │   environs   │
│ • Reversibility │    │   columns        │    │ • Reverse       │    │ • Open       │
│ • Antipatterns  │    │ • Estimate       │    │   migration     │    │   issue if   │
│ • Severity tag  │    │   blast radius   │    │ • Timing est    │    │   diverged   │
└─────────────────┘    └──────────────────┘    └─────────────────┘    └──────────────┘
        │                       │                       ▲
        └───── PR comment ──────┴───── PR comment ──────┘
                                                        │
                                            triggered by `/kelos
                                            migration-runbook` comment
                                            from authorized reviewer
```

Each stage is a separate Kelos Task. Stages 1–2 run automatically per PR. Stage 3 is human-gated via `commentPolicy`. Stage 4 runs on a `cron` schedule independent of any PR.

### Example 1: Static migration safety reviewer (PR-triggered)

Targets PRs that touch migration files, regardless of label. Posts a structured risk assessment as a PR comment.

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-migration-safety-reviewer
spec:
  when:
    githubPullRequests:
      state: open
      filePatterns:
        include:
          # Common migration directory conventions across frameworks:
          - "db/migrations/**/*.sql"
          - "db/migrate/**/*.rb"        # Rails
          - "migrations/**/*.sql"       # golang-migrate, dbmate, goose
          - "alembic/versions/**/*.py"  # SQLAlchemy/Alembic
          - "prisma/migrations/**/*.sql"
          - "supabase/migrations/**/*.sql"
          - "ent/migrate/migrations/**/*.sql"
          - "db/schema.rb"              # Rails schema dump
          - "atlas.hcl"                 # Atlas
        exclude:
          - "**/seeds/**"
          - "**/fixtures/**"
      reporting:
        enabled: true
  maxConcurrency: 3
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: "{{.Branch}}"
    promptTemplate: |
      You are a database migration safety reviewer for PR #{{.Number}}: {{.Title}}.
      Detect the database engine (Postgres / MySQL / SQLite / etc.) from
      go.mod / package.json / requirements.txt / Gemfile or CI config.

      Review only the migration files changed in this PR. For each migration,
      analyze and report:

      ## 1. Operation classification
      Classify each statement as one of:
      - SAFE        — no lock or rewrites (e.g., add nullable column, add index CONCURRENTLY)
      - SLOW        — long-running but online (e.g., backfill UPDATE)
      - BLOCKING    — acquires a lock that blocks reads/writes
      - DESTRUCTIVE — irreversible without backup (DROP COLUMN/TABLE/INDEX)

      ## 2. Engine-specific antipattern checks
      Examples (call out matches; this is non-exhaustive):
      - Postgres: ADD COLUMN with non-volatile DEFAULT on PG 11+ is fast;
        ADD COLUMN NOT NULL without DEFAULT is fine on PG 11+; CREATE INDEX
        without CONCURRENTLY is blocking; ALTER TYPE that requires rewrite;
        ADD FOREIGN KEY without NOT VALID + VALIDATE; renaming a column.
      - MySQL: ALGORITHM=COPY operations; lack of pt-online-schema-change/
        gh-ost annotation for >1M-row tables; implicit commit on DDL.
      - SQLite: ALTER TABLE limitations; rebuilding via 12-step procedure.

      ## 3. Reversibility
      For each statement, can the migration be rolled back without data loss?
      If a `down()` migration exists, does it actually undo the up()?

      ## 4. Backward compatibility with running application
      Search the repo for code that reads/writes the affected tables/columns.
      If the app references a column being dropped/renamed/retyped, the
      migration is incompatible with a rolling deploy. Flag this explicitly.

      ## 5. Severity rating
      Overall PR severity: LOW / MEDIUM / HIGH / BLOCKER.
      BLOCKER means the migration cannot be deployed safely as-is.

      ## 6. Suggested fixes
      For each HIGH/BLOCKER issue, propose a concrete safer alternative
      (e.g., "Split into add-nullable-then-backfill-then-set-not-null
      across three migrations").

      Post the review as a single PR comment using `gh pr comment`. Use
      a fenced markdown structure with the sections above. Do NOT push
      code or modify files — review only.
    ttlSecondsAfterFinished: 3600
```

### Example 2: Application backward-compatibility cross-check (PR-triggered)

Runs in parallel with Example 1 but focuses specifically on application code references — a complementary signal to pure DDL analysis.

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-migration-app-compat
spec:
  when:
    githubPullRequests:
      state: open
      filePatterns:
        include:
          - "db/migrations/**"
          - "migrations/**/*.sql"
          - "alembic/versions/**/*.py"
          - "prisma/migrations/**/*.sql"
      reporting:
        enabled: true
  maxConcurrency: 2
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: "{{.Branch}}"
    promptTemplate: |
      Application backward-compatibility check for PR #{{.Number}}.

      For each migration file changed in this PR, identify every table and
      column being modified, dropped, renamed, or retyped. Then for each
      such schema element:

      1. Search the application source tree (excluding the migration dir
         itself) for references to that table/column. Use `git grep` and
         look for ORM models, raw SQL strings, query builders, and any
         generated code (e.g., sqlc, ent, prisma client).

      2. If the column is being **dropped or renamed**, but live code
         still references it, this breaks the rolling-deploy invariant.
         List every file:line that needs to change first.

      3. If the column is being **retyped**, check whether existing code
         assumes the old type's range / nullability / encoding.

      4. For new NOT NULL columns, check whether existing INSERTs in
         application code provide a value.

      Post the findings as a PR comment with this structure:

      | Schema element | Change | App references | Compatible? | Action |

      Recommend whether the migration can ship in this PR or must be
      split into a backward-compatible sequence (expand → migrate → contract).

      Do NOT modify files — analysis only.
    ttlSecondsAfterFinished: 3600
```

### Example 3: Operational runbook + reverse-migration generator (comment-triggered)

After human reviewers approve the static checks, an authorized reviewer comments `/kelos migration-runbook` to trigger the runbook stage. This is a **write-mode** task that pushes additional artifacts to the PR branch.

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-migration-runbook
spec:
  when:
    githubPullRequests:
      state: open
      filePatterns:
        include:
          - "db/migrations/**"
          - "migrations/**/*.sql"
          - "alembic/versions/**/*.py"
      commentPolicy:
        triggerComment: "/kelos migration-runbook"
        minimumPermission: write
      reporting:
        enabled: true
  maxConcurrency: 1
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: "{{.Branch}}"
    promptTemplate: |
      Generate operational artifacts for the migration in PR #{{.Number}}.

      For each new migration file in this PR, produce alongside it
      (e.g., `001_add_email_index.runbook.md`):

      ## Operational runbook
      - Pre-flight checks (table size, current locks, replica lag)
      - Recommended execution window (consider blocking impact)
      - Rough timing estimate based on table size — query
        `pg_stat_user_tables` / `INFORMATION_SCHEMA.TABLES` if a
        connection string secret is mounted, otherwise estimate by
        order of magnitude
      - Step-by-step rollout (e.g., expand → backfill → contract phases)
      - Rollback procedure (with exact SQL)
      - Monitoring signals to watch (lock waits, replication lag,
        error rates) and abort criteria

      ## Reverse migration
      For frameworks that support up/down (golang-migrate, alembic, knex):
      generate the corresponding `down` migration. Verify it's actually
      reversible — for irreversible changes (DROP COLUMN/TABLE), produce
      the reverse as a comment with the explicit warning that data loss
      will occur and a backup is required.

      Commit the runbook and reverse migration to the PR branch and push.
      Add a PR comment summarizing what was generated and where reviewers
      should focus their final attention.
    ttlSecondsAfterFinished: 7200
```

### Example 4: Schema-drift sentinel (cron-scheduled)

Independent of any PR. Periodically diffs the desired schema (from migration files / `schema.sql` dump) against applied schemas in non-prod environments to catch drift before it causes a deploy failure.

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-schema-drift-sentinel
spec:
  when:
    cron:
      schedule: "0 6 * * *"  # Daily at 06:00 UTC
  maxConcurrency: 1
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: main
    podOverrides:
      env:
        - name: STAGING_DB_URL
          valueFrom:
            secretKeyRef:
              name: staging-readonly-db
              key: DATABASE_URL
        - name: PROD_DB_URL
          valueFrom:
            secretKeyRef:
              name: prod-readonly-db
              key: DATABASE_URL
    promptTemplate: |
      Detect schema drift between the migration history in this repo and
      the live schemas of staging and production.

      1. Determine the database engine and reconstruct the desired schema
         from the migration files (use `atlas schema inspect`, `pg_dump
         --schema-only`, or `mysqldump --no-data` against a temporary
         throwaway DB after applying migrations — or use the framework's
         own dry-run feature).

      2. For each environment (staging then prod), connect read-only and
         dump the live schema.

      3. Run a structural diff. Flag ONLY meaningful divergence — ignore
         column ordering, comments, default value formatting differences.

      4. For each divergence, classify:
         - "drift: applied locally, missing in repo" — an out-of-band hand-edit
         - "drift: in repo, missing in env" — a migration was skipped
         - "version skew: env has older head than repo HEAD"

      5. If any divergence is found, open a single GitHub issue titled
         "Schema drift detected: <env>" with the diff and recommended
         remediation. Skip if a similar open issue already exists. Label
         it `db/drift`. If no drift, exit without creating any issue
         (do not create empty 'all clear' issues).

      Do NOT modify the live database under any circumstance.
    ttlSecondsAfterFinished: 3600
```

## Patterns the agent should detect

A non-exhaustive checklist the prompt can be tightened against over time. This is the "expert knowledge" most teams want captured in tooling:

| Pattern | Engine | Severity | Safer alternative |
|---|---|---|---|
| `ALTER TABLE ADD COLUMN NOT NULL` without DEFAULT on big table | Postgres ≤10 / MySQL | HIGH | Add nullable → backfill → SET NOT NULL |
| `ALTER TABLE ADD COLUMN ... DEFAULT volatile_func()` | Postgres ≥11 | HIGH | Use immutable default or split |
| `CREATE INDEX` (not CONCURRENTLY) | Postgres | HIGH | `CREATE INDEX CONCURRENTLY` |
| `ALTER TABLE ADD FOREIGN KEY` without NOT VALID | Postgres | MEDIUM | `ADD ... NOT VALID; VALIDATE CONSTRAINT;` |
| `ALTER COLUMN TYPE` requiring rewrite | Postgres | HIGH | New column + backfill + swap |
| `DROP COLUMN` while app still references | Any | BLOCKER | Stop reading first → release → drop next deploy |
| Renaming a column/table | Any | BLOCKER | Add new + dual-write + cutover + drop |
| Big single-statement `UPDATE` | Any | HIGH | Batched updates with smaller transactions |
| MySQL DDL on InnoDB without ONLINE/INPLACE clause | MySQL | HIGH | gh-ost / pt-online-schema-change |
| Down migration that's empty or wrong | Any | MEDIUM | Generate correct reverse |
| Two unrelated changes in one migration | Any | LOW | Split for atomic rollback |

## Kelos Features Leveraged

| Feature | How it's used |
|---|---|
| `githubPullRequests` source | Triggers safety + compat reviewers on every PR |
| `filePatterns.include/exclude` (#778) | Targets only migration directories — no false positives on app PRs |
| `commentPolicy.triggerComment` + `minimumPermission` | Gates the runbook generator behind a write-permission reviewer |
| `reporting.enabled` | Posts review findings as PR comments |
| `cron` source | Drives the daily drift sentinel |
| `dependsOn` (Task pipeline) | Optional: chain reviewer → runbook generator as a strict pipeline |
| `podOverrides.env` + `valueFrom.secretKeyRef` | Mount read-only DB connection strings into the drift sentinel |
| `maxConcurrency` | Prevents runaway parallel runs on a busy repo |
| `AgentConfig` (optional) | Pre-canned safe-migration knowledge file shared across all four spawners |

## Relationship to Existing Proposals

| Issue | Relationship |
|---|---|
| #926 (Framework/language version migration) | **Distinct.** That issue covers application-code upgrades (React 18→19, Go 1.21→1.23). This issue covers database DDL migrations — a fundamentally different problem (lock acquisition, online operations, replication). |
| #778 (filePatterns) — closed | **Required dependency.** This use case is unviable without it; with it, the implementation is straightforward. |
| #946 (CI/CD failure auto-remediation) | Complementary. If a deploy fails because of a schema-drift mismatch, #946's auto-remediation could open a fix PR; this proposal prevents the failure upstream. |
| #983 (TaskPipeline CRD) | Enhancement. The Static→AppCompat→Runbook flow could be expressed as a single TaskPipeline rather than three independent spawners. |
| #881 (contextSources) | Enhancement. Pulling current Postgres/MySQL release notes or an internal "safe migrations playbook" into the prompt would tighten the analysis. |
| #887 (completionPolicy) | Enhancement. A completion check that the runbook task actually committed a `*.runbook.md` file before marking the Task succeeded. |
| #743 (codebase health) — closed | Different scope. Health monitoring is broad and scheduled; this is a precision review on a specific file class with a specific risk model. |
| #884 (cross-repo change propagation) | Composable. After a schema migration ships in the data-layer repo, a downstream propagation could regenerate ORM clients in consumer repos. |

## Why This Matters

- **Universally applicable.** Every team that owns a database has migrations. Unlike many proposed use cases that target a specific subdomain, this one applies to virtually all of Kelos's potential audience.
- **Senior-engineer leverage.** Migration review is a recurring high-attention task that senior engineers and DBAs do manually today. Automating the first pass with an AI agent and reserving humans for judgment calls is a textbook leverage win.
- **Strong demo material.** A PR comment that says "BLOCKER: this `ALTER TABLE … TYPE` will rewrite a 50M-row table; here's the safer 3-migration sequence" is a vivid, instantly-comprehensible demonstration of Kelos's value.
- **Low blast radius.** The reviewer stages are read-only. The runbook stage is human-gated. The drift sentinel reads from non-prod first. Failure modes are bounded — the worst case is a noisy PR comment.
- **Compounds existing work.** Reuses recently-shipped `filePatterns` (#778), `commentPolicy`, and the existing PR-source tooling. No new CRD or controller change required to ship the first iteration as an example under `examples/`.

## Suggested First Step

Add an `examples/13-taskspawner-db-migration-review/` directory containing Examples 1–2 above plus a README explaining the antipattern catalog and how to extend it for the team's specific engine. Examples 3 and 4 can land in subsequent iterations once the basic pattern proves out.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New use case: Database schema migration safety and lifecycle pipeline — DDL review, app-compat cross-check, runbook generation, and drift sentinel #1028

Summary

Problem

Database migrations are a top cause of production incidents

Why existing tooling falls short

Proposed Solution

Workflow Overview

Example 1: Static migration safety reviewer (PR-triggered)

Example 2: Application backward-compatibility cross-check (PR-triggered)

Example 3: Operational runbook + reverse-migration generator (comment-triggered)

Example 4: Schema-drift sentinel (cron-scheduled)

Patterns the agent should detect

Kelos Features Leveraged

Relationship to Existing Proposals

Why This Matters

Suggested First Step

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Pattern	Engine	Severity	Safer alternative
`ALTER TABLE ADD COLUMN NOT NULL` without DEFAULT on big table	Postgres ≤10 / MySQL	HIGH	Add nullable → backfill → SET NOT NULL
`ALTER TABLE ADD COLUMN ... DEFAULT volatile_func()`	Postgres ≥11	HIGH	Use immutable default or split
`CREATE INDEX` (not CONCURRENTLY)	Postgres	HIGH	`CREATE INDEX CONCURRENTLY`
`ALTER TABLE ADD FOREIGN KEY` without NOT VALID	Postgres	MEDIUM	`ADD ... NOT VALID; VALIDATE CONSTRAINT;`
`ALTER COLUMN TYPE` requiring rewrite	Postgres	HIGH	New column + backfill + swap
`DROP COLUMN` while app still references	Any	BLOCKER	Stop reading first → release → drop next deploy
Renaming a column/table	Any	BLOCKER	Add new + dual-write + cutover + drop
Big single-statement `UPDATE`	Any	HIGH	Batched updates with smaller transactions
MySQL DDL on InnoDB without ONLINE/INPLACE clause	MySQL	HIGH	gh-ost / pt-online-schema-change
Down migration that's empty or wrong	Any	MEDIUM	Generate correct reverse
Two unrelated changes in one migration	Any	LOW	Split for atomic rollback

Feature	How it's used
`githubPullRequests` source	Triggers safety + compat reviewers on every PR
`filePatterns.include/exclude` (#778)	Targets only migration directories — no false positives on app PRs
`commentPolicy.triggerComment` + `minimumPermission`	Gates the runbook generator behind a write-permission reviewer
`reporting.enabled`	Posts review findings as PR comments
`cron` source	Drives the daily drift sentinel
`dependsOn` (Task pipeline)	Optional: chain reviewer → runbook generator as a strict pipeline
`podOverrides.env` + `valueFrom.secretKeyRef`	Mount read-only DB connection strings into the drift sentinel
`maxConcurrency`	Prevents runaway parallel runs on a busy repo
`AgentConfig` (optional)	Pre-canned safe-migration knowledge file shared across all four spawners

Issue	Relationship
#926 (Framework/language version migration)	Distinct. That issue covers application-code upgrades (React 18→19, Go 1.21→1.23). This issue covers database DDL migrations — a fundamentally different problem (lock acquisition, online operations, replication).
#778 (filePatterns) — closed	Required dependency. This use case is unviable without it; with it, the implementation is straightforward.
#946 (CI/CD failure auto-remediation)	Complementary. If a deploy fails because of a schema-drift mismatch, #946's auto-remediation could open a fix PR; this proposal prevents the failure upstream.
#983 (TaskPipeline CRD)	Enhancement. The Static→AppCompat→Runbook flow could be expressed as a single TaskPipeline rather than three independent spawners.
#881 (contextSources)	Enhancement. Pulling current Postgres/MySQL release notes or an internal "safe migrations playbook" into the prompt would tighten the analysis.
#887 (completionPolicy)	Enhancement. A completion check that the runbook task actually committed a `*.runbook.md` file before marking the Task succeeded.
#743 (codebase health) — closed	Different scope. Health monitoring is broad and scheduled; this is a precision review on a specific file class with a specific risk model.
#884 (cross-repo change propagation)	Composable. After a schema migration ships in the data-layer repo, a downstream propagation could regenerate ORM clients in consumer repos.

New use case: Database schema migration safety and lifecycle pipeline — DDL review, app-compat cross-check, runbook generation, and drift sentinel #1028

Description

Summary

Problem

Database migrations are a top cause of production incidents

Why existing tooling falls short

Proposed Solution

Workflow Overview

Example 1: Static migration safety reviewer (PR-triggered)

Example 2: Application backward-compatibility cross-check (PR-triggered)

Example 3: Operational runbook + reverse-migration generator (comment-triggered)

Example 4: Schema-drift sentinel (cron-scheduled)

Patterns the agent should detect

Kelos Features Leveraged

Relationship to Existing Proposals

Why This Matters

Suggested First Step

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions