Skip to content

New use case: Database schema migration safety and lifecycle pipeline β€” DDL review, app-compat cross-check, runbook generation, and drift sentinelΒ #1028

@kelos-bot

Description

@kelos-bot

πŸ€– Kelos Strategist Agent @gjkim42

Summary

Propose a new high-value use case for Kelos: a database schema migration safety and lifecycle pipeline. Bad SQL/DDL migrations remain one of the top causes of production incidents (lock contention, accidentally non-backward-compatible changes, slow online operations, missing rollback paths) β€” yet review and operational planning are still done manually by senior engineers and DBAs. Kelos is uniquely positioned to automate this because (a) the safe-migration body of knowledge is a strong fit for AI synthesis, (b) the inputs are tightly bounded (DDL/migration files), and (c) the existing primitives β€” githubPullRequests with filePatterns, commentPolicy, dependsOn, and cron β€” already compose into the right multi-stage workflow. Critically, this is distinct from existing proposals: #926 covers framework/library version migrations (React 18β†’19 etc.) β€” application-code upgrades, not database DDL.

Problem

Database migrations are a top cause of production incidents

Industry surveys consistently rank schema changes among the leading causes of database-related outages. The failure modes are well-known but pervasive:

  1. Lock contention β€” ALTER TABLE ADD COLUMN NOT NULL rewrites the whole table on most engines; on a 50M-row table this means minutes-to-hours of downtime. Adding indexes without CONCURRENTLY (Postgres) or ALGORITHM=INPLACE (MySQL) blocks writes.
  2. Backward incompatibility with running app code β€” Renaming a column the app still reads, dropping a column before the application stops referencing it, or changing a type in a way that fails on existing rows. This breaks the rolling-deploy contract.
  3. Missing or unsafe rollback paths β€” DROP COLUMN cannot be undone without backup. Many migration frameworks support down() migrations that are silently absent or wrong.
  4. Slow online operations β€” A UPDATE ... WHERE big_condition on millions of rows in a single transaction can blow up WAL/binlog and replication lag.
  5. Multi-statement transactions that block replication β€” DDL in a transaction holds locks longer than necessary; some statements (e.g., MySQL CREATE INDEX historically) implicitly commit.
  6. Cross-environment drift β€” Staging and production schema diverge silently when migrations are skipped or applied out of order, and nobody notices until a failed deploy.
  7. Constraint additions without validation phase β€” ADD FOREIGN KEY without NOT VALID + VALIDATE (Postgres) takes a ShareRowExclusive lock and validates the entire table inline.

Why existing tooling falls short

  • squawk, pglint, dba-cli β€” Static linters catch a known subset of antipatterns but cannot reason about repo state (e.g., "is this column used in internal/api/handler.go?"), application backward compatibility, or operational impact.
  • atlas, dbmate, goose, flyway, alembic, knex, golang-migrate β€” Apply migrations but do not analyze risk or generate reverse migrations.
  • gh-ost, pt-online-schema-change β€” Execute online operations safely but require human authoring of the target schema and operational plan.
  • PR review β€” Senior engineers manually eyeball every migration. Quality is uneven, doesn't scale, and tribal knowledge isn't captured.

An autonomous agent can do something static tools can't: read the migration, read application code that references the affected tables/columns, read the git log of past migrations for conventions, read the engine version from CI config, and synthesize a structured risk assessment plus a concrete operational runbook.

Proposed Solution

Workflow Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. Static-Risk  β”‚    β”‚ 2. App Compat    β”‚    β”‚ 3. Runbook +    β”‚    β”‚ 4. Sentinel: β”‚
β”‚    Review       │───▢│    Cross-Check   │───▢│    Reverse Gen  │───▢│ Schema-Drift β”‚
β”‚  (PR-trigger)   β”‚    β”‚ (PR-trigger)     β”‚    β”‚ (post-approval) β”‚    β”‚  (cron)      β”‚
β”‚                 β”‚    β”‚                  β”‚    β”‚                 β”‚    β”‚              β”‚
β”‚ β€’ DDL safety    β”‚    β”‚ β€’ Find readers   β”‚    β”‚ β€’ Step-by-step  β”‚    β”‚ β€’ Diff       β”‚
β”‚ β€’ Lock impact   β”‚    β”‚   of changed     β”‚    β”‚   rollout       β”‚    β”‚   environs   β”‚
β”‚ β€’ Reversibility β”‚    β”‚   columns        β”‚    β”‚ β€’ Reverse       β”‚    β”‚ β€’ Open       β”‚
β”‚ β€’ Antipatterns  β”‚    β”‚ β€’ Estimate       β”‚    β”‚   migration     β”‚    β”‚   issue if   β”‚
β”‚ β€’ Severity tag  β”‚    β”‚   blast radius   β”‚    β”‚ β€’ Timing est    β”‚    β”‚   diverged   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚                       β”‚                       β–²
        └───── PR comment ──────┴───── PR comment β”€β”€β”€β”€β”€β”€β”˜
                                                        β”‚
                                            triggered by `/kelos
                                            migration-runbook` comment
                                            from authorized reviewer

Each stage is a separate Kelos Task. Stages 1–2 run automatically per PR. Stage 3 is human-gated via commentPolicy. Stage 4 runs on a cron schedule independent of any PR.

Example 1: Static migration safety reviewer (PR-triggered)

Targets PRs that touch migration files, regardless of label. Posts a structured risk assessment as a PR comment.

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-migration-safety-reviewer
spec:
  when:
    githubPullRequests:
      state: open
      filePatterns:
        include:
          # Common migration directory conventions across frameworks:
          - "db/migrations/**/*.sql"
          - "db/migrate/**/*.rb"        # Rails
          - "migrations/**/*.sql"       # golang-migrate, dbmate, goose
          - "alembic/versions/**/*.py"  # SQLAlchemy/Alembic
          - "prisma/migrations/**/*.sql"
          - "supabase/migrations/**/*.sql"
          - "ent/migrate/migrations/**/*.sql"
          - "db/schema.rb"              # Rails schema dump
          - "atlas.hcl"                 # Atlas
        exclude:
          - "**/seeds/**"
          - "**/fixtures/**"
      reporting:
        enabled: true
  maxConcurrency: 3
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: "{{.Branch}}"
    promptTemplate: |
      You are a database migration safety reviewer for PR #{{.Number}}: {{.Title}}.
      Detect the database engine (Postgres / MySQL / SQLite / etc.) from
      go.mod / package.json / requirements.txt / Gemfile or CI config.

      Review only the migration files changed in this PR. For each migration,
      analyze and report:

      ## 1. Operation classification
      Classify each statement as one of:
      - SAFE        β€” no lock or rewrites (e.g., add nullable column, add index CONCURRENTLY)
      - SLOW        β€” long-running but online (e.g., backfill UPDATE)
      - BLOCKING    β€” acquires a lock that blocks reads/writes
      - DESTRUCTIVE β€” irreversible without backup (DROP COLUMN/TABLE/INDEX)

      ## 2. Engine-specific antipattern checks
      Examples (call out matches; this is non-exhaustive):
      - Postgres: ADD COLUMN with non-volatile DEFAULT on PG 11+ is fast;
        ADD COLUMN NOT NULL without DEFAULT is fine on PG 11+; CREATE INDEX
        without CONCURRENTLY is blocking; ALTER TYPE that requires rewrite;
        ADD FOREIGN KEY without NOT VALID + VALIDATE; renaming a column.
      - MySQL: ALGORITHM=COPY operations; lack of pt-online-schema-change/
        gh-ost annotation for >1M-row tables; implicit commit on DDL.
      - SQLite: ALTER TABLE limitations; rebuilding via 12-step procedure.

      ## 3. Reversibility
      For each statement, can the migration be rolled back without data loss?
      If a `down()` migration exists, does it actually undo the up()?

      ## 4. Backward compatibility with running application
      Search the repo for code that reads/writes the affected tables/columns.
      If the app references a column being dropped/renamed/retyped, the
      migration is incompatible with a rolling deploy. Flag this explicitly.

      ## 5. Severity rating
      Overall PR severity: LOW / MEDIUM / HIGH / BLOCKER.
      BLOCKER means the migration cannot be deployed safely as-is.

      ## 6. Suggested fixes
      For each HIGH/BLOCKER issue, propose a concrete safer alternative
      (e.g., "Split into add-nullable-then-backfill-then-set-not-null
      across three migrations").

      Post the review as a single PR comment using `gh pr comment`. Use
      a fenced markdown structure with the sections above. Do NOT push
      code or modify files β€” review only.
    ttlSecondsAfterFinished: 3600

Example 2: Application backward-compatibility cross-check (PR-triggered)

Runs in parallel with Example 1 but focuses specifically on application code references β€” a complementary signal to pure DDL analysis.

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-migration-app-compat
spec:
  when:
    githubPullRequests:
      state: open
      filePatterns:
        include:
          - "db/migrations/**"
          - "migrations/**/*.sql"
          - "alembic/versions/**/*.py"
          - "prisma/migrations/**/*.sql"
      reporting:
        enabled: true
  maxConcurrency: 2
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: "{{.Branch}}"
    promptTemplate: |
      Application backward-compatibility check for PR #{{.Number}}.

      For each migration file changed in this PR, identify every table and
      column being modified, dropped, renamed, or retyped. Then for each
      such schema element:

      1. Search the application source tree (excluding the migration dir
         itself) for references to that table/column. Use `git grep` and
         look for ORM models, raw SQL strings, query builders, and any
         generated code (e.g., sqlc, ent, prisma client).

      2. If the column is being **dropped or renamed**, but live code
         still references it, this breaks the rolling-deploy invariant.
         List every file:line that needs to change first.

      3. If the column is being **retyped**, check whether existing code
         assumes the old type's range / nullability / encoding.

      4. For new NOT NULL columns, check whether existing INSERTs in
         application code provide a value.

      Post the findings as a PR comment with this structure:

      | Schema element | Change | App references | Compatible? | Action |

      Recommend whether the migration can ship in this PR or must be
      split into a backward-compatible sequence (expand β†’ migrate β†’ contract).

      Do NOT modify files β€” analysis only.
    ttlSecondsAfterFinished: 3600

Example 3: Operational runbook + reverse-migration generator (comment-triggered)

After human reviewers approve the static checks, an authorized reviewer comments /kelos migration-runbook to trigger the runbook stage. This is a write-mode task that pushes additional artifacts to the PR branch.

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-migration-runbook
spec:
  when:
    githubPullRequests:
      state: open
      filePatterns:
        include:
          - "db/migrations/**"
          - "migrations/**/*.sql"
          - "alembic/versions/**/*.py"
      commentPolicy:
        triggerComment: "/kelos migration-runbook"
        minimumPermission: write
      reporting:
        enabled: true
  maxConcurrency: 1
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: "{{.Branch}}"
    promptTemplate: |
      Generate operational artifacts for the migration in PR #{{.Number}}.

      For each new migration file in this PR, produce alongside it
      (e.g., `001_add_email_index.runbook.md`):

      ## Operational runbook
      - Pre-flight checks (table size, current locks, replica lag)
      - Recommended execution window (consider blocking impact)
      - Rough timing estimate based on table size β€” query
        `pg_stat_user_tables` / `INFORMATION_SCHEMA.TABLES` if a
        connection string secret is mounted, otherwise estimate by
        order of magnitude
      - Step-by-step rollout (e.g., expand β†’ backfill β†’ contract phases)
      - Rollback procedure (with exact SQL)
      - Monitoring signals to watch (lock waits, replication lag,
        error rates) and abort criteria

      ## Reverse migration
      For frameworks that support up/down (golang-migrate, alembic, knex):
      generate the corresponding `down` migration. Verify it's actually
      reversible β€” for irreversible changes (DROP COLUMN/TABLE), produce
      the reverse as a comment with the explicit warning that data loss
      will occur and a backup is required.

      Commit the runbook and reverse migration to the PR branch and push.
      Add a PR comment summarizing what was generated and where reviewers
      should focus their final attention.
    ttlSecondsAfterFinished: 7200

Example 4: Schema-drift sentinel (cron-scheduled)

Independent of any PR. Periodically diffs the desired schema (from migration files / schema.sql dump) against applied schemas in non-prod environments to catch drift before it causes a deploy failure.

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-schema-drift-sentinel
spec:
  when:
    cron:
      schedule: "0 6 * * *"  # Daily at 06:00 UTC
  maxConcurrency: 1
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: main
    podOverrides:
      env:
        - name: STAGING_DB_URL
          valueFrom:
            secretKeyRef:
              name: staging-readonly-db
              key: DATABASE_URL
        - name: PROD_DB_URL
          valueFrom:
            secretKeyRef:
              name: prod-readonly-db
              key: DATABASE_URL
    promptTemplate: |
      Detect schema drift between the migration history in this repo and
      the live schemas of staging and production.

      1. Determine the database engine and reconstruct the desired schema
         from the migration files (use `atlas schema inspect`, `pg_dump
         --schema-only`, or `mysqldump --no-data` against a temporary
         throwaway DB after applying migrations β€” or use the framework's
         own dry-run feature).

      2. For each environment (staging then prod), connect read-only and
         dump the live schema.

      3. Run a structural diff. Flag ONLY meaningful divergence β€” ignore
         column ordering, comments, default value formatting differences.

      4. For each divergence, classify:
         - "drift: applied locally, missing in repo" β€” an out-of-band hand-edit
         - "drift: in repo, missing in env" β€” a migration was skipped
         - "version skew: env has older head than repo HEAD"

      5. If any divergence is found, open a single GitHub issue titled
         "Schema drift detected: <env>" with the diff and recommended
         remediation. Skip if a similar open issue already exists. Label
         it `db/drift`. If no drift, exit without creating any issue
         (do not create empty 'all clear' issues).

      Do NOT modify the live database under any circumstance.
    ttlSecondsAfterFinished: 3600

Patterns the agent should detect

A non-exhaustive checklist the prompt can be tightened against over time. This is the "expert knowledge" most teams want captured in tooling:

Pattern Engine Severity Safer alternative
ALTER TABLE ADD COLUMN NOT NULL without DEFAULT on big table Postgres ≀10 / MySQL HIGH Add nullable β†’ backfill β†’ SET NOT NULL
ALTER TABLE ADD COLUMN ... DEFAULT volatile_func() Postgres β‰₯11 HIGH Use immutable default or split
CREATE INDEX (not CONCURRENTLY) Postgres HIGH CREATE INDEX CONCURRENTLY
ALTER TABLE ADD FOREIGN KEY without NOT VALID Postgres MEDIUM ADD ... NOT VALID; VALIDATE CONSTRAINT;
ALTER COLUMN TYPE requiring rewrite Postgres HIGH New column + backfill + swap
DROP COLUMN while app still references Any BLOCKER Stop reading first β†’ release β†’ drop next deploy
Renaming a column/table Any BLOCKER Add new + dual-write + cutover + drop
Big single-statement UPDATE Any HIGH Batched updates with smaller transactions
MySQL DDL on InnoDB without ONLINE/INPLACE clause MySQL HIGH gh-ost / pt-online-schema-change
Down migration that's empty or wrong Any MEDIUM Generate correct reverse
Two unrelated changes in one migration Any LOW Split for atomic rollback

Kelos Features Leveraged

Feature How it's used
githubPullRequests source Triggers safety + compat reviewers on every PR
filePatterns.include/exclude (#778) Targets only migration directories β€” no false positives on app PRs
commentPolicy.triggerComment + minimumPermission Gates the runbook generator behind a write-permission reviewer
reporting.enabled Posts review findings as PR comments
cron source Drives the daily drift sentinel
dependsOn (Task pipeline) Optional: chain reviewer β†’ runbook generator as a strict pipeline
podOverrides.env + valueFrom.secretKeyRef Mount read-only DB connection strings into the drift sentinel
maxConcurrency Prevents runaway parallel runs on a busy repo
AgentConfig (optional) Pre-canned safe-migration knowledge file shared across all four spawners

Relationship to Existing Proposals

Issue Relationship
#926 (Framework/language version migration) Distinct. That issue covers application-code upgrades (React 18β†’19, Go 1.21β†’1.23). This issue covers database DDL migrations β€” a fundamentally different problem (lock acquisition, online operations, replication).
#778 (filePatterns) β€” closed Required dependency. This use case is unviable without it; with it, the implementation is straightforward.
#946 (CI/CD failure auto-remediation) Complementary. If a deploy fails because of a schema-drift mismatch, #946's auto-remediation could open a fix PR; this proposal prevents the failure upstream.
#983 (TaskPipeline CRD) Enhancement. The Static→AppCompat→Runbook flow could be expressed as a single TaskPipeline rather than three independent spawners.
#881 (contextSources) Enhancement. Pulling current Postgres/MySQL release notes or an internal "safe migrations playbook" into the prompt would tighten the analysis.
#887 (completionPolicy) Enhancement. A completion check that the runbook task actually committed a *.runbook.md file before marking the Task succeeded.
#743 (codebase health) β€” closed Different scope. Health monitoring is broad and scheduled; this is a precision review on a specific file class with a specific risk model.
#884 (cross-repo change propagation) Composable. After a schema migration ships in the data-layer repo, a downstream propagation could regenerate ORM clients in consumer repos.

Why This Matters

  • Universally applicable. Every team that owns a database has migrations. Unlike many proposed use cases that target a specific subdomain, this one applies to virtually all of Kelos's potential audience.
  • Senior-engineer leverage. Migration review is a recurring high-attention task that senior engineers and DBAs do manually today. Automating the first pass with an AI agent and reserving humans for judgment calls is a textbook leverage win.
  • Strong demo material. A PR comment that says "BLOCKER: this ALTER TABLE … TYPE will rewrite a 50M-row table; here's the safer 3-migration sequence" is a vivid, instantly-comprehensible demonstration of Kelos's value.
  • Low blast radius. The reviewer stages are read-only. The runbook stage is human-gated. The drift sentinel reads from non-prod first. Failure modes are bounded β€” the worst case is a noisy PR comment.
  • Compounds existing work. Reuses recently-shipped filePatterns (API: Add filePatterns filter and ChangedFiles enrichment to githubPullRequests for content-aware task routingΒ #778), commentPolicy, and the existing PR-source tooling. No new CRD or controller change required to ship the first iteration as an example under examples/.

Suggested First Step

Add an examples/13-taskspawner-db-migration-review/ directory containing Examples 1–2 above plus a README explaining the antipattern catalog and how to extend it for the team's specific engine. Examples 3 and 4 can land in subsequent iterations once the basic pattern proves out.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions