Propose a new high-value use case for Kelos: a database schema migration safety and lifecycle pipeline. Bad SQL/DDL migrations remain one of the top causes of production incidents (lock contention, accidentally non-backward-compatible changes, slow online operations, missing rollback paths), yet review and operational planning are still done manually by senior engineers and DBAs. Kelos is uniquely positioned to automate this because (a) the safe-migration body of knowledge is a strong fit for AI synthesis, (b) the inputs are tightly bounded (DDL/migration files), and (c) the existing primitives (githubPullRequests with filePatterns, commentPolicy, dependsOn, and cron) already compose into the right multi-stage workflow. Critically, this is distinct from existing proposals: #926 covers framework/library version migrations (React 18→19, etc.), i.e., application-code upgrades, not database DDL.
Problem
Database migrations are a top cause of production incidents
Industry surveys consistently rank schema changes among the leading causes of database-related outages. The failure modes are well-known but pervasive:
Lock contention: ALTER TABLE ... ADD COLUMN with NOT NULL and a DEFAULT rewrites the whole table on most engines (including Postgres before 11); on a 50M-row table this means minutes to hours of downtime. Adding indexes without CONCURRENTLY (Postgres) or ALGORITHM=INPLACE (MySQL) blocks writes.
Backward incompatibility with running app code: renaming a column the app still reads, dropping a column before the application stops referencing it, or changing a type in a way that fails on existing rows. This breaks the rolling-deploy contract.
Missing or unsafe rollback paths: DROP COLUMN cannot be undone without a backup. Many migration frameworks support down() migrations that are silently absent or wrong.
Slow online operations: an UPDATE ... WHERE big_condition touching millions of rows in a single transaction can blow up the WAL/binlog and replication lag.
Multi-statement transactions that block replication: DDL in a transaction holds locks longer than necessary, and some statements (e.g., MySQL CREATE INDEX historically) implicitly commit.
Cross-environment drift: staging and production schemas diverge silently when migrations are skipped or applied out of order, and nobody notices until a failed deploy.
Constraint additions without a validation phase: ADD FOREIGN KEY without NOT VALID + VALIDATE (Postgres) takes a ShareRowExclusive lock and validates the entire table inline.
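As a concrete instance of that last point, the two-phase pattern looks like this on Postgres (table and constraint names are illustrative):

```sql
-- Phase 1: register the constraint without scanning existing rows.
-- NOT VALID takes only a brief metadata lock; new writes are still checked.
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_id_fkey
  FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

-- Phase 2 (separate migration): validate existing rows under a weaker lock
-- that does not block reads or writes on the table.
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_id_fkey;
```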
Why existing tooling falls short
squawk, pglint, dba-cli: static linters catch a known subset of antipatterns but cannot reason about repo state (e.g., "is this column used in internal/api/handler.go?"), application backward compatibility, or operational impact.
atlas, dbmate, goose, flyway, alembic, knex, golang-migrate: these apply migrations but do not analyze risk or generate reverse migrations.
gh-ost, pt-online-schema-change: these execute online operations safely but require human authoring of the target schema and operational plan.
PR review: senior engineers manually eyeball every migration. Quality is uneven, it doesn't scale, and tribal knowledge isn't captured.
An autonomous agent can do something static tools can't: read the migration, read application code that references the affected tables/columns, read the git log of past migrations for conventions, read the engine version from CI config, and synthesize a structured risk assessment plus a concrete operational runbook.
Proposed Solution
Workflow Overview
Each stage is a separate Kelos Task. Stages 1 and 2 run automatically per PR. Stage 3 is human-gated via commentPolicy. Stage 4 runs on a cron schedule independent of any PR.
Example 1: Static migration safety reviewer (PR-triggered)
Targets PRs that touch migration files, regardless of label. Posts a structured risk assessment as a PR comment.
```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-migration-safety-reviewer
spec:
  when:
    githubPullRequests:
      state: open
      filePatterns:
        include:
          # Common migration directory conventions across frameworks:
          - "db/migrations/**/*.sql"
          - "db/migrate/**/*.rb"              # Rails
          - "migrations/**/*.sql"             # golang-migrate, dbmate, goose
          - "alembic/versions/**/*.py"        # SQLAlchemy/Alembic
          - "prisma/migrations/**/*.sql"
          - "supabase/migrations/**/*.sql"
          - "ent/migrate/migrations/**/*.sql"
          - "db/schema.rb"                    # Rails schema dump
          - "atlas.hcl"                       # Atlas
        exclude:
          - "**/seeds/**"
          - "**/fixtures/**"
  reporting:
    enabled: true
  maxConcurrency: 3
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: "{{.Branch}}"
    promptTemplate: |
      You are a database migration safety reviewer for PR #{{.Number}}: {{.Title}}.

      Detect the database engine (Postgres / MySQL / SQLite / etc.) from go.mod /
      package.json / requirements.txt / Gemfile or CI config. Review only the
      migration files changed in this PR. For each migration, analyze and report:

      ## 1. Operation classification
      Classify each statement as one of:
      - SAFE: no lock or rewrite (e.g., add nullable column, add index CONCURRENTLY)
      - SLOW: long-running but online (e.g., backfill UPDATE)
      - BLOCKING: acquires a lock that blocks reads/writes
      - DESTRUCTIVE: irreversible without backup (DROP COLUMN/TABLE/INDEX)

      ## 2. Engine-specific antipattern checks
      Examples (call out matches; this is non-exhaustive):
      - Postgres: ADD COLUMN with a non-volatile DEFAULT is fast on PG 11+
        (no rewrite), including NOT NULL DEFAULT <constant>; CREATE INDEX
        without CONCURRENTLY is blocking; ALTER TYPE that requires a rewrite;
        ADD FOREIGN KEY without NOT VALID + VALIDATE; renaming a column.
      - MySQL: ALGORITHM=COPY operations; lack of pt-online-schema-change /
        gh-ost annotation for >1M-row tables; implicit commit on DDL.
      - SQLite: ALTER TABLE limitations; rebuilding via the 12-step procedure.

      ## 3. Reversibility
      For each statement, can the migration be rolled back without data loss?
      If a `down()` migration exists, does it actually undo the up()?

      ## 4. Backward compatibility with running application
      Search the repo for code that reads/writes the affected tables/columns.
      If the app references a column being dropped/renamed/retyped, the
      migration is incompatible with a rolling deploy. Flag this explicitly.

      ## 5. Severity rating
      Overall PR severity: LOW / MEDIUM / HIGH / BLOCKER. BLOCKER means the
      migration cannot be deployed safely as-is.

      ## 6. Suggested fixes
      For each HIGH/BLOCKER issue, propose a concrete safer alternative (e.g.,
      "Split into add-nullable-then-backfill-then-set-not-null across three
      migrations").

      Post the review as a single PR comment using `gh pr comment`. Use a
      fenced markdown structure with the sections above. Do NOT push code or
      modify files; review only.
    ttlSecondsAfterFinished: 3600
```
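For reference, the SAFE vs. BLOCKING distinction in section 1 of the prompt can come down to a single keyword on Postgres (index and table names are illustrative):

```sql
-- BLOCKING: a plain CREATE INDEX holds a lock that blocks writes to the
-- table for the entire duration of the index build.
CREATE INDEX idx_orders_created_at ON orders (created_at);

-- SAFE: builds the index online. Note it cannot run inside a transaction
-- block, so the migration framework must not wrap it in one.
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);
```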
Example 2: Application backward-compatibility cross-check (PR-triggered)
Runs in parallel with Example 1 but focuses specifically on application code references, a complementary signal to pure DDL analysis.
```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-migration-app-compat
spec:
  when:
    githubPullRequests:
      state: open
      filePatterns:
        include:
          - "db/migrations/**"
          - "migrations/**/*.sql"
          - "alembic/versions/**/*.py"
          - "prisma/migrations/**/*.sql"
  reporting:
    enabled: true
  maxConcurrency: 2
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: "{{.Branch}}"
    promptTemplate: |
      Application backward-compatibility check for PR #{{.Number}}.

      For each migration file changed in this PR, identify every table and
      column being modified, dropped, renamed, or retyped. Then for each such
      schema element:

      1. Search the application source tree (excluding the migration dir
         itself) for references to that table/column. Use `git grep` and look
         for ORM models, raw SQL strings, query builders, and any generated
         code (e.g., sqlc, ent, prisma client).
      2. If the column is being **dropped or renamed**, but live code still
         references it, this breaks the rolling-deploy invariant. List every
         file:line that needs to change first.
      3. If the column is being **retyped**, check whether existing code
         assumes the old type's range / nullability / encoding.
      4. For new NOT NULL columns, check whether existing INSERTs in
         application code provide a value.

      Post the findings as a PR comment with this structure:

      | Schema element | Change | App references | Compatible? | Action |

      Recommend whether the migration can ship in this PR or must be split
      into a backward-compatible sequence (expand → migrate → contract).

      Do NOT modify files; analysis only.
    ttlSecondsAfterFinished: 3600
```
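The expand → migrate → contract sequence the prompt recommends might look like this for a column rename (Postgres syntax; table and column names are illustrative):

```sql
-- Expand (release N): add the new column and copy existing data.
ALTER TABLE users ADD COLUMN full_name text;
UPDATE users SET full_name = fullname WHERE full_name IS NULL;
-- App code in release N writes both columns and reads the new one
-- (or a trigger keeps them in sync during the transition).

-- Contract (release N+1): drop the old column once nothing reads it.
ALTER TABLE users DROP COLUMN fullname;
```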
Example 3: Operational runbook + reverse-migration generator (comment-triggered)
After human reviewers approve the static checks, an authorized reviewer comments /kelos migration-runbook to trigger the runbook stage. This is a write-mode task that pushes additional artifacts to the PR branch.
```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-migration-runbook
spec:
  when:
    githubPullRequests:
      state: open
      filePatterns:
        include:
          - "db/migrations/**"
          - "migrations/**/*.sql"
          - "alembic/versions/**/*.py"
      commentPolicy:
        triggerComment: "/kelos migration-runbook"
        minimumPermission: write
  reporting:
    enabled: true
  maxConcurrency: 1
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: "{{.Branch}}"
    promptTemplate: |
      Generate operational artifacts for the migration in PR #{{.Number}}.

      For each new migration file in this PR, produce alongside it
      (e.g., `001_add_email_index.runbook.md`):

      ## Operational runbook
      - Pre-flight checks (table size, current locks, replica lag)
      - Recommended execution window (consider blocking impact)
      - Rough timing estimate based on table size: query
        `pg_stat_user_tables` / `INFORMATION_SCHEMA.TABLES` if a connection
        string secret is mounted, otherwise estimate by order of magnitude
      - Step-by-step rollout (e.g., expand → backfill → contract phases)
      - Rollback procedure (with exact SQL)
      - Monitoring signals to watch (lock waits, replication lag, error
        rates) and abort criteria

      ## Reverse migration
      For frameworks that support up/down (golang-migrate, alembic, knex):
      generate the corresponding `down` migration. Verify it's actually
      reversible; for irreversible changes (DROP COLUMN/TABLE), produce the
      reverse as a comment with an explicit warning that data loss will occur
      and a backup is required.

      Commit the runbook and reverse migration to the PR branch and push.
      Add a PR comment summarizing what was generated and where reviewers
      should focus their final attention.
    ttlSecondsAfterFinished: 7200
```
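A minimal up/down pair of the kind this task would generate, sketched with golang-migrate-style file naming (names are illustrative):

```sql
-- 002_add_last_login.up.sql
ALTER TABLE users ADD COLUMN last_login timestamptz;

-- 002_add_last_login.down.sql
-- WARNING: destructive. Any last_login values written since the up
-- migration are lost; take a backup before running.
ALTER TABLE users DROP COLUMN last_login;
```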
Example 4: Schema-drift sentinel (cron-scheduled)
Independent of any PR. Periodically diffs the desired schema (from migration files / schema.sql dump) against applied schemas in non-prod environments to catch drift before it causes a deploy failure.
```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: db-schema-drift-sentinel
spec:
  when:
    cron:
      schedule: "0 6 * * *"  # Daily at 06:00 UTC
  maxConcurrency: 1
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth-token
    branch: main
    podOverrides:
      env:
        - name: STAGING_DB_URL
          valueFrom:
            secretKeyRef:
              name: staging-readonly-db
              key: DATABASE_URL
        - name: PROD_DB_URL
          valueFrom:
            secretKeyRef:
              name: prod-readonly-db
              key: DATABASE_URL
    promptTemplate: |
      Detect schema drift between the migration history in this repo and the
      live schemas of staging and production.

      1. Determine the database engine and reconstruct the desired schema
         from the migration files (use `atlas schema inspect`, `pg_dump
         --schema-only`, or `mysqldump --no-data` against a temporary
         throwaway DB after applying migrations, or use the framework's own
         dry-run feature).
      2. For each environment (staging, then prod), connect read-only and
         dump the live schema.
      3. Run a structural diff. Flag ONLY meaningful divergence; ignore
         column ordering, comments, and default-value formatting differences.
      4. For each divergence, classify:
         - "drift: applied in env, missing in repo": an out-of-band hand-edit
         - "drift: in repo, missing in env": a migration was skipped
         - "version skew: env has an older head than repo HEAD"
      5. If any divergence is found, open a single GitHub issue titled
         "Schema drift detected: <env>" with the diff and recommended
         remediation. Skip if a similar open issue already exists. Label it
         `db/drift`.

      If no drift, exit without creating any issue (do not create empty
      "all clear" issues). Do NOT modify the live database under any
      circumstance.
    ttlSecondsAfterFinished: 3600
```
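One half of the structural diff in step 3 can be as simple as dumping column definitions in a stable order from `information_schema` and diffing the two outputs (a Postgres/MySQL sketch; the schema name is an assumption):

```sql
-- Run against both the throwaway DB (desired schema) and the live replica,
-- then diff the outputs. Ordering by name rather than ordinal_position keeps
-- the listing stable and ignores column-order differences, per step 3.
SELECT table_name, column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, column_name;
```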
Patterns the agent should detect
A non-exhaustive checklist the prompt can be tightened against over time. This is the "expert knowledge" most teams want captured in tooling:
| Pattern | Engine | Severity | Safer alternative |
| --- | --- | --- | --- |
| `ALTER TABLE ADD COLUMN NOT NULL` without DEFAULT on a big table | Postgres ≤10 / MySQL | HIGH | Add nullable → backfill → `SET NOT NULL` |
| `ALTER TABLE ADD COLUMN ... DEFAULT volatile_func()` | Postgres ≥11 | HIGH | Use an immutable default or split |
| `CREATE INDEX` (not CONCURRENTLY) | Postgres | HIGH | `CREATE INDEX CONCURRENTLY` |
| `ALTER TABLE ADD FOREIGN KEY` without NOT VALID | Postgres | MEDIUM | `ADD ... NOT VALID;` then `VALIDATE CONSTRAINT;` |
| `ALTER COLUMN TYPE` requiring a rewrite | Postgres | HIGH | New column + backfill + swap |
| `DROP COLUMN` while the app still references it | Any | BLOCKER | Stop reading first → release → drop in the next deploy |
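The first row's safer alternative, spelled out as three migrations (Postgres; table, column, and batch size are illustrative):

```sql
-- Migration 1 (expand): a nullable add is a metadata-only change.
ALTER TABLE users ADD COLUMN email_verified boolean;

-- Migration 2 (backfill): batched so each transaction stays short.
-- Repeat until it reports zero rows updated.
UPDATE users SET email_verified = false
WHERE id IN (
  SELECT id FROM users WHERE email_verified IS NULL LIMIT 10000
);

-- Migration 3 (contract): set the default for new rows, then enforce.
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
```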
Relationship to Existing Proposals
Distinct (#926). That issue covers application-code upgrades (React 18→19, Go 1.21→1.23). This issue covers database DDL migrations, a fundamentally different problem (lock acquisition, online operations, replication).
Complementary (#946). If a deploy fails because of a schema-drift mismatch, #946's auto-remediation could open a fix PR; this proposal prevents the failure upstream.
Composable. After a schema migration ships in the data-layer repo, a downstream propagation could regenerate ORM clients in consumer repos.
Why This Matters
Universally applicable. Every team that owns a database has migrations. Unlike many proposed use cases that target a specific subdomain, this one applies to virtually all of Kelos's potential audience.
Senior-engineer leverage. Migration review is a recurring high-attention task that senior engineers and DBAs do manually today. Automating the first pass with an AI agent and reserving humans for judgment calls is a textbook leverage win.
Strong demo material. A PR comment that says "BLOCKER: this ALTER TABLE … TYPE will rewrite a 50M-row table; here's the safer 3-migration sequence" is a vivid, instantly comprehensible demonstration of Kelos's value.
Low blast radius. The reviewer stages are read-only. The runbook stage is human-gated. The drift sentinel reads from non-prod first. Failure modes are bounded: the worst case is a noisy PR comment.
Suggested First Step
Add an examples/13-taskspawner-db-migration-review/ directory containing Examples 1 and 2 above plus a README explaining the antipattern catalog and how to extend it for the team's specific engine. Examples 3 and 4 can land in subsequent iterations once the basic pattern proves out.
🤖 Kelos Strategist Agent @gjkim42
Kelos Features Leveraged
githubPullRequests source
filePatterns.include / exclude (#778)
commentPolicy.triggerComment + minimumPermission
reporting.enabled
cron source
dependsOn (Task pipeline)
podOverrides.env + valueFrom.secretKeyRef
maxConcurrency
AgentConfig (optional)
Implementable today. Built entirely on filePatterns (API: Add filePatterns filter and ChangedFiles enrichment to githubPullRequests for content-aware task routing, #778), commentPolicy, and the existing PR-source tooling; no new CRD or controller change is required to ship the first iteration as an example under examples/.