Tracking: Strongly-typed field/column indexing refactor #19429

@dantengsky

Summary

Databend currently mixes several similar-but-different concepts in the codebase:

  • field vs column (especially when nested/compound types exist)
  • stable identifier (ColumnId) vs ordinal/index/position (usize)
  • external/subsystem identifiers (Parquet/Iceberg metadata, Tantivy field id, index-file ordinals) vs Databend’s internal ids

This makes code harder to read and maintain, and it increases the risk of subtle bugs such as treating an ordinal as a stable id (e.g. idx as ColumnId), or passing a usize through a long call chain where its meaning changes from one layer to the next.

This tracking issue documents the motivation, proposed terminology, and a staged roadmap to refactor field/column indexing toward stronger typing and explicit boundaries. The plan will be adjusted pragmatically based on real code constraints and review feedback.

Motivation / Why

  • Prevent semantic bugs: ColumnId is a stable identifier; many usize values are ordinals that are not stable across projection/reordering/schema evolution.
  • Nested types amplify ambiguity: a logical field may be a compound type, while storage/index layers often operate on leaf/physical columns.
  • Cross-boundary safety: Parquet/Iceberg/Tantivy/index-files have their own ids/ordinals. Assuming they are “the same as ColumnId” without explicit mapping is fragile.

Proposed terminology

  • Field: an entry in TableSchema.fields (TableField). May represent a nested/compound logical column.
  • Leaf column: a physical/leaf column used in storage formats and block representations (often derived from fields via expansion).
  • Stable ID: an identifier that remains stable across schema evolution (e.g. ColumnId(u32)).
  • Ordinal / index / position: a usize-like index into a specific vector/view (TableSchema.fields, DataSchema.fields, a bloom-index column list, etc.). Not stable.
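
To make the terminology concrete, here is a minimal sketch of how one field can expand into several leaf columns, and how an ordinal shifts under projection while the stable id does not. The `Field` and `FieldType` types and the `leaf_column_ids` / `ordinal_of` helpers are simplified stand-ins invented for illustration; they are not Databend's actual `TableField` or schema APIs.

```rust
/// Stable identifier: unchanged by projection, reordering, or schema evolution.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ColumnId(pub u32);

/// Simplified (hypothetical) field type: a scalar leaf or a tuple of sub-fields.
#[derive(Debug, Clone)]
pub enum FieldType {
    Scalar(ColumnId),
    Tuple(Vec<Field>),
}

/// Simplified (hypothetical) stand-in for an entry in a schema's field list.
#[derive(Debug, Clone)]
pub struct Field {
    pub name: String,
    pub ty: FieldType,
}

impl Field {
    pub fn scalar(name: &str, id: u32) -> Self {
        Field { name: name.to_string(), ty: FieldType::Scalar(ColumnId(id)) }
    }

    pub fn tuple(name: &str, children: Vec<Field>) -> Self {
        Field { name: name.to_string(), ty: FieldType::Tuple(children) }
    }
}

/// Depth-first expansion of fields into the stable ids of their leaf columns.
pub fn leaf_column_ids(fields: &[Field]) -> Vec<ColumnId> {
    let mut out = Vec::new();
    for field in fields {
        match &field.ty {
            FieldType::Scalar(id) => out.push(*id),
            FieldType::Tuple(children) => out.extend(leaf_column_ids(children)),
        }
    }
    out
}

/// Ordinal of a top-level field within one specific field list (not stable).
pub fn ordinal_of(fields: &[Field], name: &str) -> Option<usize> {
    fields.iter().position(|f| f.name == name)
}
```

With three fields `a`, `t` (a tuple of `x` and `y`), and `b`, the field list has three entries but four leaf columns. Projecting only `b` moves its ordinal from 2 to 0, while its `ColumnId` stays the same.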

Proposed strong types (examples)

The goal is to use newtypes to make intent explicit and prevent accidental mixing:

  • ColumnId(u32): stable id (keep existing semantics/wire format)
  • TableFieldIndex(usize): index into TableSchema.fields
  • DataFieldIndex(usize): index into DataSchema.fields (often equals DataBlock column position)
  • LeafColumnIndex(usize): index into a leaf-column view (explicit “leaf” meaning)
  • ParquetFieldId(u32): field id stored in parquet/arrow metadata ("PARQUET:field_id")
  • InvertedIndexFieldId(u32): Tantivy schema field id (subsystem id)
  • BloomIndexColumnOrdinal(usize): bloom index column ordinal (subsystem ordinal)

Key rule: external/subsystem ids must be mapped explicitly (no implicit equivalence).
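
A minimal sketch of what the newtypes and the "explicit mapping" rule could look like. The type names follow the list above; the derives, the `ParquetFieldMapping` struct, and its methods are assumptions made for illustration, not existing Databend code.

```rust
use std::collections::HashMap;

/// Stable column identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ColumnId(pub u32);

/// Index into `TableSchema.fields`; only meaningful for one schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TableFieldIndex(pub usize);

/// Field id read from Parquet/Arrow metadata ("PARQUET:field_id").
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ParquetFieldId(pub u32);

/// Hypothetical boundary helper: the only place where a `ParquetFieldId`
/// is turned into a `ColumnId`.
pub struct ParquetFieldMapping {
    to_column: HashMap<ParquetFieldId, ColumnId>,
}

impl ParquetFieldMapping {
    pub fn new(pairs: impl IntoIterator<Item = (ParquetFieldId, ColumnId)>) -> Self {
        Self { to_column: pairs.into_iter().collect() }
    }

    /// Returns `None` when the file carries a field id we have no mapping for,
    /// instead of silently reinterpreting the raw number.
    pub fn column_id(&self, id: ParquetFieldId) -> Option<ColumnId> {
        self.to_column.get(&id).copied()
    }
}
```

Because each wrapper is a distinct type, passing a `TableFieldIndex` or a `ParquetFieldId` where a `ColumnId` is expected fails to compile, and the `as` cast that `idx as ColumnId` relies on is no longer available between the wrappers.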

Goals

  • Make ColumnId vs ordinals unambiguous at the type level.
  • Replace “semantic usize” plumbing with typed indices and typed schema helper APIs.
  • Make Parquet/Iceberg/Tantivy/index-file boundaries explicit via conversion helpers.
  • Keep changes incremental and reviewable (small PRs), avoiding a “big bang” refactor.
  • Maintain compatibility where needed (e.g. for serialized plans), using serde(rename = "...") when renaming fields for clarity.

Non-goals

  • No global redefinition of all ids (e.g. no forced newtype for ColumnId across the entire repo in one shot).
  • No meta/proto/wire format changes.
  • No intentional behavior changes unless an existing bug is discovered and covered by tests.

Refactoring principles

  • Eliminate anti-patterns:
    • idx as ColumnId
    • variables named field_id that actually carry an ordinal
    • u32/usize casts that cross subsystem boundaries without explicit meaning
  • Prefer typed APIs (incrementally): e.g. field_at(TableFieldIndex), index_of_field(...) -> TableFieldIndex, project_typed(...).
  • Put subsystem types near the subsystem; keep broadly reusable types in shared crates.
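
The typed helper APIs named above could look roughly like the following sketch. The method names come from this issue, but the signatures, the `Option` return types, and the simplified `TableSchema` / `TableField` structs are assumptions for illustration only.

```rust
/// Index into `TableSchema.fields`; only valid for the schema it came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TableFieldIndex(pub usize);

/// Simplified (hypothetical) stand-in for a table field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableField {
    pub name: String,
    pub column_id: u32,
}

/// Simplified (hypothetical) stand-in for a table schema.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableSchema {
    pub fields: Vec<TableField>,
}

impl TableSchema {
    /// Typed lookup by position; `None` if the index is out of range.
    pub fn field_at(&self, idx: TableFieldIndex) -> Option<&TableField> {
        self.fields.get(idx.0)
    }

    /// Typed lookup by name; the result cannot be confused with a column id.
    pub fn index_of_field(&self, name: &str) -> Option<TableFieldIndex> {
        self.fields
            .iter()
            .position(|f| f.name == name)
            .map(TableFieldIndex)
    }

    /// Builds a projected schema; `None` if any index is out of range.
    pub fn project_typed(&self, indices: &[TableFieldIndex]) -> Option<TableSchema> {
        let fields = indices
            .iter()
            .map(|&i| self.field_at(i).cloned())
            .collect::<Option<Vec<_>>>()?;
        Some(TableSchema { fields })
    }
}
```

Existing `usize`-based methods could stay alongside these during the migration, so call sites can move over one PR at a time.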

Roadmap (staged PR series)

Exact slicing may change based on dependencies and review feedback.

  1. Foundation

    • Introduce core newtypes + conversion helpers.
    • Add typed schema helper APIs while keeping existing APIs.
  2. Indexing subsystems

    • Bloom index: replace “ordinal stored in ColumnId” patterns with an explicit ordinal type.
    • Inverted index: wrap Tantivy field id in a dedicated type and stop leaking raw u32/usize.
  3. Storage format boundaries

    • Parquet/Iceberg: treat "PARQUET:field_id" as ParquetFieldId, and require explicit mapping to/from ColumnId.
  4. Planner/executor cleanup

    • Rename misleading variables/maps (field_id → table_field_index where appropriate).
    • Adopt typed indices in hot paths (e.g. mutation pipeline) with compatibility preserved via serde rename if needed.
  5. Follow-ups (optional)

    • Doc: a short developer note describing “field vs leaf column” and “stable id vs ordinal”, plus common pitfalls.
    • Additional sweeps across remaining modules.
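
As an illustration of the bloom index step in the roadmap, the sketch below keeps the ordinal and the stable id in separate types and converts between them only through a lookup. `BloomIndexColumns` and its methods are hypothetical; the real bloom index code may organise its column list differently.

```rust
/// Stable column identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ColumnId(pub u32);

/// Position of a column inside one bloom index's column list (subsystem ordinal).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BloomIndexColumnOrdinal(pub usize);

/// Hypothetical view of the columns covered by a bloom index, in index order.
pub struct BloomIndexColumns {
    columns: Vec<ColumnId>,
}

impl BloomIndexColumns {
    pub fn new(columns: Vec<ColumnId>) -> Self {
        Self { columns }
    }

    /// Stable id -> ordinal; `None` if the column is not covered by the index.
    pub fn ordinal_of(&self, id: ColumnId) -> Option<BloomIndexColumnOrdinal> {
        self.columns
            .iter()
            .position(|c| *c == id)
            .map(BloomIndexColumnOrdinal)
    }

    /// Ordinal -> stable id; `None` if the ordinal is out of range.
    pub fn column_at(&self, ordinal: BloomIndexColumnOrdinal) -> Option<ColumnId> {
        self.columns.get(ordinal.0).copied()
    }
}
```

With this shape, an ordinal can no longer be stored in a `ColumnId` by accident, because the only way to obtain one from the other is a lookup against a specific column list.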

Tracking checklist

  • Foundation: core newtypes + typed schema APIs
  • Bloom index: remove ordinal-as-ColumnId
  • Inverted index: strong type Tantivy field id
  • Parquet/Iceberg boundary: explicit ParquetFieldId mapping
  • Planner/executor: rename + typed indices adoption
  • Optional: documentation + broader sweeps

Open questions

  • Do we need a dedicated type to distinguish DataSchema index vs DataBlock column position (when they can diverge)?
  • For nested fields, should we introduce a typed ColumnPath (instead of passing Vec<String> / name:subname strings around)?
