Summary
Databend currently mixes several similar-but-different concepts in the codebase:
- `field` vs `column` (especially when nested/compound types exist)
- stable identifier (`ColumnId`) vs ordinal/index/position (`usize`)
- external/subsystem identifiers (Parquet/Iceberg metadata, Tantivy field id, index-file ordinals) vs Databend’s internal ids
This makes code harder to read and maintain, and it increases the risk of subtle bugs such as treating an ordinal as a stable id (e.g. `idx as ColumnId`), or passing `usize` through long call chains where the meaning changes over time.
This tracking issue documents the motivation, proposed terminology, and a staged roadmap to refactor field/column indexing toward stronger typing and explicit boundaries. The plan will be adjusted pragmatically based on real code constraints and review feedback.
Motivation / Why
- Prevent semantic bugs: `ColumnId` is a stable identifier; many `usize` values are ordinals that are not stable across projection/reordering/schema evolution.
- Nested types amplify ambiguity: a logical field may be a compound type, while storage/index layers often operate on leaf/physical columns.
- Cross-boundary safety: Parquet/Iceberg/Tantivy/index-files have their own ids/ordinals. Assuming they are “the same as `ColumnId`” without explicit mapping is fragile.
Proposed terminology
- Field: an entry in `TableSchema.fields` (`TableField`). May represent a nested/compound logical column.
- Leaf column: a physical/leaf column used in storage formats and block representations (often derived from fields via expansion).
- Stable ID: an identifier that remains stable across schema evolution (e.g. `ColumnId(u32)`).
- Ordinal / index / position: a `usize`-like index into a specific vector/view (`TableSchema.fields`, `DataSchema.fields`, a bloom-index column list, etc.). Not stable.
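To make the terminology concrete, here is a minimal sketch of how fields, leaf columns, stable ids, and ordinals can diverge. The `Field` and `FieldType` types, the id values, and the helper functions are hypothetical stand-ins for illustration only, not Databend's actual schema types.

```rust
/// Hypothetical, simplified field type (not Databend's real `TableDataType`).
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Int,
    Str,
    /// A compound type whose children are themselves fields.
    Tuple(Vec<Field>),
}

/// Hypothetical, simplified stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    /// Stable id: assigned once, unaffected by projection or reordering.
    column_id: u32,
    ty: FieldType,
}

fn field(name: &str, column_id: u32, ty: FieldType) -> Field {
    Field { name: name.to_string(), column_id, ty }
}

/// Expand fields depth-first into leaf columns, returning `(path, stable id)` pairs.
fn leaf_columns(fields: &[Field]) -> Vec<(String, u32)> {
    let mut out = Vec::new();
    for f in fields {
        match &f.ty {
            FieldType::Tuple(children) => {
                for (path, id) in leaf_columns(children) {
                    out.push((format!("{}:{}", f.name, path), id));
                }
            }
            _ => out.push((f.name.clone(), f.column_id)),
        }
    }
    out
}

/// Ordinal of a field within one specific list of fields (not stable).
fn ordinal_of(fields: &[Field], name: &str) -> Option<usize> {
    fields.iter().position(|f| f.name == name)
}

/// Three fields, but four leaf columns, because `t` is a tuple of two leaves.
fn sample_schema() -> Vec<Field> {
    vec![
        field("a", 0, FieldType::Int),
        field(
            "t",
            1,
            FieldType::Tuple(vec![
                field("x", 2, FieldType::Int),
                field("y", 3, FieldType::Str),
            ]),
        ),
        field("c", 4, FieldType::Str),
    ]
}
```

In this sketch, `c` has field ordinal 2 but leaf ordinal 3, and projecting the schema to `[c, a]` moves `c` to ordinal 0 while its `column_id` stays 4.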
Proposed strong types (examples)
The goal is to use newtypes to make intent explicit and prevent accidental mixing:
- `ColumnId(u32)`: stable id (keep existing semantics/wire format)
- `TableFieldIndex(usize)`: index into `TableSchema.fields`
- `DataFieldIndex(usize)`: index into `DataSchema.fields` (often equals DataBlock column position)
- `LeafColumnIndex(usize)`: index into a leaf-column view (explicit “leaf” meaning)
- `ParquetFieldId(u32)`: field id stored in parquet/arrow metadata (`"PARQUET:field_id"`)
- `InvertedIndexFieldId(u32)`: Tantivy schema field id (subsystem id)
- `BloomIndexColumnOrdinal(usize)`: bloom index column ordinal (subsystem ordinal)
Key rule: external/subsystem ids must be mapped explicitly (no implicit equivalence).
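A minimal sketch of what such newtypes could look like, assuming plain tuple structs with a small set of derives. The derive lists and the helpers `as_usize`, `describe`, and `column_id_at` are illustrative assumptions, not existing Databend APIs.

```rust
/// Stable id; in this sketch it is a newtype, mirroring `ColumnId(u32)` above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ColumnId(u32);

/// Ordinal into `TableSchema.fields`; not stable across projection.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct TableFieldIndex(usize);

/// Ordinal into a leaf-column view; deliberately a different type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct LeafColumnIndex(usize);

impl TableFieldIndex {
    /// The only way to get the raw ordinal back out is an explicit call.
    fn as_usize(self) -> usize {
        self.0
    }
}

/// Hypothetical consumer that needs a stable id.
/// Calling `describe(TableFieldIndex(0))` or `describe(LeafColumnIndex(0))`
/// is a type error, which is the point of the exercise.
fn describe(id: ColumnId) -> String {
    format!("column#{}", id.0)
}

/// Hypothetical explicit conversion: an ordinal becomes a stable id only by
/// looking it up in the list of ids for one specific schema.
fn column_id_at(ids: &[ColumnId], idx: TableFieldIndex) -> Option<ColumnId> {
    ids.get(idx.as_usize()).copied()
}
```

With these definitions, the `idx as ColumnId` pattern has no direct equivalent: converting an ordinal into an id has to go through a lookup such as `column_id_at`, which makes the schema being indexed visible at the call site.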
Goals
- Make `ColumnId` vs ordinals unambiguous at the type level.
- Replace “semantic `usize`” plumbing with typed indices and typed schema helper APIs.
- Make Parquet/Iceberg/Tantivy/index-file boundaries explicit via conversion helpers.
- Keep changes incremental and reviewable (small PRs), avoiding a “big bang” refactor.
- Maintain compatibility where needed (e.g. for serialized plans), using `serde(rename = "...")` when renaming fields for clarity.
Non-goals
- No global redefinition of all ids (e.g. no forced newtype for `ColumnId` across the entire repo in one shot).
- No meta/proto/wire format changes.
- No intentional behavior changes unless an existing bug is discovered and covered by tests.
Refactoring principles
- Eliminate anti-patterns:
  - `idx as ColumnId`
  - variables named `field_id` that actually carry an ordinal
  - `u32`/`usize` casts that cross subsystem boundaries without explicit meaning
- Prefer typed APIs (incrementally): e.g. `field_at(TableFieldIndex)`, `index_of_field(...) -> TableFieldIndex`, `project_typed(...)`.
- Put subsystem types near the subsystem; keep broadly reusable types in shared crates.
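The typed APIs named above could take roughly the following shape. This is a sketch on simplified stand-in types: the real `TableSchema` and `TableField` carry more information, and the exact signatures here (argument types, `Option` returns) are assumptions rather than the agreed design.

```rust
/// Ordinal into `TableSchema.fields`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TableFieldIndex(usize);

/// Simplified stand-in for a table field.
#[derive(Debug, Clone, PartialEq)]
struct TableField {
    name: String,
    column_id: u32,
}

/// Simplified stand-in for a table schema.
#[derive(Debug, Clone, PartialEq)]
struct TableSchema {
    fields: Vec<TableField>,
}

impl TableSchema {
    /// Typed accessor: the argument can only be a table-field ordinal.
    fn field_at(&self, idx: TableFieldIndex) -> Option<&TableField> {
        self.fields.get(idx.0)
    }

    /// Typed lookup: the result is tagged as an ordinal into this schema.
    fn index_of_field(&self, name: &str) -> Option<TableFieldIndex> {
        self.fields
            .iter()
            .position(|f| f.name == name)
            .map(TableFieldIndex)
    }

    /// Typed projection: returns `None` if any index is out of range.
    fn project_typed(&self, indices: &[TableFieldIndex]) -> Option<TableSchema> {
        let fields = indices
            .iter()
            .map(|&i| self.field_at(i).cloned())
            .collect::<Option<Vec<_>>>()?;
        Some(TableSchema { fields })
    }
}

/// Small schema used for illustration; the ids are arbitrary.
fn sample_table_schema() -> TableSchema {
    let mk = |name: &str, column_id: u32| TableField { name: name.to_string(), column_id };
    TableSchema { fields: vec![mk("a", 0), mk("b", 5), mk("c", 9)] }
}
```

Because these helpers can live next to the existing `usize`-based APIs, callers can migrate one call chain at a time, which fits the incremental approach described here.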
Roadmap (staged PR series)
Exact slicing may change based on dependencies and review feedback.
- Foundation
  - Introduce core newtypes + conversion helpers.
  - Add typed schema helper APIs while keeping existing APIs.
- Indexing subsystems
  - Bloom index: replace “ordinal stored in `ColumnId`” patterns with an explicit ordinal type.
  - Inverted index: wrap Tantivy field id in a dedicated type and stop leaking raw `u32`/`usize`.
- Storage format boundaries
  - Parquet/Iceberg: treat `"PARQUET:field_id"` as `ParquetFieldId`, and require explicit mapping to/from `ColumnId`.
- Planner/executor cleanup
  - Rename misleading variables/maps (`field_id` → `table_field_index` where appropriate).
  - Adopt typed indices in hot paths (e.g. mutation pipeline) with compatibility preserved via serde rename if needed.
- Follow-ups (optional)
  - Doc: a short developer note describing “field vs leaf column” and “stable id vs ordinal”, plus common pitfalls.
  - Additional sweeps across remaining modules.
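For the storage format boundary stage, an explicit mapping between `ParquetFieldId` and `ColumnId` might be sketched as below. The `ParquetFieldMapping` type and its methods are hypothetical and only illustrate the "no implicit equivalence" rule; how the mapping is actually populated from Parquet/Iceberg metadata is out of scope for this sketch.

```rust
use std::collections::HashMap;

/// Databend's stable column id (newtype form, as proposed in this issue).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ColumnId(u32);

/// Field id read from parquet/arrow metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ParquetFieldId(u32);

/// Hypothetical boundary object: the only place where the two id spaces meet.
#[derive(Debug, Default)]
struct ParquetFieldMapping {
    to_column: HashMap<ParquetFieldId, ColumnId>,
}

impl ParquetFieldMapping {
    /// Record one pair; returns the previously mapped id, if any.
    fn insert(&mut self, field_id: ParquetFieldId, column_id: ColumnId) -> Option<ColumnId> {
        self.to_column.insert(field_id, column_id)
    }

    /// External id -> internal id. Unknown ids yield `None` instead of
    /// silently being reinterpreted as a `ColumnId`.
    fn column_id(&self, field_id: ParquetFieldId) -> Option<ColumnId> {
        self.to_column.get(&field_id).copied()
    }

    /// Internal id -> external id (linear scan; fine for a sketch).
    fn parquet_field_id(&self, column_id: ColumnId) -> Option<ParquetFieldId> {
        self.to_column
            .iter()
            .find(|(_, c)| **c == column_id)
            .map(|(f, _)| *f)
    }
}
```

The same pattern would apply to `InvertedIndexFieldId` and `BloomIndexColumnOrdinal`: each subsystem owns its id type, and every crossing goes through a named conversion.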
Tracking checklist
- Foundation: core newtypes + typed schema APIs
- Bloom index: remove ordinal-as-`ColumnId`
- Inverted index: strong type Tantivy field id
- Parquet/Iceberg boundary: explicit `ParquetFieldId` mapping
- Planner/executor: rename + typed indices adoption
- Optional: documentation + broader sweeps
Open questions
- Do we need a dedicated type to distinguish `DataSchema` index vs `DataBlock` column position (when they can diverge)?
- For nested fields, should we introduce a typed `ColumnPath` (instead of passing `Vec<String>` / `name:subname` strings around)?
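To help the discussion of the second question, one possible shape for a typed `ColumnPath` is sketched below. Everything here is an assumption made for illustration: the `:` separator follows the `name:subname` convention mentioned above, and the `parse`, `child`, and `segments` methods are hypothetical.

```rust
use std::fmt;

/// Hypothetical typed path to a (possibly nested) column.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ColumnPath(Vec<String>);

impl ColumnPath {
    /// Parse a `name:subname` style string; empty segments are rejected.
    fn parse(s: &str) -> Option<ColumnPath> {
        let parts: Vec<String> = s.split(':').map(str::to_string).collect();
        if parts.iter().any(|p| p.is_empty()) {
            None
        } else {
            Some(ColumnPath(parts))
        }
    }

    /// Path of a child column nested under this path.
    fn child(&self, name: &str) -> ColumnPath {
        let mut parts = self.0.clone();
        parts.push(name.to_string());
        ColumnPath(parts)
    }

    /// Borrow the individual path segments.
    fn segments(&self) -> &[String] {
        &self.0
    }
}

impl fmt::Display for ColumnPath {
    /// Render back to the `name:subname` string form.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0.join(":"))
    }
}
```

A type like this would keep parsing and formatting in one place, so call sites pass a `ColumnPath` rather than re-splitting strings.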