Tracking: Strongly-typed field/column indexing refactor #19429

## Summary

Databend currently mixes several similar-but-different concepts in the codebase:

- `field` vs `column` (especially when nested/compound types exist)
- stable identifier (`ColumnId`) vs ordinal/index/position (`usize`)
- external/subsystem identifiers (Parquet/Iceberg metadata, Tantivy field id, index-file ordinals) vs Databend’s internal ids

This makes code harder to read and maintain, and it increases the risk of subtle bugs such as treating an ordinal as a stable id (e.g. `idx as ColumnId`), or passing a bare `usize` through long call chains where its meaning silently changes from one layer to the next.

This tracking issue documents the motivation, proposed terminology, and a staged roadmap to refactor field/column indexing toward stronger typing and explicit boundaries. The plan will be adjusted pragmatically based on real code constraints and review feedback.

## Motivation / Why

- **Prevent semantic bugs**: `ColumnId` is a stable identifier; many `usize` values are ordinals that are not stable across projection/reordering/schema evolution.
- **Nested types amplify ambiguity**: a logical field may be a compound type, while storage/index layers often operate on leaf/physical columns.
- **Cross-boundary safety**: Parquet/Iceberg/Tantivy/index-files have their own ids/ordinals. Assuming they are “the same as `ColumnId`” without explicit mapping is fragile.
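
To make the first point concrete, here is a minimal sketch of how an ordinal drifts under projection while the stable id does not. The `Field` struct and the helpers `project`, `ordinal_of`, and `buggy_column_id` are hypothetical stand-ins for illustration, and `ColumnId` is assumed to be a plain `u32` alias so that the `idx as ColumnId` cast compiles.

```rust
/// Stable identifier; assumed here to be a plain alias so the cast below compiles.
type ColumnId = u32;

/// Hypothetical, simplified field: a name plus its stable id.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: &'static str,
    column_id: ColumnId,
}

/// Keep only the fields at the given ordinals, in the given order.
fn project(fields: &[Field], ordinals: &[usize]) -> Vec<Field> {
    ordinals.iter().map(|&i| fields[i].clone()).collect()
}

/// Ordinal of the field called `name` within this particular view.
fn ordinal_of(fields: &[Field], name: &str) -> Option<usize> {
    fields.iter().position(|f| f.name == name)
}

/// The anti-pattern: reinterpret an ordinal as a stable id.
fn buggy_column_id(ordinal: usize) -> ColumnId {
    ordinal as ColumnId
}
```

With a schema `[a(id 0), b(id 1), c(id 2)]` projected to `[c, a]`, the ordinal of `c` moves from 2 to 0, so `buggy_column_id` returns the id of `a`, while `c.column_id` is still 2.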

## Proposed terminology

- **Field**: an entry in `TableSchema.fields` (`TableField`). May represent a nested/compound logical column.
- **Leaf column**: a physical/leaf column used in storage formats and block representations (often derived from fields via expansion).
- **Stable ID**: an identifier that remains stable across schema evolution (e.g. `ColumnId(u32)`).
- **Ordinal / index / position**: a `usize`-like index into a specific vector/view (`TableSchema.fields`, `DataSchema.fields`, a bloom-index column list, etc.). Not stable.

## Proposed strong types (examples)

The goal is to use newtypes to make intent explicit and prevent accidental mixing:

- `ColumnId(u32)`: stable id (keep existing semantics/wire format)
- `TableFieldIndex(usize)`: index into `TableSchema.fields`
- `DataFieldIndex(usize)`: index into `DataSchema.fields` (often equal to the `DataBlock` column position)
- `LeafColumnIndex(usize)`: index into a leaf-column view (explicit “leaf” meaning)
- `ParquetFieldId(u32)`: field id stored in parquet/arrow metadata (`"PARQUET:field_id"`)
- `InvertedIndexFieldId(u32)`: Tantivy schema field id (subsystem id)
- `BloomIndexColumnOrdinal(usize)`: bloom index column ordinal (subsystem ordinal)

Key rule: **external/subsystem ids must be mapped explicitly** (no implicit equivalence).
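
A rough sketch of what such newtypes and an explicit boundary mapping could look like. The derive sets, the `as_usize` accessor, and the `ParquetFieldMapping` helper are assumptions made for illustration, not existing Databend APIs.

```rust
use std::collections::HashMap;

/// Stable id (newtype form, as proposed above).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ColumnId(pub u32);

/// Index into `TableSchema.fields`; not stable.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TableFieldIndex(pub usize);

/// Field id read from parquet/arrow metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ParquetFieldId(pub u32);

impl TableFieldIndex {
    /// Explicit escape hatch for indexing into a `Vec`.
    pub fn as_usize(self) -> usize {
        self.0
    }
}

/// Hypothetical helper: the only place where parquet ids become column ids.
pub struct ParquetFieldMapping {
    map: HashMap<ParquetFieldId, ColumnId>,
}

impl ParquetFieldMapping {
    pub fn new(pairs: Vec<(ParquetFieldId, ColumnId)>) -> Self {
        Self { map: pairs.into_iter().collect() }
    }

    /// Returns `None` when the file carries an id we have no mapping for.
    pub fn column_id(&self, id: ParquetFieldId) -> Option<ColumnId> {
        self.map.get(&id).copied()
    }
}
```

Because the three types are distinct, passing a `ParquetFieldId` where a `ColumnId` is expected fails to compile; the mapping has to go through `column_id`, even when the two numbers happen to be equal.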

## Goals

- Make `ColumnId` vs ordinals unambiguous at the type level.
- Replace “semantic `usize`” plumbing with typed indices and typed schema helper APIs.
- Make Parquet/Iceberg/Tantivy/index-file boundaries explicit via conversion helpers.
- Keep changes incremental and reviewable (small PRs), avoiding a “big bang” refactor.
- Maintain compatibility where needed (e.g. for serialized plans), using `serde(rename = "...")` when renaming fields for clarity.

## Non-goals

- No global redefinition of all ids (e.g. no forced newtype for `ColumnId` across the entire repo in one shot).
- No meta/proto/wire format changes.
- No intentional behavior changes unless an existing bug is discovered and covered by tests.

## Refactoring principles

- **Eliminate anti-patterns**:
  - `idx as ColumnId`
  - variables named `field_id` that actually carry an ordinal
  - `u32/usize` casts that cross subsystem boundaries without explicit meaning
- **Prefer typed APIs** (incrementally): e.g. `field_at(TableFieldIndex)`, `index_of_field(...) -> TableFieldIndex`, `project_typed(...)`.
- **Put subsystem types near the subsystem**; keep broadly reusable types in shared crates.
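
The typed APIs named above could be shaped roughly as follows. This is a sketch over a simplified, hypothetical `TableSchema` (fields carry only a name); the real signatures, error types, and return types are open and would likely differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TableFieldIndex(pub usize);

/// Simplified stand-in for `TableField`.
#[derive(Debug, Clone, PartialEq)]
pub struct TableField {
    pub name: String,
}

/// Simplified stand-in for `TableSchema`.
pub struct TableSchema {
    fields: Vec<TableField>,
}

impl TableSchema {
    pub fn new(names: &[&str]) -> Self {
        let fields = names
            .iter()
            .map(|n| TableField { name: n.to_string() })
            .collect();
        Self { fields }
    }

    /// Look up a field by name and return a typed index into this schema.
    pub fn index_of_field(&self, name: &str) -> Option<TableFieldIndex> {
        self.fields
            .iter()
            .position(|f| f.name == name)
            .map(TableFieldIndex)
    }

    /// Typed access; the raw `usize` never leaves the schema type.
    pub fn field_at(&self, index: TableFieldIndex) -> Option<&TableField> {
        self.fields.get(index.0)
    }

    /// Project by typed indices; `None` if any index is out of range.
    pub fn project_typed(&self, indices: &[TableFieldIndex]) -> Option<TableSchema> {
        indices
            .iter()
            .map(|&i| self.field_at(i).cloned())
            .collect::<Option<Vec<_>>>()
            .map(|fields| TableSchema { fields })
    }
}
```

Note that an index obtained from one schema is still only meaningful for that schema: after `project_typed`, callers should ask the projected schema for a fresh `TableFieldIndex` rather than reuse the old one.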

## Roadmap (staged PR series)

> Exact slicing may change based on dependencies and review feedback.

1) **Foundation**
   - Introduce core newtypes + conversion helpers.
   - Add typed schema helper APIs while keeping existing APIs.

2) **Indexing subsystems**
   - Bloom index: replace “ordinal stored in `ColumnId`” patterns with an explicit ordinal type.
   - Inverted index: wrap Tantivy field id in a dedicated type and stop leaking raw `u32/usize`.

3) **Storage format boundaries**
   - Parquet/Iceberg: treat `"PARQUET:field_id"` as `ParquetFieldId`, and require explicit mapping to/from `ColumnId`.

4) **Planner/executor cleanup**
   - Rename misleading variables/maps (`field_id` → `table_field_index` where appropriate).
   - Adopt typed indices in hot paths (e.g. mutation pipeline) with compatibility preserved via serde rename if needed.

5) **Follow-ups (optional)**
   - Doc: a short developer note describing “field vs leaf column” and “stable id vs ordinal”, plus common pitfalls.
   - Additional sweeps across remaining modules.
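
For the bloom index step, one possible shape is sketched below: the ordinal gets its own type, and the index's column list is the only place where ordinals and column ids are translated. `BloomIndexColumns` and its methods are hypothetical names used for illustration only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ColumnId(pub u32);

/// Position of a column inside one bloom index's column list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BloomIndexColumnOrdinal(pub usize);

/// Hypothetical: the column ids covered by a bloom index, in index order.
pub struct BloomIndexColumns {
    ids: Vec<ColumnId>,
}

impl BloomIndexColumns {
    pub fn new(ids: Vec<ColumnId>) -> Self {
        Self { ids }
    }

    /// Stable id -> ordinal within this index, if the column is indexed.
    pub fn ordinal_of(&self, id: ColumnId) -> Option<BloomIndexColumnOrdinal> {
        self.ids
            .iter()
            .position(|&c| c == id)
            .map(BloomIndexColumnOrdinal)
    }

    /// Ordinal within this index -> stable id.
    pub fn column_id_at(&self, ordinal: BloomIndexColumnOrdinal) -> Option<ColumnId> {
        self.ids.get(ordinal.0).copied()
    }
}
```

With this shape, an "ordinal stored in `ColumnId`" no longer type-checks; code that needs the stable id has to call `column_id_at` on the index it got the ordinal from.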

## Tracking checklist

- [ ] Foundation: core newtypes + typed schema APIs
- [ ] Bloom index: remove ordinal-as-`ColumnId`
- [ ] Inverted index: strongly typed Tantivy field id
- [ ] Parquet/Iceberg boundary: explicit `ParquetFieldId` mapping
- [ ] Planner/executor: rename + typed indices adoption
- [ ] Optional: documentation + broader sweeps

## Open questions

- Do we need a dedicated type to distinguish `DataSchema` index vs `DataBlock` column position (when they can diverge)?
- For nested fields, should we introduce a typed `ColumnPath` (instead of passing `Vec<String>` / `name:subname` strings around)?

