Skip to content

SQL API: Add schema-aware array-rows response + queryRef/schemaRef caching (Kalam-link friendly) #50

@jamals86

Description

@jamals86

Problem

Today /v1/api/sql returns result rows as JSON objects plus a columns list, and the “typed” mode wraps each value. This makes it hard for Kalam-link to:

  • reliably understand the returned schema (types/nullability/order)
  • decode efficiently (objects are heavier than arrays)
  • cache decoding metadata safely across requests (schema drift vs query hash)

Relevant code:

  • Request model: sql_request.rs
  • Response model: sql_response.rs
  • Handler: sql_handler.rs
  • ScalarValue→JSON conversions: arrow_json_conversion.rs

Goals

  1. Make the REST SQL response self-describing (Arrow schema), so clients know exact output types/order.
  2. Make the payload efficient (rows as arrays, stable column order).
  3. Enable safe client+server caching:
    • Server caches “prepared query template” (queryRef)
    • Client caches schema decoding (schemaRef → schema)
  4. Keep ScalarValue→JSON conversion logic centralized (no duplicate conversion paths).
  5. Return a schema with datatype of: KalamDataType

Non-goals (for this issue)

  • Changing WebSocket payload formats (can follow later, but this issue targets REST /v1/api/sql).
  • Introducing a full GraphQL layer or persisted-query registry UI.

Proposed API changes

New request fields (backward-compatible)

Extend the existing request model with optional fields:

  • queryRef?: { version: number, sha256: string }
  • sql?: string (already exists; allow either sql OR queryRef)
  • params?: JsonValue[] (already exists)
  • includeSchema?: boolean (default false when client sends knownSchemaRefs, else true on first query)
  • knownSchemaRefs?: string[] (client tells server which schema hashes it already has)
  • resultFormat?: "array_rows" | "object_rows" (default object_rows for backward compat; Kalam-link uses array_rows)

Notes:

  • Keep serialization_mode as-is (simple vs typed value encoding), but do not change row shape by mode.
  • Multi-statement + params stays rejected (current behavior), unless expanded later.

New response fields (for SELECT-like results)

For each QueryResult (or a new QueryResultV2):

  • queryRef?: { version: number, sha256: string } (returned when request used sql)
  • schemaRef?: { version: number, sha256: string } (always returned for row results)
  • schema?: { fields: [...] , metadata?: ... } (returned only when includeSchema=true OR client doesn’t have schemaRef)
  • rows: array of rows
    • when resultFormat="array_rows": rows: JsonValue[][]
    • when resultFormat="object_rows": keep current rows: object[]
  • Deprecate columns when schema is present (or keep temporarily but treat as redundant)

Optional endpoint (recommended for robustness)

  • GET /v1/api/schema/{schemaSha256}
    • Returns { schemaRef, schema }
    • Lets clients recover schema without re-running the query.

Hashing / caching semantics (critical)

schemaRef (correctness)

  • schemaRef.sha256 = sha256(canonical_arrow_schema_json)
  • Canonicalization requirements:
    • stable field order
    • include name, data_type, nullable (and ideally metadata)
  • Must be computed from the output schema of the query (post-projection), not just table schema.

queryRef (server perf)

Used to let the client send only {queryRef, params} on subsequent calls.

  • queryRef.sha256 should be derived from at least:
    • normalized SQL template (no params embedded)
    • relevant session settings that can affect planning/schema
    • auth/tenant scope key (so we never reuse a plan across incompatible permission contexts)

A safe starting scope key could include: user role + namespace_id + any “current user” context.

Also define server cache behavior:

  • LRU + TTL (avoid unbounded memory) use Moka
  • Why: async-safe, TTL + size-based eviction, lock-free reads
  • Cache entries invalidated on schemaRef change, or simply allow schemaRef mismatch to trigger schema resend.

Concrete examples (end-to-end)

Step 1: First call (client only knows SQL)

Request:

POST /v1/api/sql
{
  "sql": "select * from chat.messages where conversation_id = $1",
  "params": ["conv_123"],
  "serialization_mode": "simple",
  "resultFormat": "array_rows",
  "includeSchema": true
}

Response:

200 OK
{
  "status": "success",
  "results": [
    {
      "queryRef": { "version": 1, "sha256": "q_7a1b...e21" },
      "schemaRef": { "version": 1, "sha256": "s_55c9...91a" },
      "schema": {
        "fields": [
          {"name":"msg_id","data_type":"Int64","nullable":false},
          {"name":"conversation_id","data_type":"Utf8","nullable":false},
          {"name":"from","data_type":"Utf8","nullable":false},
          {"name":"timestamp","data_type":"Timestamp(Microsecond, None)","nullable":false},
          {"name":"content","data_type":"Utf8","nullable":true},
          {"name":"metadata","data_type":"Utf8","nullable":true}
        ]
      },
      "rows": [
        ["123456789", "conv_123", "alice", 1699000000000000, "Hello", null]
      ],
      "row_count": 1
    }
  ],
  "took": 15.0
}

Step 2: Subsequent call (client sends queryRef + params only)

Request:

POST /v1/api/sql
{
  "queryRef": { "version": 1, "sha256": "q_7a1b...e21" },
  "params": ["conv_999"],
  "serialization_mode": "simple",
  "resultFormat": "array_rows",
  "knownSchemaRefs": ["s_55c9...91a"]
}

Response (schema omitted, but schemaRef included):

200 OK
{
  "status": "success",
  "results": [
    {
      "schemaRef": { "version": 1, "sha256": "s_55c9...91a" },
      "rows": [
        ["123456790", "conv_999", "bob", 1699000000001234, "Hi", null]
      ],
      "row_count": 1
    }
  ],
  "took": 3.2
}

Step 3: Schema drift (ALTER TABLE / view change / SELECT * expansion)

Client sends same request as Step 2; server detects schemaRef changed and returns schema again:

200 OK
{
  "status": "success",
  "results": [
    {
      "schemaRef": { "version": 1, "sha256": "s_a912...44f" },
      "schema": {
        "fields": [
          {"name":"msg_id","data_type":"Int64","nullable":false},
          "...",
          {"name":"edited_at","data_type":"Timestamp(Microsecond, None)","nullable":true}
        ]
      },
      "rows": [
        ["123456789", "conv_123", "alice", 1699000000000000, "Hello", null, null]
      ],
      "row_count": 1
    }
  ],
  "took": 3.4
}

Important considerations / pitfalls

  • Do not key server queryRef cache only by SQL text; it must be scoped to avoid cross-user leakage and schema mismatches.
  • schemaRef should be based on output schema, not physical table schema.
  • Current masking logic in the handler should be reviewed for array row format (masking by column name still possible, but row shape changes).
  • Keep ScalarValue→JSON conversion centralized (extend conversion utilities rather than adding one-off conversions).

Implementation sketch (high level)

  1. Add new response types/fields in sql_response.rs
    • Support rows as either array-rows or object-rows (feature-flagged by resultFormat)
    • Add schemaRef, optional schema, optional queryRef
  2. Extend request model in sql_request.rs
  3. Add schema serialization helper + sha256 hashing
    • Canonical Arrow schema JSON + sha256
  4. Extend conversion in arrow_json_conversion.rs
    • Add RecordBatch → array-rows conversion that still calls the single ScalarValue→JSON function
  5. Update handler logic in sql_handler.rs
    • If request has queryRef, resolve cached SQL/plan; else compute and return queryRef
    • Always attach schemaRef; include schema only if requested/needed
  6. Add tests
    • schemaRef stable for same schema
    • schema omitted when knownSchemaRefs contains it
    • schema included when schema changes

Acceptance criteria

  • REST SQL endpoint supports resultFormat="array_rows" with stable ordering matching Arrow schema fields.
  • Response includes schemaRef for row-returning statements.
  • Response includes schema only when needed (first call / cache miss / schema drift).
  • Client can send queryRef + params without sending SQL on subsequent calls.
  • No duplicate ScalarValue→JSON conversion logic introduced.
  • Backward compatibility preserved (existing clients can keep using object rows).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions