
Extract Lucene query tests into engine-agnostic JSON test suite #2

Draft
Copilot wants to merge 1 commit into main from copilot/test-query-logic-accuracy


Copilot AI commented May 15, 2026

Extracts core Lucene search query tests into a portable, implementation-independent JSON format — decoupling the semantic contract (given documents + query → expected results) from Java/Lucene internals. Designed to validate a Rust-based engine implementation against the same behavioral spec.

Format

Each test file references a reusable dataset + schema, defines queries, and asserts on results:

{
  "dataset": "fuzzy-words",
  "schema": "single-keyword-field",
  "tests": [
    {
      "id": "fuzzy-ordering-bbbbb",
      "description": "Results ordered by edit distance: bbbbb(0), abbbb(1), aabbb(2)",
      "query": {
        "type": "fuzzy",
        "field": "field",
        "value": "bbbbb",
        "max_edits": 2,
        "prefix_length": 0
      },
      "expected": {
        "count": 3,
        "ordered": ["bbbbb", "abbbb", "aabbb"]
      }
    }
  ]
}
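To make the ordering in this example concrete, here is a minimal Rust sketch of how an engine under test might reproduce it. It assumes plain Levenshtein distance and a stable sort by distance; Lucene itself ranks fuzzy hits by score, so this only illustrates the contract the test describes. The function names and the candidate list in the usage note are hypothetical and are not taken from the suite.

```rust
/// Levenshtein distance (insert, delete, substitute) using a two-row DP table.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i-1 chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        // cur[0] = i: deleting i chars turns a prefix of `a` into the empty string.
        let mut cur = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur[j] = (prev[j] + 1).min(cur[j - 1] + 1).min(prev[j - 1] + cost);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Keep candidates within `max_edits` of `query`, closest first.
/// Hypothetical helper: a real engine would rank by score, not raw distance.
fn fuzzy_rank<'a>(query: &str, candidates: &[&'a str], max_edits: usize) -> Vec<&'a str> {
    let mut scored: Vec<(usize, &'a str)> = candidates
        .iter()
        .map(|c| (edit_distance(query, c), *c))
        .filter(|(d, _)| *d <= max_edits)
        .collect();
    // Stable sort: candidates at equal distance keep their input order.
    scored.sort_by_key(|(d, _)| *d);
    scored.into_iter().map(|(_, c)| c).collect()
}
```

Calling `fuzzy_rank("bbbbb", &["aaabb", "aabbb", "abbbb", "bbbbb", "ddddd"], 2)` on this made-up candidate list returns `["bbbbb", "abbbb", "aabbb"]`, matching the `ordered` assertion above.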

What's included

  • search-test-suite/ — self-contained at repo root
  • 14 datasets, 8 schemas — reusable across test files
  • 70 test cases across 23 files covering 10 query types: term, boolean, phrase, fuzzy, prefix, wildcard, range, regexp, match_all, match_none
  • JSON Schema (schema.json) for structural validation of test files
  • Java reference runner: SearchTestSuiteRunner.java executes the suite against Lucene to verify spec correctness
  • Rust runner placeholder with type definitions and implementation guide

Extraction criteria

Only tests following the index docs → query → assert hits pattern were extracted. Skipped: equals/hashCode, rewrite internals, scorer/weight, randomized stress tests, codec-specific tests — anything coupled to Lucene implementation details rather than search semantics.

Expected results support

count (exact), count_min (lower bound), ordered (strict relevance order), hits.must_contain / hits.must_not_contain (set membership), match_field (which stored field to check).
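
As a rough illustration of how a runner could evaluate these assertions, the following Rust sketch checks a list of hit values against an expected block. It is not the code in this PR. It assumes the hit values have already been read from the field named by match_field, and that ordered requires the hit list to equal the expected list exactly; the `Expected` struct and `check` function are hypothetical names.

```rust
/// Hypothetical mirror of the assertion keys listed above; every key is optional.
#[derive(Default)]
struct Expected {
    count: Option<usize>,          // exact number of hits
    count_min: Option<usize>,      // lower bound on the number of hits
    ordered: Option<Vec<String>>,  // hits in strict relevance order
    must_contain: Vec<String>,     // hits.must_contain
    must_not_contain: Vec<String>, // hits.must_not_contain
}

/// Check hit values (assumed already extracted from `match_field`) against
/// the expectations. Returns the first failed assertion as an error message.
fn check(expected: &Expected, hits: &[String]) -> Result<(), String> {
    if let Some(n) = expected.count {
        if hits.len() != n {
            return Err(format!("count: expected {}, got {}", n, hits.len()));
        }
    }
    if let Some(n) = expected.count_min {
        if hits.len() < n {
            return Err(format!("count_min: expected at least {}, got {}", n, hits.len()));
        }
    }
    if let Some(order) = &expected.ordered {
        // Assumption: `ordered` must match the full hit list, not just a prefix.
        if hits != order.as_slice() {
            return Err(format!("ordered: expected {:?}, got {:?}", order, hits));
        }
    }
    for v in &expected.must_contain {
        if !hits.contains(v) {
            return Err(format!("must_contain: missing {:?}", v));
        }
    }
    for v in &expected.must_not_contain {
        if hits.contains(v) {
            return Err(format!("must_not_contain: unexpected {:?}", v));
        }
    }
    Ok(())
}
```

Returning the first failure as a `Result` keeps the sketch short; a real runner may prefer to collect every failed assertion per test case before reporting.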

…format

Extract Lucene core query tests into engine-agnostic JSON test suite with:
- 14 reusable datasets covering all query types
- 8 field schemas (keyword, text, positions)
- 70 test cases across 10 query types in 23 test files
- JSON Schema for validation
- Java reference test runner
- Rust runner placeholder with implementation guide

Agent-Logs-Url: https://github.com/infinilabs/lucene/sessions/38e89113-7701-4f8d-ad64-20e8ae5d87ca

Co-authored-by: medcl <64487+medcl@users.noreply.github.com>