infinilabs · Copilot · May 15, 2026
diff --git a/search-test-suite/README.md b/search-test-suite/README.md
@@ -0,0 +1,159 @@
+# Search Test Suite
+
+A portable, engine-agnostic test specification format for validating search engine query implementations.
+
+## Overview
+
+This test suite captures the **semantic contract** of core search query operations — datasets, schemas, queries, and expected results — in a JSON format that can be consumed by any search engine implementation (Lucene, Elasticsearch, a Rust-based engine, etc.).
+
+Tests are extracted from Apache Lucene's core query test suite but are deliberately **implementation-independent**: no Java types, no scorer internals, no codec details.
+
+## Directory Structure
+
+```
+search-test-suite/
+├── README.md              # This file
+├── schema.json            # JSON Schema for validating test files
+├── datasets/              # Reusable document datasets
+├── schemas/               # Reusable field/index schemas
+└── tests/                 # Test suites, one folder per query type
+    ├── term_query/
+    ├── boolean_query/
+    ├── phrase_query/
+    ├── fuzzy_query/
+    ├── prefix_query/
+    ├── wildcard_query/
+    ├── range_query/
+    ├── regexp_query/
+    ├── match_all_query/
+    └── match_none_query/
+```
+
+## Format Specification
+
+### Dataset Files
+
+Each dataset file (`datasets/*.json`) defines a reusable set of documents:
+
+```json
+{
+  "id": "dataset-name",
+  "description": "Human-readable description",
+  "documents": [
+    { "_id": "1", "field_name": "field_value", ... }
+  ]
+}
+```
+
+### Schema Files
+
+Each schema file (`schemas/*.json`) defines the index field mappings:
+
+```jsonc
+{
+  "id": "schema-name",
+  "description": "Human-readable description",
+  "fields": [
+    {
+      "name": "field_name",
+      "type": "text|keyword|integer|long|float|double|boolean|date",
+      "stored": true,
+      "indexed": true,
+      "analyzer": {                    // optional, for "text" fields
+        "tokenizer": "whitespace|standard",
+        "lowercase": false,
+        "position_increment_gap": 100
+      }
+    }
+  ]
+}
+```
+
+**Field Types:**
+- `keyword` — Not analyzed, exact match (equivalent to Lucene `StringField`)
+- `text` — Analyzed/tokenized (equivalent to Lucene `TextField`)
+- `integer`, `long`, `float`, `double` — Numeric types
+- `boolean`, `date` — Other common types
+
+### Test Files
+
+Each test file (`tests/<query_type>/*.json`) defines one or more test cases:
+
+```json
+{
+  "description": "Test suite description",
+  "source": "Original Lucene test class and method",
+  "dataset": "dataset-id",
+  "schema": "schema-id",
+  "tests": [
+    {
+      "id": "unique-test-id",
+      "description": "What this test verifies",
+      "query": { ... },
+      "expected": { ... }
+    }
+  ]
+}
+```
+
+### Query Types
+
+| Type | Key Fields |
+|------|-----------|
+| `term` | `field`, `value` |
+| `boolean` | `clauses[]` with `occur` (must/should/must_not/filter), `min_should_match` |
+| `phrase` | `field`, `terms[]`, `slop` |
+| `fuzzy` | `field`, `value`, `max_edits`, `prefix_length`, `max_expansions` |
+| `prefix` | `field`, `value` |
+| `wildcard` | `field`, `pattern` |
+| `regexp` | `field`, `pattern`, `flags` |
+| `range` | `field`, `lower`, `upper`, `include_lower`, `include_upper` |
+| `match_all` | _(none)_ |
+| `match_none` | `reason` (optional) |
+
+### Expected Results
+
+```jsonc
+{
+  "count": 3,                              // exact total hit count
+  "count_min": 1,                          // minimum hits (for approximate checks)
+  "ordered": ["val1", "val2"],             // strict order by relevance (field values)
+  "hits": {
+    "must_contain": ["val1", "val2"],      // these field values MUST appear in results
+    "must_not_contain": ["val3"]           // these field values MUST NOT appear
+  },
+  "match_field": "field"                   // which stored field to check (default: query field)
+}
+```
+
+## Writing a Test Runner
+
+Any conforming test runner must:
+
+1. **Read** a test JSON file
+2. **Resolve** the referenced dataset and schema files
+3. **Create** an in-memory index from the dataset using the schema's field definitions
+4. **Build** the query from the `query` block
+5. **Execute** the query against the index
+6. **Assert** against the `expected` block:
+   - `count` → exact match on total hits
+   - `count_min` → total hits >= value
+   - `ordered` → strict sequence match on the `match_field` values, in result order
+   - `hits.must_contain` → every listed value appears among the `match_field` values of the results
+   - `hits.must_not_contain` → no listed value appears among the `match_field` values of the results
+
+## Validation
+
+Use the provided `schema.json` to validate test files:
+
+```bash
+# Using ajv-cli (Node.js)
+npx ajv-cli validate -s schema.json -d "tests/**/*.json"
+
+# Using check-jsonschema (Python)
+check-jsonschema --schemafile schema.json tests/**/*.json
+```
+
+## License
+
+This test suite is derived from Apache Lucene test cases, licensed under the Apache License, Version 2.0.
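Note on the README's "Writing a Test Runner" section: the assertion step (step 6) could look roughly like the Python sketch below. It is not part of this change. `check_expected` and `hit_values` are hypothetical names, and the sketch assumes that the runner retrieves every hit (so the total hit count equals the length of the list) and that `ordered` must equal the full result sequence rather than a prefix of it.

```python
from typing import Any


def check_expected(expected: dict[str, Any], hit_values: list[str]) -> list[str]:
    """Return failure messages; an empty list means the test case passed.

    `hit_values` holds the `match_field` value of every hit, in result order.
    """
    failures: list[str] = []
    total = len(hit_values)  # assumption: all hits were retrieved

    if "count" in expected and total != expected["count"]:
        failures.append(f"count: expected {expected['count']}, got {total}")
    if "count_min" in expected and total < expected["count_min"]:
        failures.append(f"count_min: expected >= {expected['count_min']}, got {total}")
    if "ordered" in expected and hit_values != expected["ordered"]:
        failures.append(f"ordered: expected {expected['ordered']}, got {hit_values}")

    hits = expected.get("hits", {})
    found = set(hit_values)
    missing = [v for v in hits.get("must_contain", []) if v not in found]
    if missing:
        failures.append(f"must_contain: missing {missing}")
    unexpected = [v for v in hits.get("must_not_contain", []) if v in found]
    if unexpected:
        failures.append(f"must_not_contain: found {unexpected}")
    return failures


# Illustrative expected block (not taken from the test files) for a term
# query on `foo` = "bar" against the term-basic dataset.
example = {"count": 1, "hits": {"must_contain": ["bar"], "must_not_contain": ["baz"]}}
print(check_expected(example, ["bar"]))  # prints []
```

Returning a list of messages instead of raising on the first mismatch lets a runner report every failed assertion for a test case at once.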
diff --git a/search-test-suite/datasets/boolean-demorgan.json b/search-test-suite/datasets/boolean-demorgan.json
@@ -0,0 +1,8 @@
+{
+  "id": "boolean-demorgan",
+  "description": "Documents for De Morgan boolean logic tests",
+  "documents": [
+    { "_id": "1", "field": "foo bar" },
+    { "_id": "2", "field": "foo baz" }
+  ]
+}
diff --git a/search-test-suite/datasets/boolean-mixed.json b/search-test-suite/datasets/boolean-mixed.json
@@ -0,0 +1,9 @@
+{
+  "id": "boolean-mixed",
+  "description": "Documents with a single text field (key) for boolean query tests",
+  "documents": [
+    { "_id": "1", "key": "one" },
+    { "_id": "2", "key": "two" },
+    { "_id": "3", "key": "three four" }
+  ]
+}
diff --git a/search-test-suite/datasets/fuzzy-basic.json b/search-test-suite/datasets/fuzzy-basic.json
@@ -0,0 +1,7 @@
+{
+  "id": "fuzzy-basic",
+  "description": "Single document for basic fuzzy prefix test",
+  "documents": [
+    { "_id": "1", "field": "abc" }
+  ]
+}
diff --git a/search-test-suite/datasets/fuzzy-names.json b/search-test-suite/datasets/fuzzy-names.json
@@ -0,0 +1,24 @@
+{
+  "id": "fuzzy-names",
+  "description": "German-style names for fuzzy matching with real-world data",
+  "documents": [
+    { "_id": "1", "field": "LANGE" },
+    { "_id": "2", "field": "LUETH" },
+    { "_id": "3", "field": "PIRSING" },
+    { "_id": "4", "field": "RIEGEL" },
+    { "_id": "5", "field": "TRZECZIAK" },
+    { "_id": "6", "field": "WALKER" },
+    { "_id": "7", "field": "WBR" },
+    { "_id": "8", "field": "WE" },
+    { "_id": "9", "field": "WEB" },
+    { "_id": "10", "field": "WEBE" },
+    { "_id": "11", "field": "WEBER" },
+    { "_id": "12", "field": "WEBERE" },
+    { "_id": "13", "field": "WEBREE" },
+    { "_id": "14", "field": "WEBEREI" },
+    { "_id": "15", "field": "WBRE" },
+    { "_id": "16", "field": "WITTKOPF" },
+    { "_id": "17", "field": "WOJNAROWSKI" },
+    { "_id": "18", "field": "WRICKE" }
+  ]
+}
diff --git a/search-test-suite/datasets/fuzzy-words.json b/search-test-suite/datasets/fuzzy-words.json
@@ -0,0 +1,13 @@
+{
+  "id": "fuzzy-words",
+  "description": "Words with incremental edit distances for fuzzy matching tests",
+  "documents": [
+    { "_id": "1", "field": "aaaaa" },
+    { "_id": "2", "field": "aaaab" },
+    { "_id": "3", "field": "aaabb" },
+    { "_id": "4", "field": "aabbb" },
+    { "_id": "5", "field": "abbbb" },
+    { "_id": "6", "field": "bbbbb" },
+    { "_id": "7", "field": "ddddd" }
+  ]
+}
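Note on `fuzzy-words.json`: the "incremental edit distances" in this dataset can be checked with a plain Levenshtein distance, as in the sketch below. This is an illustration only. It ignores `prefix_length`, `max_expansions`, and transpositions, so it is not a statement of how any particular engine evaluates a fuzzy query. `levenshtein` and `within_edits` are hypothetical helper names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# `field` values copied from fuzzy-words.json
WORDS = ["aaaaa", "aaaab", "aaabb", "aabbb", "abbbb", "bbbbb", "ddddd"]


def within_edits(query: str, max_edits: int) -> list[str]:
    """Words whose edit distance from `query` is at most `max_edits`."""
    return [w for w in WORDS if levenshtein(query, w) <= max_edits]


print(within_edits("aaaaa", 2))  # prints ['aaaaa', 'aaaab', 'aaabb']
```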
diff --git a/search-test-suite/datasets/match-all-basic.json b/search-test-suite/datasets/match-all-basic.json
@@ -0,0 +1,9 @@
+{
+  "id": "match-all-basic",
+  "description": "Simple documents for match-all and match-none query tests",
+  "documents": [
+    { "_id": "1", "key": "one" },
+    { "_id": "2", "key": "two" },
+    { "_id": "3", "key": "three four" }
+  ]
+}
diff --git a/search-test-suite/datasets/match-none-basic.json b/search-test-suite/datasets/match-none-basic.json
@@ -0,0 +1,9 @@
+{
+  "id": "match-none-basic",
+  "description": "Simple documents for match-none query tests",
+  "documents": [
+    { "_id": "1", "key": "one" },
+    { "_id": "2", "key": "two" },
+    { "_id": "3", "key": "three" }
+  ]
+}
diff --git a/search-test-suite/datasets/phrase-sentences.json b/search-test-suite/datasets/phrase-sentences.json
@@ -0,0 +1,15 @@
+{
+  "id": "phrase-sentences",
+  "description": "Documents with ordered word sequences for phrase query tests",
+  "documents": [
+    {
+      "_id": "1",
+      "field": "one two three four five",
+      "repeated": "this is a repeated field - first part",
+      "repeated_2": "second part of a repeated field",
+      "palindrome": "one two three two one"
+    },
+    { "_id": "2", "nonexist": "phrase exist notexist exist found" },
+    { "_id": "3", "nonexist": "phrase exist notexist exist found" }
+  ]
+}
diff --git a/search-test-suite/datasets/prefix-categories.json b/search-test-suite/datasets/prefix-categories.json
@@ -0,0 +1,9 @@
+{
+  "id": "prefix-categories",
+  "description": "Hierarchical category paths for prefix query tests",
+  "documents": [
+    { "_id": "1", "category": "/Computers" },
+    { "_id": "2", "category": "/Computers/Mac" },
+    { "_id": "3", "category": "/Computers/Windows" }
+  ]
+}
diff --git a/search-test-suite/datasets/range-letters.json b/search-test-suite/datasets/range-letters.json
@@ -0,0 +1,10 @@
+{
+  "id": "range-letters",
+  "description": "Single-letter documents for range query tests",
+  "documents": [
+    { "_id": "1", "content": "A" },
+    { "_id": "2", "content": "B" },
+    { "_id": "3", "content": "C" },
+    { "_id": "4", "content": "D" }
+  ]
+}
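Note on `range-letters.json`: the effect of `include_lower` and `include_upper` on this dataset can be sketched as below. The sketch compares values by plain string order and requires both bounds. How open-ended bounds are expressed is not covered by the README, so it is left out here. `in_range` and `range_hits` are hypothetical names.

```python
# `content` values copied from range-letters.json
LETTERS = ["A", "B", "C", "D"]


def in_range(value: str, lower: str, upper: str, *,
             include_lower: bool, include_upper: bool) -> bool:
    """True if `value` lies between the bounds, honoring inclusivity flags."""
    above = value > lower or (include_lower and value == lower)
    below = value < upper or (include_upper and value == upper)
    return above and below


def range_hits(lower: str, upper: str, *,
               include_lower: bool, include_upper: bool) -> list[str]:
    """Dataset values matched by the given range."""
    return [v for v in LETTERS
            if in_range(v, lower, upper,
                        include_lower=include_lower,
                        include_upper=include_upper)]


print(range_hits("A", "C", include_lower=True, include_upper=False))   # prints ['A', 'B']
print(range_hits("A", "C", include_lower=False, include_upper=False))  # prints ['B']
```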
diff --git a/search-test-suite/datasets/regexp-text.json b/search-test-suite/datasets/regexp-text.json
@@ -0,0 +1,7 @@
+{
+  "id": "regexp-text",
+  "description": "Rich text document for regexp query pattern matching tests",
+  "documents": [
+    { "_id": "1", "field": "the quick brown fox jumps over the lazy ??? dog 493432 49344 [foo] 12.3 \\ ς" }
+  ]
+}
diff --git a/search-test-suite/datasets/term-basic.json b/search-test-suite/datasets/term-basic.json
@@ -0,0 +1,8 @@
+{
+  "id": "term-basic",
+  "description": "Simple documents for basic term query matching",
+  "documents": [
+    { "_id": "1", "foo": "bar" },
+    { "_id": "2", "foo": "baz" }
+  ]
+}
diff --git a/search-test-suite/datasets/wildcard-metals.json b/search-test-suite/datasets/wildcard-metals.json
@@ -0,0 +1,8 @@
+{
+  "id": "wildcard-metals",
+  "description": "Metal-related words for wildcard pattern matching tests",
+  "documents": [
+    { "_id": "1", "body": "metal" },
+    { "_id": "2", "body": "metals" }
+  ]
+}
diff --git a/search-test-suite/datasets/wildcard-questionmark.json b/search-test-suite/datasets/wildcard-questionmark.json
@@ -0,0 +1,10 @@
+{
+  "id": "wildcard-questionmark",
+  "description": "Words for single-character wildcard (?) tests",
+  "documents": [
+    { "_id": "1", "body": "metal" },
+    { "_id": "2", "body": "metals" },
+    { "_id": "3", "body": "mXtals" },
+    { "_id": "4", "body": "mXtXls" }
+  ]
+}
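Note on `wildcard-questionmark.json`: assuming `?` matches exactly one character and `*` matches any sequence of characters (including none), and that each `body` value is indexed as a single term, the expected matches can be sketched by translating the wildcard pattern to a regular expression. The patterns below are illustrative rather than taken from the test files, escape characters are not handled, and `wildcard_to_regex` and `wildcard_match` are hypothetical names.

```python
import re

# `body` values copied from wildcard-questionmark.json
BODIES = ["metal", "metals", "mXtals", "mXtXls"]


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate `?` and `*` to regex; every other character is literal."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")    # exactly one character
        elif ch == "*":
            parts.append(".*")   # zero or more characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern: str, terms: list[str]) -> list[str]:
    """Terms that the whole pattern matches (anchored at both ends)."""
    rx = wildcard_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]


print(wildcard_match("m?tals", BODIES))  # prints ['metals', 'mXtals']
print(wildcard_match("metal?", BODIES))  # prints ['metals']
```

Using `fullmatch` keeps the pattern anchored to the whole term, which is why `metal?` does not match `metal`.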