159 changes: 159 additions & 0 deletions search-test-suite/README.md
@@ -0,0 +1,159 @@
# Search Test Suite

A portable, engine-agnostic test specification format for validating search engine query implementations.

## Overview

This test suite captures the **semantic contract** of core search query operations — datasets, schemas, queries, and expected results — in a JSON format that can be consumed by any search engine implementation (Lucene, Elasticsearch, a Rust-based engine, etc.).

Tests are extracted from Apache Lucene's core query test suite but are deliberately **implementation-independent**: no Java types, no scorer internals, no codec details.

## Directory Structure

```
search-test-suite/
├── README.md        # This file
├── schema.json      # JSON Schema for validating test files
├── datasets/        # Reusable document datasets
├── schemas/         # Reusable field/index schemas
└── tests/           # Test suites, one folder per query type
    ├── term_query/
    ├── boolean_query/
    ├── phrase_query/
    ├── fuzzy_query/
    ├── prefix_query/
    ├── wildcard_query/
    ├── range_query/
    ├── regexp_query/
    ├── match_all_query/
    └── match_none_query/
```

## Format Specification

### Dataset Files

Each dataset file (`datasets/*.json`) defines a reusable set of documents:

```json
{
  "id": "dataset-name",
  "description": "Human-readable description",
  "documents": [
    { "_id": "1", "field_name": "field_value", ... }
  ]
}
```

### Schema Files

Each schema file (`schemas/*.json`) defines the index field mappings:

```json
{
  "id": "schema-name",
  "description": "Human-readable description",
  "fields": [
    {
      "name": "field_name",
      "type": "text|keyword|integer|long|float|double|boolean|date",
      "stored": true,
      "indexed": true,
      "analyzer": { // optional, for "text" fields
        "tokenizer": "whitespace|standard",
        "lowercase": false,
        "position_increment_gap": 100
      }
    }
  ]
}
```

**Field Types:**
- `keyword` — Not analyzed, exact match (equivalent to Lucene `StringField`)
- `text` — Analyzed/tokenized (equivalent to Lucene `TextField`)
- `integer`, `long`, `float`, `double` — Numeric types
- `boolean`, `date` — Other common types

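As a rough illustration of the difference between `keyword` and `text`, the sketch below turns a field value into index terms. It is not part of the suite: `index_terms` is a hypothetical helper, only the `whitespace` tokenizer is modelled, and treating a missing `analyzer` block as whitespace tokenization without lowercasing is an assumption.

```python
def index_terms(value: str, field: dict) -> list[str]:
    """Return the terms a field value contributes to the index (illustrative only)."""
    if field["type"] == "keyword":
        # keyword fields are not analyzed: the whole value is a single term
        return [value]
    if field["type"] == "text":
        analyzer = field.get("analyzer", {})
        # Assumption: a missing analyzer means whitespace tokenization, no lowercasing
        if analyzer.get("tokenizer", "whitespace") != "whitespace":
            raise NotImplementedError("only the whitespace tokenizer is sketched here")
        tokens = value.split()
        if analyzer.get("lowercase", False):
            tokens = [token.lower() for token in tokens]
        return tokens
    # Numeric, boolean and date fields are outside the scope of this sketch
    raise NotImplementedError(f"field type {field['type']!r} is not covered")


print(index_terms("three four", {"name": "key", "type": "keyword"}))  # ['three four']
print(index_terms("three four", {"name": "key", "type": "text"}))     # ['three', 'four']
```

The same document value therefore matches a term query for `four` only when the field is declared as `text`.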
### Test Files

Each test file (`tests/<query_type>/*.json`) defines one or more test cases:

```json
{
  "description": "Test suite description",
  "source": "Original Lucene test class and method",
  "dataset": "dataset-id",
  "schema": "schema-id",
  "tests": [
    {
      "id": "unique-test-id",
      "description": "What this test verifies",
      "query": { ... },
      "expected": { ... }
    }
  ]
}
```

### Query Types

| Type | Key Fields |
|------|-----------|
| `term` | `field`, `value` |
| `boolean` | `clauses[]` with `occur` (must/should/must_not/filter), `min_should_match` |
| `phrase` | `field`, `terms[]`, `slop` |
| `fuzzy` | `field`, `value`, `max_edits`, `prefix_length`, `max_expansions` |
| `prefix` | `field`, `value` |
| `wildcard` | `field`, `pattern` |
| `regexp` | `field`, `pattern`, `flags` |
| `range` | `field`, `lower`, `upper`, `include_lower`, `include_upper` |
| `match_all` | _(none)_ |
| `match_none` | `reason` (optional) |

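To make the table more concrete, here is a minimal sketch of per-term matching for the simplest query types. The `type` discriminator key and the `term_matches` helper are assumptions for illustration (this README does not show the exact layout of a query block), a missing `lower` or `upper` is treated as an open bound, and range bounds are compared as plain strings.

```python
def term_matches(query: dict, term: str) -> bool:
    """Decide whether a single indexed term satisfies a simple query (sketch)."""
    kind = query["type"]  # assumed discriminator key, not confirmed by the README
    if kind == "match_all":
        return True
    if kind == "match_none":
        return False
    if kind == "term":
        return term == query["value"]
    if kind == "prefix":
        return term.startswith(query["value"])
    if kind == "range":
        lower, upper = query.get("lower"), query.get("upper")
        if lower is not None:
            # An exclusive bound rejects the boundary term itself
            if term < lower or (term == lower and not query["include_lower"]):
                return False
        if upper is not None:
            if term > upper or (term == upper and not query["include_upper"]):
                return False
        return True
    raise NotImplementedError(f"query type {kind!r} is not covered by this sketch")


letters = ["A", "B", "C", "D"]  # values from datasets/range-letters.json
query = {"type": "range", "field": "content", "lower": "A", "upper": "C",
         "include_lower": False, "include_upper": True}
print([t for t in letters if term_matches(query, t)])  # ['B', 'C']
```

Boolean, phrase, fuzzy, wildcard and regexp queries need more machinery (clause combination, positions, edit distance, pattern compilation) and are deliberately left out here.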
### Expected Results

```json
{
  "count": 3,                           // exact total hit count
  "count_min": 1,                       // minimum hits (for approximate checks)
  "ordered": ["val1", "val2"],          // strict order by relevance (field values)
  "hits": {
    "must_contain": ["val1", "val2"],   // these field values MUST appear in results
    "must_not_contain": ["val3"]        // these field values MUST NOT appear
  },
  "match_field": "field"                // which stored field to check (default: query field)
}
```
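One possible way to evaluate an `expected` block is sketched here. `check_expected` is a hypothetical helper, not something shipped with the suite. It assumes `values` holds the `match_field` value of every hit, in result order, and that `ordered` is compared against the complete result list rather than a prefix of it.

```python
def check_expected(expected: dict, values: list[str]) -> list[str]:
    """Return a list of human-readable failures; an empty list means the test passed."""
    failures = []
    if "count" in expected and len(values) != expected["count"]:
        failures.append(f"count: expected {expected['count']}, got {len(values)}")
    if "count_min" in expected and len(values) < expected["count_min"]:
        failures.append(f"count_min: expected >= {expected['count_min']}, got {len(values)}")
    # Assumption: "ordered" must equal the full result sequence
    if "ordered" in expected and values != expected["ordered"]:
        failures.append(f"ordered: expected {expected['ordered']}, got {values}")
    hits = expected.get("hits", {})
    missing = [v for v in hits.get("must_contain", []) if v not in values]
    if missing:
        failures.append(f"must_contain: missing {missing}")
    unexpected = [v for v in hits.get("must_not_contain", []) if v in values]
    if unexpected:
        failures.append(f"must_not_contain: found {unexpected}")
    return failures


print(check_expected({"count": 2, "hits": {"must_contain": ["bar"]}}, ["bar", "baz"]))  # []
```

Returning every failure instead of stopping at the first one tends to make a mismatch easier to diagnose when porting the suite to a new engine.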

## Writing a Test Runner

Any conforming test runner must:

1. **Read** a test JSON file
2. **Resolve** the referenced dataset and schema files
3. **Create** an in-memory index from the dataset using the schema's field definitions
4. **Build** the query from the `query` block
5. **Execute** the query against the index
6. **Assert** against the `expected` block:
   - `count` → exact match on total hits
   - `count_min` → total hits >= value
   - `ordered` → strict sequence match on stored field values in result order
   - `hits.must_contain` → all listed values are found in results
   - `hits.must_not_contain` → none of the listed values are found in results

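The steps above can be sketched as a very small runner. Everything here is illustrative: `load_by_id`, `build_index` and `run_suite` are hypothetical names, only `term` queries and the `count` assertion are handled, the `type` key on the query block is an assumption, and `text` fields are simply split on whitespace.

```python
import json
from pathlib import Path


def load_by_id(directory: Path) -> dict[str, dict]:
    """Steps 1-2: parse every JSON file in a directory and key it by its "id"."""
    parsed = (json.loads(p.read_text(encoding="utf-8")) for p in sorted(directory.glob("*.json")))
    return {doc["id"]: doc for doc in parsed}


def build_index(dataset: dict, schema: dict) -> dict[tuple[str, str], list[str]]:
    """Step 3: map (field, term) to the _ids of the documents containing that term."""
    fields = {f["name"]: f for f in schema["fields"]}
    index: dict[tuple[str, str], list[str]] = {}
    for doc in dataset["documents"]:
        for name, value in doc.items():
            field = fields.get(name)
            if field is None or not field.get("indexed", True):
                continue  # "_id" and unmapped or non-indexed fields are skipped
            # Assumption: text fields are whitespace-tokenized, everything else is one term
            terms = value.split() if field["type"] == "text" else [value]
            for term in set(terms):
                index.setdefault((name, term), []).append(doc["_id"])
    return index


def run_suite(suite: dict, datasets: dict, schemas: dict) -> dict[str, bool]:
    """Steps 4-6: execute each test and report pass/fail per test id."""
    index = build_index(datasets[suite["dataset"]], schemas[suite["schema"]])
    results = {}
    for test in suite["tests"]:
        query = test["query"]
        if query.get("type") != "term":  # assumed discriminator key
            raise NotImplementedError("only term queries are sketched here")
        hits = index.get((query["field"], query["value"]), [])
        results[test["id"]] = len(hits) == test["expected"]["count"]
    return results


# Hypothetical schema and test file; the dataset mirrors datasets/term-basic.json
datasets = {"term-basic": {"id": "term-basic", "documents": [
    {"_id": "1", "foo": "bar"}, {"_id": "2", "foo": "baz"}]}}
schemas = {"example-schema": {"id": "example-schema", "fields": [
    {"name": "foo", "type": "keyword", "stored": True, "indexed": True}]}}
suite = {"dataset": "term-basic", "schema": "example-schema", "tests": [
    {"id": "example-term", "query": {"type": "term", "field": "foo", "value": "bar"},
     "expected": {"count": 1}}]}
print(run_suite(suite, datasets, schemas))  # {'example-term': True}
```

In a real runner the two registries would come from `load_by_id(Path("datasets"))` and `load_by_id(Path("schemas"))`, and the index and query steps would delegate to the engine under test.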
## Validation

Use the provided `schema.json` to validate test files:

```bash
# Using ajv-cli (Node.js)
npx ajv-cli validate -s schema.json -d "tests/**/*.json"

# Using check-jsonschema (Python)
check-jsonschema --schemafile schema.json tests/**/*.json
```

## License

This test suite is derived from Apache Lucene test cases, licensed under the Apache License, Version 2.0.
8 changes: 8 additions & 0 deletions search-test-suite/datasets/boolean-demorgan.json
@@ -0,0 +1,8 @@
{
"id": "boolean-demorgan",
"description": "Documents for De Morgan boolean logic tests",
"documents": [
{ "_id": "1", "field": "foo bar" },
{ "_id": "2", "field": "foo baz" }
]
}
9 changes: 9 additions & 0 deletions search-test-suite/datasets/boolean-mixed.json
@@ -0,0 +1,9 @@
{
"id": "boolean-mixed",
"description": "Documents with a single text field (key) for boolean query tests",
"documents": [
{ "_id": "1", "key": "one" },
{ "_id": "2", "key": "two" },
{ "_id": "3", "key": "three four" }
]
}
7 changes: 7 additions & 0 deletions search-test-suite/datasets/fuzzy-basic.json
@@ -0,0 +1,7 @@
{
"id": "fuzzy-basic",
"description": "Single document for basic fuzzy prefix test",
"documents": [
{ "_id": "1", "field": "abc" }
]
}
24 changes: 24 additions & 0 deletions search-test-suite/datasets/fuzzy-names.json
@@ -0,0 +1,24 @@
{
"id": "fuzzy-names",
"description": "German-style names for fuzzy matching with real-world data",
"documents": [
{ "_id": "1", "field": "LANGE" },
{ "_id": "2", "field": "LUETH" },
{ "_id": "3", "field": "PIRSING" },
{ "_id": "4", "field": "RIEGEL" },
{ "_id": "5", "field": "TRZECZIAK" },
{ "_id": "6", "field": "WALKER" },
{ "_id": "7", "field": "WBR" },
{ "_id": "8", "field": "WE" },
{ "_id": "9", "field": "WEB" },
{ "_id": "10", "field": "WEBE" },
{ "_id": "11", "field": "WEBER" },
{ "_id": "12", "field": "WEBERE" },
{ "_id": "13", "field": "WEBREE" },
{ "_id": "14", "field": "WEBEREI" },
{ "_id": "15", "field": "WBRE" },
{ "_id": "16", "field": "WITTKOPF" },
{ "_id": "17", "field": "WOJNAROWSKI" },
{ "_id": "18", "field": "WRICKE" }
]
}
13 changes: 13 additions & 0 deletions search-test-suite/datasets/fuzzy-words.json
@@ -0,0 +1,13 @@
{
"id": "fuzzy-words",
"description": "Words with incremental edit distances for fuzzy matching tests",
"documents": [
{ "_id": "1", "field": "aaaaa" },
{ "_id": "2", "field": "aaaab" },
{ "_id": "3", "field": "aaabb" },
{ "_id": "4", "field": "aabbb" },
{ "_id": "5", "field": "abbbb" },
{ "_id": "6", "field": "bbbbb" },
{ "_id": "7", "field": "ddddd" }
]
}
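The spacing of these words can be checked with a plain Levenshtein distance. This is a sketch for orientation only: `levenshtein` is a hypothetical helper, and an engine's fuzzy query may count edits differently (for example, by treating a transposition as a single edit).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free when equal
            ))
        previous = current
    return previous[-1]


words = ["aaaaa", "aaaab", "aaabb", "aabbb", "abbbb", "bbbbb", "ddddd"]
print({w: levenshtein("aaaaa", w) for w in words})
# {'aaaaa': 0, 'aaaab': 1, 'aaabb': 2, 'aabbb': 3, 'abbbb': 4, 'bbbbb': 5, 'ddddd': 5}
```

Under this measure, a fuzzy query for `aaaaa` with `max_edits` of 2 would be expected to match the first three documents only.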
9 changes: 9 additions & 0 deletions search-test-suite/datasets/match-all-basic.json
@@ -0,0 +1,9 @@
{
"id": "match-all-basic",
"description": "Simple documents for match-all and match-none query tests",
"documents": [
{ "_id": "1", "key": "one" },
{ "_id": "2", "key": "two" },
{ "_id": "3", "key": "three four" }
]
}
9 changes: 9 additions & 0 deletions search-test-suite/datasets/match-none-basic.json
@@ -0,0 +1,9 @@
{
"id": "match-none-basic",
"description": "Simple documents for match-none query tests",
"documents": [
{ "_id": "1", "key": "one" },
{ "_id": "2", "key": "two" },
{ "_id": "3", "key": "three" }
]
}
15 changes: 15 additions & 0 deletions search-test-suite/datasets/phrase-sentences.json
@@ -0,0 +1,15 @@
{
"id": "phrase-sentences",
"description": "Documents with ordered word sequences for phrase query tests",
"documents": [
{
"_id": "1",
"field": "one two three four five",
"repeated": "this is a repeated field - first part",
"repeated_2": "second part of a repeated field",
"palindrome": "one two three two one"
},
{ "_id": "2", "nonexist": "phrase exist notexist exist found" },
{ "_id": "3", "nonexist": "phrase exist notexist exist found" }
]
}
9 changes: 9 additions & 0 deletions search-test-suite/datasets/prefix-categories.json
@@ -0,0 +1,9 @@
{
"id": "prefix-categories",
"description": "Hierarchical category paths for prefix query tests",
"documents": [
{ "_id": "1", "category": "/Computers" },
{ "_id": "2", "category": "/Computers/Mac" },
{ "_id": "3", "category": "/Computers/Windows" }
]
}
10 changes: 10 additions & 0 deletions search-test-suite/datasets/range-letters.json
@@ -0,0 +1,10 @@
{
"id": "range-letters",
"description": "Single-letter documents for range query tests",
"documents": [
{ "_id": "1", "content": "A" },
{ "_id": "2", "content": "B" },
{ "_id": "3", "content": "C" },
{ "_id": "4", "content": "D" }
]
}
7 changes: 7 additions & 0 deletions search-test-suite/datasets/regexp-text.json
@@ -0,0 +1,7 @@
{
"id": "regexp-text",
"description": "Rich text document for regexp query pattern matching tests",
"documents": [
{ "_id": "1", "field": "the quick brown fox jumps over the lazy ??? dog 493432 49344 [foo] 12.3 \\ ς" }
]
}
8 changes: 8 additions & 0 deletions search-test-suite/datasets/term-basic.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
{
"id": "term-basic",
"description": "Simple documents for basic term query matching",
"documents": [
{ "_id": "1", "foo": "bar" },
{ "_id": "2", "foo": "baz" }
]
}
8 changes: 8 additions & 0 deletions search-test-suite/datasets/wildcard-metals.json
@@ -0,0 +1,8 @@
{
"id": "wildcard-metals",
"description": "Metal-related words for wildcard pattern matching tests",
"documents": [
{ "_id": "1", "body": "metal" },
{ "_id": "2", "body": "metals" }
]
}
10 changes: 10 additions & 0 deletions search-test-suite/datasets/wildcard-questionmark.json
@@ -0,0 +1,10 @@
{
"id": "wildcard-questionmark",
"description": "Words for single-character wildcard (?) tests",
"documents": [
{ "_id": "1", "body": "metal" },
{ "_id": "2", "body": "metals" },
{ "_id": "3", "body": "mXtals" },
{ "_id": "4", "body": "mXtXls" }
]
}
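A quick way to reason about what this dataset exercises is to translate a wildcard pattern into a regular expression. The sketch assumes the usual conventions (`*` matches any run of characters, `?` matches exactly one, `\` escapes the next character). `wildcard_to_regex` is a hypothetical helper and the patterns shown are illustrative, not taken from the test files.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a compiled regex for whole-term matching."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            parts.append(".*")   # zero or more characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        elif ch == "\\" and i + 1 < len(pattern):
            i += 1
            parts.append(re.escape(pattern[i]))  # escaped literal
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


terms = ["metal", "metals", "mXtals", "mXtXls"]  # values from this dataset
for pattern in ["m?tal", "m?t?ls", "metal*"]:
    regex = wildcard_to_regex(pattern)
    print(pattern, [t for t in terms if regex.fullmatch(t)])
# m?tal ['metal']
# m?t?ls ['metals', 'mXtals', 'mXtXls']
# metal* ['metal', 'metals']
```

`fullmatch` is used because a wildcard query has to match the whole term, not a substring of it.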