alexmm-amzn commented Sep 22, 2025

Description

Extends the FirstPassGroupingCollector to support pruning (for numeric sort fields using competitiveIterator) and skipping of non-competitive documents (for relevance score sorting using Scorable#setMinCompetitiveScore).

Both optimizations are enabled automatically and, where circumstances allow, reduce the number of hits the collector has to process.
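To illustrate the score-based skipping, here is a minimal, self-contained sketch in plain Java. It does not use any Lucene classes: FirstPassGroupingSketch, its fields, and the string group keys are hypothetical stand-ins, and the rule that a score at or below the weakest retained group's best score is non-competitive is an assumption of the sketch, not a statement about the actual implementation. In the real collector the threshold would be handed to the scorer via Scorable#setMinCompetitiveScore, so that such documents can be skipped before they reach the collector.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in (not Lucene code): first-pass grouping by relevance score. */
final class FirstPassGroupingSketch {
  private final int topNGroups;
  // Best score seen so far for each retained group.
  private final Map<String, Float> bestScoreByGroup = new HashMap<>();
  // Assumption of this sketch: scores at or below this value cannot change the top groups.
  private float minCompetitiveScore = Float.NEGATIVE_INFINITY;
  private int skipped = 0;

  FirstPassGroupingSketch(int topNGroups) {
    if (topNGroups < 1) {
      throw new IllegalArgumentException("topNGroups must be >= 1");
    }
    this.topNGroups = topNGroups;
  }

  /** Offers one hit; non-competitive hits are counted and dropped. */
  void collect(String group, float score) {
    if (score <= minCompetitiveScore) {
      skipped++; // a real scorer could skip this hit without scoring it fully
      return;
    }
    Float current = bestScoreByGroup.get(group);
    if (current != null) {
      // Known group: keep only its best score.
      if (score > current) {
        bestScoreByGroup.put(group, score);
      }
    } else {
      // New group: if the queue is full, it replaces the weakest retained group.
      if (bestScoreByGroup.size() == topNGroups) {
        bestScoreByGroup.remove(weakestGroup());
      }
      bestScoreByGroup.put(group, score);
    }
    if (bestScoreByGroup.size() == topNGroups) {
      // Once N groups are retained, the weakest group's best score is the threshold.
      minCompetitiveScore = bestScoreByGroup.get(weakestGroup());
    }
  }

  /** Returns the retained group with the lowest best score. */
  private String weakestGroup() {
    String weakest = null;
    float lowest = Float.POSITIVE_INFINITY;
    for (Map.Entry<String, Float> e : bestScoreByGroup.entrySet()) {
      if (weakest == null || e.getValue() < lowest) {
        weakest = e.getKey();
        lowest = e.getValue();
      }
    }
    return weakest;
  }

  int skippedCount() {
    return skipped;
  }

  float minCompetitiveScore() {
    return minCompetitiveScore;
  }

  Set<String> groups() {
    return bestScoreByGroup.keySet();
  }
}
```

With topNGroups = 2 and the hits (a, 3.0), (b, 2.0), (c, 1.0) offered in that order, the third hit is dropped because the threshold has already risen to 2.0 once two groups are retained.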

@jainankitk Are we fine with enabling this by default, or do we need this configurable (e.g. configurable hit threshold)?

Benchmark results from luceneutil for the TermBGroup1M scenario (which combines first-pass and second-pass grouping), using a modified wikimedium.10M.nostopwords.tasks job. This scenario sorts by relevance score.

> grep TermBGroup1M tasks/wikimedium500.tasks > tasks/wikimedium.10M.nostopwords.tasks
> python src/python/localrun.py -source wikimediumall 

Running on m6a.2xlarge using Corretto 24:

                    Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff  p-value
                PKLookup          161.09     (12.8%)                     156.56     (11.2%)   -2.8% ( -23% -   24%)    0.460
            TermBGroup1M           11.47     (14.8%)                      13.44     (13.7%)   17.1% (  -9% -   53%)    0.000

=> ~17% overall performance improvement (first+second pass).

@jpountz I'm getting some rare test failures for TestGrouping caused by the assert canSetMinCompetitiveScore assertion in AssertingScorer#setMinCompetitiveScore, even though the FirstPassGroupingCollector uses ScoreMode.TOP_SCORES in all configurations when it calls Scorable#setMinCompetitiveScore. Is this a known issue?

Reproduce with: gradlew test --tests TestGrouping.testRandom -Dtests.seed=EC2EC279F564DD82 -Dtests.locale=de-AT -Dtests.timezone=America/St_Thomas -Dtests.asserts=true -Dtests.file.encoding=UTF-8

Edit: This seems to be caused by the unit tests instantiating the Weight with either ScoreMode.COMPLETE or ScoreMode.COMPLETE_NO_SCORES, regardless of the ScoreMode of the actual collectors. I updated the code to instantiate a new Weight for every collector, using that collector's ScoreMode.

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

…roupingCollector (apache#15136)

Also adjust CachingCollector behavior to disable minimum competitive scores to allow caching the exhaustive list of hits.
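
One way to picture this adjustment is a wrapper that hides threshold updates from the underlying scorer, so every hit is still produced and can be cached. The sketch below is plain Java with hypothetical names (CachingSketch, ScorerLike, NoThresholdScorer, RecordingScorer, cacheAll); none of them are Lucene APIs, and how the commit actually implements the change is not shown in this conversation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in (not Lucene code): shielding a scorer from score thresholds. */
final class CachingSketch {
  /** Minimal scorer interface, loosely modeled on the role of a scorable object. */
  interface ScorerLike {
    float score();

    void setMinCompetitiveScore(float minScore);
  }

  /** Stand-in for a real scorer: remembers the latest threshold it was given. */
  static final class RecordingScorer implements ScorerLike {
    float current;
    float threshold = Float.NEGATIVE_INFINITY;

    @Override
    public float score() {
      return current;
    }

    @Override
    public void setMinCompetitiveScore(float minScore) {
      threshold = minScore;
    }
  }

  /** Forwards score() but drops threshold updates, so the source never starts skipping. */
  static final class NoThresholdScorer implements ScorerLike {
    private final ScorerLike in;

    NoThresholdScorer(ScorerLike in) {
      this.in = in;
    }

    @Override
    public float score() {
      return in.score();
    }

    @Override
    public void setMinCompetitiveScore(float minScore) {
      // Intentionally ignored: the cache needs the exhaustive list of hits.
    }
  }

  /** Caches every hit the source produces while a downstream consumer sees each hit. */
  static List<Float> cacheAll(
      float[] scores, RecordingScorer source, boolean shield, Consumer<ScorerLike> downstream) {
    ScorerLike view = shield ? new NoThresholdScorer(source) : source;
    List<Float> cached = new ArrayList<>();
    for (float s : scores) {
      if (s < source.threshold) {
        continue; // the source skips hits below the threshold it was given
      }
      source.current = s;
      cached.add(s);
      downstream.accept(view);
    }
    return cached;
  }
}
```

With shield set to true, a downstream consumer that raises the threshold on every hit has no effect on the source, so all hits end up in the cache; with shield set to false, later low-scoring hits would be missing from it.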
github-actions bot added this to the 11.0.0 milestone Sep 23, 2025