Description
Lucene currently has two ways to retrieve the global min/max value of a numeric field across segments:
- PointValues.getMinPackedValue() / PointValues.getMaxPackedValue(): return null when no points exist for the field.
- DocValuesSkipper.globalMinValue() / DocValuesSkipper.globalMaxValue(): return sentinel values (Long.MIN_VALUE or Long.MAX_VALUE) when no data exists or when the skipper is not available for a leaf reader.
These two APIs have different "no data" semantics. PointValues returns null, which callers can check for and handle cleanly. DocValuesSkipper returns sentinel values that callers must know about and filter out. Specifically:
- globalMinValue() returns Long.MAX_VALUE when no segments have the field, and Long.MIN_VALUE when a leaf reader has the field info but no skipper.
- globalMaxValue() returns Long.MIN_VALUE when no segments have the field, and Long.MAX_VALUE when a leaf reader has the field info but no skipper.
This makes it error-prone for callers that need to retrieve min/max values from a field: they must first determine which data structure is available, then call the right API, and then handle the different "no data" conventions. If a caller picks the wrong API or forgets to filter sentinels, invalid values propagate silently.
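To make the filtering burden concrete, here is a minimal sketch of the check a caller has to write today around the skipper's results. It does not call Lucene: the class SentinelFilterSketch, its nested MinMax record, and the method fromSkipperSentinels are hypothetical names, and the two long arguments stand in for the results of globalMinValue() and globalMaxValue(). The sentinel pairs are taken from the description above.

```java
public final class SentinelFilterSketch {

    /** Hypothetical holder for a min/max pair; not part of Lucene. */
    public record MinMax(long min, long max) {}

    /**
     * Turns the sentinel-based results described above into a nullable value.
     * Returns null when the pair matches one of the two "no data" patterns.
     */
    public static MinMax fromSkipperSentinels(long globalMin, long globalMax) {
        // No segment has the field: min is Long.MAX_VALUE and max is Long.MIN_VALUE.
        if (globalMin == Long.MAX_VALUE && globalMax == Long.MIN_VALUE) {
            return null;
        }
        // A leaf reader has the field info but no skipper:
        // min is Long.MIN_VALUE and max is Long.MAX_VALUE.
        // Assumption: this treats a field that genuinely spans the full long
        // range as "no data", because the two cases are indistinguishable here.
        if (globalMin == Long.MIN_VALUE && globalMax == Long.MAX_VALUE) {
            return null;
        }
        return new MinMax(globalMin, globalMax);
    }

    private SentinelFilterSketch() {}
}
```

The second check shows why sentinels are fragile: a caller cannot tell a missing skipper apart from real data that covers the whole long range.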
Proposal
Introduce a unified API for retrieving the global min/max value of a numeric field, abstracting over the underlying data structure. The API should:
- Return null when no data exists, regardless of whether the field uses BKD trees or doc values skippers.
- Automatically delegate to whichever data structure is available for the field.
- Define clear behavior when both structures are available or when neither is available (return null).
A possible solution:
public record MinMax(long min, long max) {}
// Returns null if values cannot be loaded
public static MinMax getGlobalMinMax(IndexReader reader, String field) throws IOException { ... }

This would eliminate the need for callers to know which underlying data structure a field uses and would prevent sentinel values from leaking into application logic.
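One way the delegation could behave is sketched below, without depending on Lucene. The class GlobalMinMaxSketch and the two Supplier parameters are hypothetical stand-ins for "read from points" and "read from the doc values skipper", each assumed to return null when it has no data. Preferring points when both are present is an arbitrary choice made for the sketch, not something the proposal has decided.

```java
import java.util.function.Supplier;

public final class GlobalMinMaxSketch {

    /** Same shape as the record proposed above. */
    public record MinMax(long min, long max) {}

    /**
     * Returns the first non-null result, or null when neither source has data.
     * Both suppliers are hypothetical stand-ins for the underlying structures.
     */
    public static MinMax getGlobalMinMax(Supplier<MinMax> fromPoints,
                                         Supplier<MinMax> fromSkipper) {
        // Assumption: points win when both structures are available.
        MinMax result = fromPoints.get();
        if (result != null) {
            return result;
        }
        // Falls back to the skipper; this is null if it also has no data.
        return fromSkipper.get();
    }

    private GlobalMinMaxSketch() {}
}
```

With this shape a caller only ever checks for null, and the second source is not consulted unless the first one has no data.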