Adding logic for histogram aggregation using skiplist #19130
base: main
Conversation
❌ Gradle check result for 2747c0d: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
```java
final NumericDocValues singleton = DocValues.unwrapSingleton(values);

// If no subaggregations, we can use skip list based collector
if (sub == null && skipper != null) {
```
`sub` won't be null, it'll be equal to `LeafBucketCollector.NO_OP_COLLECTOR`.
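Concretely, the guard might then look like this minimal sketch (the `buildSkipListCollector`/`buildDefaultCollector` helpers and the surrounding `getLeafCollector` context are hypothetical names, not the PR's actual code):

```java
// Sketch of the collector selection, using the NO_OP comparison suggested
// above instead of a null check. Assumes `values`, `sub`, `ctx`, and
// `fieldName` come from the enclosing aggregator.
final NumericDocValues singleton = DocValues.unwrapSingleton(values);
final DocValuesSkipper skipper = ctx.reader().getDocValuesSkipper(fieldName);

// Only take the skip-list path when there is no sub-aggregation work to do,
// the field indexed a skipper, and values are single-valued.
if (sub == LeafBucketCollector.NO_OP_COLLECTOR && skipper != null && singleton != null) {
    return buildSkipListCollector(singleton, skipper); // hypothetical helper
}
return buildDefaultCollector(values, sub); // hypothetical helper
```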
```java
DocValuesSkipper skipper = null;
if (this.fieldName != null) {
    ctx.reader().getDocValuesSkipper(this.fieldName);
```
Do we need to assign here?

```java
skipper = ctx.reader().getDocValuesSkipper(this.fieldName);
```
Yes, made that change as well, let me update this PR.
Good catch, @bowenlan-amzn!
@asimmahmood1 - As discussed offline, I realized that we get the maximum benefit from skip_list when the index itself is sorted on the field the skip_list is built on, so that docIds align with the docValues order for that field. Let us discuss further once we have updated numbers on data indexed with the sort field specified.
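For context, index sorting is configured at index creation time. A minimal example; the `nyc_taxis`-style index and the `dropoff_datetime`/`trip_type` field names are assumptions based on the benchmark results further down:

```json
PUT /nyc_taxis
{
  "settings": {
    "index.sort.field": "dropoff_datetime",
    "index.sort.order": "asc"
  },
  "mappings": {
    "properties": {
      "dropoff_datetime": { "type": "date" },
      "trip_type": { "type": "keyword" }
    }
  }
}
```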
I tested this change with index sort enabled on the aggregated field.

query: (collapsed)

Results: (collapsed)

Let me capture flamegraphs.
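The query body is collapsed above; judging from the `dropoffs_over_time` monthly buckets in the responses further down, it was presumably something along these lines (index and field names assumed):

```json
GET /nyc_taxis/_search
{
  "size": 0,
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}
```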
Hello!
Do these gains also apply with a query filter on a field other than the sorted field?
Tried with a range filter: skiplist ~160ms vs. no skiplist ~500ms. So it definitely helps.
Tested with a filter query on another field: 24ms baseline vs. 15ms candidate. Still trying to debug why the candidate returns slightly lower doc counts, e.g. '2015-01-01 00:00:00' -> 33981 (baseline) vs 33713 (candidate).
Note: trip_type: 1 => 8,075,369 hits, 2 => 194,619 hits

Baseline - 24ms

```json
{
  "took": 24,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        { "key_as_string": "2015-01-01 00:00:00", "key": 1420070400000, "doc_count": 33981 },
        { "key_as_string": "2015-02-01 00:00:00", "key": 1422748800000, "doc_count": 36104 },
        { "key_as_string": "2015-03-01 00:00:00", "key": 1425168000000, "doc_count": 41800 },
        { "key_as_string": "2015-04-01 00:00:00", "key": 1427846400000, "doc_count": 40632 },
        { "key_as_string": "2015-05-01 00:00:00", "key": 1430438400000, "doc_count": 41584 },
        { "key_as_string": "2015-06-01 00:00:00", "key": 1433116800000, "doc_count": 518 }
      ]
    }
  }
}
```

Candidate - 15ms

```json
{
  "took": 15,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        { "key_as_string": "2015-01-01 00:00:00", "key": 1420070400000, "doc_count": 33713 },
        { "key_as_string": "2015-02-01 00:00:00", "key": 1422748800000, "doc_count": 36015 },
        { "key_as_string": "2015-03-01 00:00:00", "key": 1425168000000, "doc_count": 41770 },
        { "key_as_string": "2015-04-01 00:00:00", "key": 1427846400000, "doc_count": 39854 },
        { "key_as_string": "2015-05-01 00:00:00", "key": 1430438400000, "doc_count": 41239 },
        { "key_as_string": "2015-06-01 00:00:00", "key": 1433116800000, "doc_count": 506 }
      ]
    }
  }
}
```
Max agg: applied similar logic to the max aggregation. Took time is the same or slightly higher, but it has the same correctness issue; the baseline returned the correct value.

Baseline - Details (collapsed)

Candidate - Details (collapsed)
I am curious about the changes you made to apply the logic for max aggregation. We should simply be able to pick the max docId from the DISI, as that should have the maximum value.

Also, let us try to fix the correctness issue for the histogram first.
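On the first point, a minimal sketch of the skipper-based shortcut, assuming every live document in the segment matches the query (`queryMatchesAllDocs`, `max`, and the surrounding collect loop are hypothetical context, not the PR's actual code):

```java
// Sketch: segment-wide max straight from the skipper's metadata. Lucene's
// DocValuesSkipper exposes global minValue()/maxValue()/docCount() for the
// whole segment, so no per-document work is needed here.
DocValuesSkipper skipper = ctx.reader().getDocValuesSkipper(fieldName);
if (skipper != null && ctx.reader().numDeletedDocs() == 0 && queryMatchesAllDocs) {
    max = Math.max(max, skipper.maxValue());
    return; // skip per-document collection for this segment
}
```

Equivalently, with an ascending index sort on the field, the last matching docId from the DISI holds the segment's maximum value, which is the suggestion above.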
```java
incrementDocCount.accept(upToBucketIndex, 1L);
} else if (values.advanceExact(doc)) {
    final long value = values.longValue();
    bucketOrds.add(0, preparedRounding.round(value));
```
Missing `incrementDocCount` in this branch; let me update the commit.
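The corrected branch might look like the following sketch; the `-1 - ord` translation assumes `LongKeyedBucketOrds.add` semantics, where a negative return value means the bucket already existed:

```java
} else if (values.advanceExact(doc)) {
    final long value = values.longValue();
    long bucketOrd = bucketOrds.add(0, preparedRounding.round(value));
    if (bucketOrd < 0) {
        // Bucket already existed; convert to the existing ordinal.
        bucketOrd = -1 - bucketOrd;
    }
    // The missing call: matching docs were bucketed but never counted.
    incrementDocCount.accept(bucketOrd, 1L);
}
```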
* sub agg will not be null for single aggregation, but will be NO_OP

TODO: clean up code

Signed-off-by: Asim Mahmood <[email protected]>
After the fix, the doc count values are correct. I've added one unit test to assert correctness, but not that it's using the skip list; let me figure that out. I'll clean up the code and have it ready for review.

Candidate - Details (collapsed)

Baseline - Details (collapsed)
❌ Gradle check result for 860dc3e: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
This PR adds logic for histogram collection using a skip list. The PR is not meant for review yet; it is a proof of concept for how a skip list might help efficiently collect the matching documents for bucket aggregation use cases, as sketched below.
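At a high level, the idea is to use Lucene's `DocValuesSkipper` to count whole blocks of documents into a bucket at once whenever an entire block's `[minValue, maxValue]` range rounds to a single bucket key. A minimal sketch of that loop, assuming a single-valued field, no deletions, and a match-all query; the `BucketCounter` callback and `round` parameter are hypothetical stand-ins for the aggregator's bookkeeping:

```java
import java.io.IOException;
import java.util.function.LongUnaryOperator;

import org.apache.lucene.index.DocValuesSkipper;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

/** Hypothetical callback standing in for the aggregator's doc-count bookkeeping. */
interface BucketCounter {
    void add(long bucketKey, long count);
}

/**
 * Sketch: count a segment into rounding buckets, using the skip list to
 * count whole blocks at once when a block's value range rounds to one key.
 */
static void collectWithSkipper(LeafReader reader, String field,
                               LongUnaryOperator round, BucketCounter counter) throws IOException {
    DocValuesSkipper skipper = reader.getDocValuesSkipper(field);
    NumericDocValues values = reader.getNumericDocValues(field);
    if (skipper == null || values == null) {
        return; // field has no skipper or no doc values in this segment
    }
    int doc = 0;
    while (true) {
        skipper.advance(doc);
        if (skipper.minDocID(0) == DocIdSetIterator.NO_MORE_DOCS) {
            break; // skipper exhausted
        }
        long minBucket = round.applyAsLong(skipper.minValue(0));
        long maxBucket = round.applyAsLong(skipper.maxValue(0));
        if (minBucket == maxBucket) {
            // Whole block falls into one bucket: count it in bulk without
            // visiting a single document.
            counter.add(minBucket, skipper.docCount(0));
        } else {
            // Mixed block: fall back to per-document collection.
            for (int d = Math.max(doc, skipper.minDocID(0)); d <= skipper.maxDocID(0); d++) {
                if (values.advanceExact(d)) {
                    counter.add(round.applyAsLong(values.longValue()), 1L);
                }
            }
        }
        doc = skipper.maxDocID(0) + 1;
    }
}
```

The bulk path only pays off when many consecutive docIds share a bucket, which is why the index-sort discussion in the conversation matters: sorting the index on the field makes each block's value range narrow and contiguous.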
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.