
Conversation

jainankitk (Contributor)

Description

This PR adds logic for histogram collection using a skip list. The PR is not meant for review; it is just a proof of concept for how a skip list might help efficiently collect the matching documents for bucket aggregation use cases.
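For context, here is a minimal sketch of the idea (not this PR's actual code): Lucene's DocValuesSkipper exposes per-block min/max values and doc counts, so whenever an entire level-0 block rounds into a single histogram bucket, the block can be counted wholesale instead of reading doc values per document. The `round` function and `BucketCounter` interface below are illustrative stand-ins for OpenSearch's prepared rounding and bucket-ords bookkeeping.

```java
import java.io.IOException;
import java.util.function.LongUnaryOperator;

import org.apache.lucene.index.DocValuesSkipper;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

class SkipListHistogramSketch {
    /** Stand-in for bucket-ords bookkeeping; illustrative only. */
    interface BucketCounter {
        void count(long bucketKey, long docs);
    }

    static void collect(
        DocIdSetIterator matching,   // docs matching the query
        DocValuesSkipper skipper,    // skip list for the histogram field
        NumericDocValues values,     // per-doc values for the fallback path
        LongUnaryOperator round,     // stand-in for the prepared rounding
        BucketCounter counter
    ) throws IOException {
        int doc = matching.nextDoc();
        while (doc != DocIdSetIterator.NO_MORE_DOCS) {
            skipper.advance(doc);
            final long minKey = round.applyAsLong(skipper.minValue(0));
            final long maxKey = round.applyAsLong(skipper.maxValue(0));
            // Wholesale counting is only safe when the whole level-0 block
            // rounds into one bucket AND every doc in the block matches the
            // query (e.g. match_all); real code must verify the latter.
            if (minKey == maxKey && doc == skipper.minDocID(0)) {
                counter.count(minKey, skipper.docCount(0));
                doc = matching.advance(skipper.maxDocID(0) + 1);
            } else {
                // Fallback: read the value for this one document.
                if (values.advanceExact(doc)) {
                    counter.count(round.applyAsLong(values.longValue()), 1);
                }
                doc = matching.nextDoc();
            }
        }
    }
}
```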

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


❌ Gradle check result for 2747c0d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@asimmahmood1 (Contributor)

Thanks for the draft.

I tested the changes using 20% of the nyc_taxis corpus, so ~4 GB.

Query

curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}'

The aggregation values aren't correct. On latency, the baseline averaged ~600 ms and skiplist ~550 ms, so not a huge difference.

The flame graph shows that the skiplist collector spent most of its time adding to buckets, so your idea of collecting locally until the next bucket would really help.
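A minimal sketch of that "collect locally until the next bucket" idea, reusing the illustrative names and types from the sketch above (not the PR's code):

```java
// Buffer a running count per bucket key and flush on key change, instead of
// calling into bucketOrds/incrementDocCount for every document. Pays off when
// consecutive docs tend to share a bucket, e.g. once the index is sorted on
// the field.
static void collectBuffered(DocIdSetIterator matching, NumericDocValues values,
                            LongUnaryOperator round, BucketCounter counter) throws IOException {
    long currentKey = Long.MIN_VALUE;
    long pending = 0;
    for (int doc = matching.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = matching.nextDoc()) {
        if (values.advanceExact(doc) == false) {
            continue; // doc has no value for the field
        }
        final long key = round.applyAsLong(values.longValue());
        if (key != currentKey && pending > 0) {
            counter.count(currentKey, pending); // flush the finished bucket
            pending = 0;
        }
        currentKey = key;
        pending++;
    }
    if (pending > 0) {
        counter.count(currentKey, pending); // flush the tail
    }
}
```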

Baseline

[flame graph screenshot]

baseline_dateagg_nync.html

Skiplist

[flame graph screenshot]

skiplist_dateagg_nync.html

final NumericDocValues singleton = DocValues.unwrapSingleton(values);

// If no subaggregations, we can use skip list based collector
if (sub == null && skipper != null) {
Contributor:

sub won't be null; it'll be equal to LeafBucketCollector.NO_OP_COLLECTOR.
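That is, the guard presumably needs to be something like this (a sketch, not the committed fix):

```java
// Check for the no-op sub-collector instead of null:
if (sub == LeafBucketCollector.NO_OP_COLLECTOR && skipper != null) {
    // safe to use the skip-list based collection path
}
```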


DocValuesSkipper skipper = null;
if (this.fieldName != null) {
    ctx.reader().getDocValuesSkipper(this.fieldName);
bowenlan-amzn (Member):

Do we need to assign here?
skipper = ctx.reader().getDocValuesSkipper(this.fieldName);

Contributor:

Yes, made that change as well, let me update this PR.

jainankitk (Contributor, Author):

Good catch, @bowenlan-amzn!

@jainankitk (Contributor, Author)

@asimmahmood1 - As discussed offline, I realized that we get the maximum benefit from skip_list when the index itself is sorted on the field the skip_list is built on, so that the docIds are aligned with the docValues for that field. Let us discuss further once we have updated numbers on data indexed with the sort field specified.
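For reference, a minimal sketch of how such an index sort might be enabled at index creation (the index.sort settings are standard; mappings and other settings omitted):

```bash
curl -XPUT "http://localhost:9200/nyc_taxis" -H "Content-Type: application/json" -d '{
  "settings": {
    "index.sort.field": "dropoff_datetime",
    "index.sort.order": "asc"
  }
}'
```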

@asimmahmood1 (Contributor)

asimmahmood1 commented Sep 4, 2025

I tested this change with index sort enabled on dropoff_datetime (nyc_taxis does not have a @timestamp field). There is a major speed-up compared to the doc-values aggregation.

sort": [
                  {
                    "field": "dropoff_datetime",
                    "mode": "min",
                    "missing": "9223372036854775807",
                    "reverse": false
                  }

Query:


curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}'

Results (took, in ms):

| baseline (bkd) | histo (skiplist) | histo (no skiplist) |
| --- | --- | --- |
| 5 | 11 | 630 |

Let me capture flamegraphs.

@asimmahmood1 moved this from Todo to In Progress in Performance Roadmap (Sep 4, 2025)
@asimmahmood1 added labels: Search:Aggregations, Performance, Search:Performance (Sep 4, 2025)
Copy link
Contributor

github-actions bot commented Sep 4, 2025

Hello!
We have added a performance benchmark workflow that is triggered by adding a comment on the PR.
Please refer to https://github.com/opensearch-project/OpenSearch/blob/main/PERFORMANCE_BENCHMARKS.md for how to run benchmarks on pull requests.

1 similar comment

@jainankitk (Contributor, Author)

jainankitk commented Sep 4, 2025

Do these gains also apply with a query filter on a field other than dropoff_datetime? That is the case where the BKD-based optimization cannot be leveraged, but skip_list might still provide a benefit. I would expect a similar speed-up, but it will be good to validate.

@asimmahmood1 (Contributor)

asimmahmood1 commented Sep 5, 2025

Tried with a range filter: skiplist ~160 ms vs. no skiplist ~500 ms. So it definitely helps.

curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}'

@asimmahmood1 (Contributor)

asimmahmood1 commented Sep 18, 2025

Tested with a filter query on another field: 24 ms baseline vs. 15 ms candidate.

Still trying to debug why the candidate has slightly lower values, e.g. for '2015-01-01 00:00:00': 33981 vs. 33713.

curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "term": {
      "trip_type": 2
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}'

Note: trip_type 1 => 8,075,369 hits; trip_type 2 => 194,619 hits.

Baseline - 24 ms

Details

```
{
  "took": 24,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        { "key_as_string": "2015-01-01 00:00:00", "key": 1420070400000, "doc_count": 33981 },
        { "key_as_string": "2015-02-01 00:00:00", "key": 1422748800000, "doc_count": 36104 },
        { "key_as_string": "2015-03-01 00:00:00", "key": 1425168000000, "doc_count": 41800 },
        { "key_as_string": "2015-04-01 00:00:00", "key": 1427846400000, "doc_count": 40632 },
        { "key_as_string": "2015-05-01 00:00:00", "key": 1430438400000, "doc_count": 41584 },
        { "key_as_string": "2015-06-01 00:00:00", "key": 1433116800000, "doc_count": 518 }
      ]
    }
  }
}
```

Candidate - 15 ms

Details

```
{
  "took": 15,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        { "key_as_string": "2015-01-01 00:00:00", "key": 1420070400000, "doc_count": 33713 },
        { "key_as_string": "2015-02-01 00:00:00", "key": 1422748800000, "doc_count": 36015 },
        { "key_as_string": "2015-03-01 00:00:00", "key": 1425168000000, "doc_count": 41770 },
        { "key_as_string": "2015-04-01 00:00:00", "key": 1427846400000, "doc_count": 39854 },
        { "key_as_string": "2015-05-01 00:00:00", "key": 1430438400000, "doc_count": 41239 },
        { "key_as_string": "2015-06-01 00:00:00", "key": 1433116800000, "doc_count": 506 }
      ]
    }
  }
}
```

@asimmahmood1 (Contributor)

Max Agg

Similar logic applied to the max agg.

Took time is the same or slightly higher, but there is the same correctness issue:

Baseline returned 2015-06-02 05:54:31, but the candidate returned a higher value, 2015-06-20 10:17:04.

The baseline value is correct:

grep '2015-06-02 05:54:31' ~/.benchmark/benchmarks/data/nyc_taxis/documents.json
{"pickup_datetime": "2015-06-01 07:44:07", "vendor_id": "2", "pickup_location": [-73.92903137207031, 40.80078125], "extra": 0.0, "total_amount": 36.0, "rate_code_id": "5", "fare_amount": 30.0, "improvement_surcharge": 0.0, "trip_distance": 6.23, "mta_tax": 0.0, "tolls_amount": 0.0, "payment_type": "1", "store_and_fwd_flag": "N", "trip_type": "2", "passenger_count": 1, "tip_amount": 6.0, "dropoff_location": [-73.86468505859375, 40.7701530456543], "dropoff_datetime": "2015-06-02 05:54:31"}
#!/bin/bash

curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "term": {
      "trip_type": 2
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "max": {
        "field": "dropoff_datetime"
      }
    }
  }
}'

Baseline

Details
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dropoffs_over_time": {
      "value": 1.433224471E+12,
      "value_as_string": "2015-06-02 05:54:31"
    }
  }
}

Candidate

Details
{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dropoffs_over_time": {
      "value": 1.434795424E+12,
      "value_as_string": "2015-06-20 10:17:04"
    }
  }
}

@jainankitk (Contributor, Author)

> Similar logic applied to the max agg.
> Took time is the same or slightly higher, but there is the same correctness issue:
> Baseline returned 2015-06-02 05:54:31, but the candidate returned a higher value, 2015-06-20 10:17:04.

I am curious about the changes you made to apply the logic to the max aggregation. We should simply be able to pick the max docId from the DISI, since that doc should have the maximum dropoff_datetime; this should be super fast because the docIds are aligned with the dropoff_datetime values. Am I missing something here?
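Roughly what I have in mind, as a per-segment sketch (assuming an ascending index sort on the field; a real implementation would use the skip list to jump to the end rather than scan, and would still merge maxima across segments). `matching` and `values` are the illustrative names used earlier:

```java
// Per-segment sketch: with an ascending index sort on dropoff_datetime, the
// last matching doc carries the segment's maximum value, so a single
// doc-values read suffices.
static long segmentMax(DocIdSetIterator matching, NumericDocValues values) throws IOException {
    int last = -1;
    for (int doc = matching.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = matching.nextDoc()) {
        last = doc; // O(n) scan here only for clarity
    }
    if (last != -1 && values.advanceExact(last)) {
        return values.longValue();
    }
    return Long.MIN_VALUE; // no matching doc with a value
}
```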

@jainankitk (Contributor, Author)

Also, let us try to fix the correctness of DateHistogram before trying other aggregation types.

    incrementDocCount.accept(upToBucketIndex, 1L);
} else if (values.advanceExact(doc)) {
    final long value = values.longValue();
    bucketOrds.add(0, preparedRounding.round(value));
Contributor:

Missing incrementDocCount in this branch, let me update the commit.
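Presumably the fixed branch looks something like this (a sketch, not the committed change; the negative-ord handling follows OpenSearch's LongKeyedBucketOrds convention, where add returns a negative encoding for an already-existing bucket):

```java
} else if (values.advanceExact(doc)) {
    final long value = values.longValue();
    long bucketOrd = bucketOrds.add(0, preparedRounding.round(value));
    if (bucketOrd < 0) {
        bucketOrd = -1 - bucketOrd; // bucket already existed
    }
    incrementDocCount.accept(bucketOrd, 1L); // the missing increment
}
```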

* sub agg will not be null for single aggregation, but will be NO_OP

TODO: clean up code

Signed-off-by: Asim Mahmood <[email protected]>
@asimmahmood1 (Contributor)

asimmahmood1 commented Sep 19, 2025

After the fix, the doc count values are correct.

I've added one unit test to assert correctness, though it doesn't yet assert that the skip list is actually used; let me figure that out.

I'll clean up the code and have it ready for review.

Candidate

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-01-01 00:00:00",
          "key": 1420070400000,
          "doc_count": 33981
        },
        {
          "key_as_string": "2015-02-01 00:00:00",
          "key": 1422748800000,
          "doc_count": 36104
        },
        {
          "key_as_string": "2015-03-01 00:00:00",
          "key": 1425168000000,
          "doc_count": 41800
        },
        {
          "key_as_string": "2015-04-01 00:00:00",
          "key": 1427846400000,
          "doc_count": 40632
        },
        {
          "key_as_string": "2015-05-01 00:00:00",
          "key": 1430438400000,
          "doc_count": 41584
        },
        {
          "key_as_string": "2015-06-01 00:00:00",
          "key": 1433116800000,
          "doc_count": 518
        }
      ]
    }
  }
}

Baseline

{
  "took": 24,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-01-01 00:00:00",
          "key": 1420070400000,
          "doc_count": 33981
        },
        {
          "key_as_string": "2015-02-01 00:00:00",
          "key": 1422748800000,
          "doc_count": 36104
        },
        {
          "key_as_string": "2015-03-01 00:00:00",
          "key": 1425168000000,
          "doc_count": 41800
        },
        {
          "key_as_string": "2015-04-01 00:00:00",
          "key": 1427846400000,
          "doc_count": 40632
        },
        {
          "key_as_string": "2015-05-01 00:00:00",
          "key": 1430438400000,
          "doc_count": 41584
        },
        {
          "key_as_string": "2015-06-01 00:00:00",
          "key": 1433116800000,
          "doc_count": 518
        }
      ]
    }
  }
}


❌ Gradle check result for 860dc3e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
