
Conversation

jainankitk (Contributor)

Description

This PR adds logic for histogram collection using a skip list. The PR is not meant for review; it is just a proof of concept for how a skip list might help efficiently collect the matching documents for bucket aggregation use cases.
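For context, here is a minimal sketch of the idea (not this PR's actual code): Lucene's DocValuesSkipper exposes per-block min/max values and doc counts, so whenever an entire level-0 block rounds into a single histogram bucket, the block can be counted wholesale instead of reading doc values per document. The `round` function and `BucketCounter` interface below are illustrative stand-ins for OpenSearch's prepared rounding and bucket-ords bookkeeping.

```java
import java.io.IOException;
import java.util.function.LongUnaryOperator;

import org.apache.lucene.index.DocValuesSkipper;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

class SkipListHistogramSketch {
    /** Stand-in for bucket-ords bookkeeping; illustrative only. */
    interface BucketCounter {
        void count(long bucketKey, long docs);
    }

    static void collect(
        DocIdSetIterator matching,   // docs matching the query
        DocValuesSkipper skipper,    // skip list for the histogram field
        NumericDocValues values,     // per-doc values for the fallback path
        LongUnaryOperator round,     // stand-in for the prepared rounding
        BucketCounter counter
    ) throws IOException {
        int doc = matching.nextDoc();
        while (doc != DocIdSetIterator.NO_MORE_DOCS) {
            skipper.advance(doc);
            final long minKey = round.applyAsLong(skipper.minValue(0));
            final long maxKey = round.applyAsLong(skipper.maxValue(0));
            // Wholesale counting is only safe when the whole level-0 block
            // rounds into one bucket AND every doc in the block matches the
            // query (e.g. match_all); real code must verify the latter.
            if (minKey == maxKey && doc == skipper.minDocID(0)) {
                counter.count(minKey, skipper.docCount(0));
                doc = matching.advance(skipper.maxDocID(0) + 1);
            } else {
                // Fallback: read the value for this one document.
                if (values.advanceExact(doc)) {
                    counter.count(round.applyAsLong(values.longValue()), 1);
                }
                doc = matching.nextDoc();
            }
        }
    }
}
```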

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


❌ Gradle check result for 2747c0d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@asimmahmood1 (Contributor)

Thanks for the draft.

I tested the changes using 20% of the nyc_taxis corpus, so ~4 GB.

Query

curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}'

The aggregation values aren't correct. On latency, the baseline averaged ~600 ms and skiplist ~550 ms, so not a huge difference.

The flame graph shows that the skiplist collector spent most of its time adding to buckets, so your idea of collecting locally until the next bucket would really help.
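A minimal sketch of that "collect locally until the next bucket" idea, reusing the illustrative names and types from the sketch above (not the PR's code):

```java
// Buffer a running count per bucket key and flush on key change, instead of
// calling into bucketOrds/incrementDocCount for every document. Pays off when
// consecutive docs tend to share a bucket, e.g. once the index is sorted on
// the field.
static void collectBuffered(DocIdSetIterator matching, NumericDocValues values,
                            LongUnaryOperator round, BucketCounter counter) throws IOException {
    long currentKey = Long.MIN_VALUE;
    long pending = 0;
    for (int doc = matching.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = matching.nextDoc()) {
        if (values.advanceExact(doc) == false) {
            continue; // doc has no value for the field
        }
        final long key = round.applyAsLong(values.longValue());
        if (key != currentKey && pending > 0) {
            counter.count(currentKey, pending); // flush the finished bucket
            pending = 0;
        }
        currentKey = key;
        pending++;
    }
    if (pending > 0) {
        counter.count(currentKey, pending); // flush the tail
    }
}
```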

Baseline

[flame graph screenshot]

baseline_dateagg_nync.html

Skiplist

[flame graph screenshot]

skiplist_dateagg_nync.html

final NumericDocValues singleton = DocValues.unwrapSingleton(values);

// If no subaggregations, we can use skip list based collector
if (sub == null && skipper != null) {
Contributor:

sub won't be null; it'll be equal to LeafBucketCollector.NO_OP_COLLECTOR.
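That is, the guard presumably needs to be something like this (a sketch, not the committed fix):

```java
// Check for the no-op sub-collector instead of null:
if (sub == LeafBucketCollector.NO_OP_COLLECTOR && skipper != null) {
    // safe to use the skip-list based collection path
}
```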


DocValuesSkipper skipper = null;
if (this.fieldName != null) {
    ctx.reader().getDocValuesSkipper(this.fieldName);
bowenlan-amzn (Member):

Do we need to assign here?
skipper = ctx.reader().getDocValuesSkipper(this.fieldName);

Contributor:

Yes, made that change as well, let me update this PR.

jainankitk (Contributor, Author):

Good catch, @bowenlan-amzn!

@jainankitk (Contributor, Author)

@asimmahmood1 - As discussed offline, I realized that we get the maximum benefit from skip_list when the index itself is sorted on the field the skip_list is built on, so that the docIds are aligned with the docValues for that field. Let us discuss further once we have updated numbers on data indexed with the sort field specified.
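For reference, a minimal sketch of how such an index sort might be enabled at index creation (the index.sort settings are standard; mappings and other settings omitted):

```bash
curl -XPUT "http://localhost:9200/nyc_taxis" -H "Content-Type: application/json" -d '{
  "settings": {
    "index.sort.field": "dropoff_datetime",
    "index.sort.order": "asc"
  }
}'
```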

@asimmahmood1 (Contributor)

asimmahmood1 commented Sep 4, 2025

I tested this change with index sort enabled on dropoff_datetime (nyc_taxis does not have a @timestamp field). There is a major speed-up compared to the doc-values aggregation.

sort": [
                  {
                    "field": "dropoff_datetime",
                    "mode": "min",
                    "missing": "9223372036854775807",
                    "reverse": false
                  }

Query:


curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}'

Results (took, in ms):

| baseline (bkd) | histo (skiplist) | histo (no skiplist) |
| --- | --- | --- |
| 5 | 11 | 630 |

Let me capture flamegraphs.

@asimmahmood1 moved this from Todo to In Progress in Performance Roadmap (Sep 4, 2025)
@asimmahmood1 added labels: Search:Aggregations, Performance, Search:Performance (Sep 4, 2025)
Copy link
Contributor

github-actions bot commented Sep 4, 2025

Hello!
We have added a performance benchmark workflow that is triggered by adding a comment on the PR.
Please refer to https://github.com/opensearch-project/OpenSearch/blob/main/PERFORMANCE_BENCHMARKS.md for how to run benchmarks on pull requests.

1 similar comment

@jainankitk (Contributor, Author)

jainankitk commented Sep 4, 2025

Do these gains also apply with a query filter on a field other than dropoff_datetime? That is the case where the BKD-based optimization cannot be leveraged, but skip_list might still provide a benefit. I would expect a similar speed-up, but it will be good to validate.

@asimmahmood1 (Contributor)

asimmahmood1 commented Sep 5, 2025

Tried with a range filter: skiplist ~160 ms vs. no skiplist ~500 ms. So it definitely helps.

curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}'

@asimmahmood1 (Contributor)

asimmahmood1 commented Sep 18, 2025

Tested with a filter query on another field: 24 ms baseline vs. 15 ms candidate.

Still trying to debug why the candidate has slightly lower values, e.g. for '2015-01-01 00:00:00': 33981 vs. 33713.

curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "term": {
      "trip_type": 2
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}'

Note: trip_type 1 => 8,075,369 hits; trip_type 2 => 194,619 hits.

Baseline - 24 ms

Details

```
{
  "took": 24,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        { "key_as_string": "2015-01-01 00:00:00", "key": 1420070400000, "doc_count": 33981 },
        { "key_as_string": "2015-02-01 00:00:00", "key": 1422748800000, "doc_count": 36104 },
        { "key_as_string": "2015-03-01 00:00:00", "key": 1425168000000, "doc_count": 41800 },
        { "key_as_string": "2015-04-01 00:00:00", "key": 1427846400000, "doc_count": 40632 },
        { "key_as_string": "2015-05-01 00:00:00", "key": 1430438400000, "doc_count": 41584 },
        { "key_as_string": "2015-06-01 00:00:00", "key": 1433116800000, "doc_count": 518 }
      ]
    }
  }
}
```

Candidate - 15 ms

Details

```
{
  "took": 15,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        { "key_as_string": "2015-01-01 00:00:00", "key": 1420070400000, "doc_count": 33713 },
        { "key_as_string": "2015-02-01 00:00:00", "key": 1422748800000, "doc_count": 36015 },
        { "key_as_string": "2015-03-01 00:00:00", "key": 1425168000000, "doc_count": 41770 },
        { "key_as_string": "2015-04-01 00:00:00", "key": 1427846400000, "doc_count": 39854 },
        { "key_as_string": "2015-05-01 00:00:00", "key": 1430438400000, "doc_count": 41239 },
        { "key_as_string": "2015-06-01 00:00:00", "key": 1433116800000, "doc_count": 506 }
      ]
    }
  }
}
```

@asimmahmood1 (Contributor)

Max Agg

Similar logic applied to the max agg.

Took time is the same or slightly higher, but there is the same correctness issue:

Baseline returned 2015-06-02 05:54:31, but the candidate returned a higher value, 2015-06-20 10:17:04.

The baseline value is correct:

grep '2015-06-02 05:54:31' ~/.benchmark/benchmarks/data/nyc_taxis/documents.json
{"pickup_datetime": "2015-06-01 07:44:07", "vendor_id": "2", "pickup_location": [-73.92903137207031, 40.80078125], "extra": 0.0, "total_amount": 36.0, "rate_code_id": "5", "fare_amount": 30.0, "improvement_surcharge": 0.0, "trip_distance": 6.23, "mta_tax": 0.0, "tolls_amount": 0.0, "payment_type": "1", "store_and_fwd_flag": "N", "trip_type": "2", "passenger_count": 1, "tip_amount": 6.0, "dropoff_location": [-73.86468505859375, 40.7701530456543], "dropoff_datetime": "2015-06-02 05:54:31"}
#!/bin/bash

curl -XGET "http://localhost:9200/nyc_taxis/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": {
    "term": {
      "trip_type": 2
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "max": {
        "field": "dropoff_datetime"
      }
    }
  }
}'

Baseline

Details
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dropoffs_over_time": {
      "value": 1.433224471E+12,
      "value_as_string": "2015-06-02 05:54:31"
    }
  }
}

Candidate

Details
{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dropoffs_over_time": {
      "value": 1.434795424E+12,
      "value_as_string": "2015-06-20 10:17:04"
    }
  }
}

@jainankitk (Contributor, Author)

> Similar logic applied to the max agg.
> Took time is the same or slightly higher, but there is the same correctness issue:
> Baseline returned 2015-06-02 05:54:31, but the candidate returned a higher value, 2015-06-20 10:17:04.

I am curious about the changes you made to apply the logic to the max aggregation. We should simply be able to pick the max docId from the DISI, since that doc should have the maximum dropoff_datetime; this should be super fast because the docIds are aligned with the dropoff_datetime values. Am I missing something here?
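Roughly what I have in mind, as a per-segment sketch (assuming an ascending index sort on the field; a real implementation would use the skip list to jump to the end rather than scan, and would still merge maxima across segments). `matching` and `values` are the illustrative names used earlier:

```java
// Per-segment sketch: with an ascending index sort on dropoff_datetime, the
// last matching doc carries the segment's maximum value, so a single
// doc-values read suffices.
static long segmentMax(DocIdSetIterator matching, NumericDocValues values) throws IOException {
    int last = -1;
    for (int doc = matching.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = matching.nextDoc()) {
        last = doc; // O(n) scan here only for clarity
    }
    if (last != -1 && values.advanceExact(last)) {
        return values.longValue();
    }
    return Long.MIN_VALUE; // no matching doc with a value
}
```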

@jainankitk (Contributor, Author)

Also, let us try to fix the correctness of DateHistogram before trying other aggregation types.

    incrementDocCount.accept(upToBucketIndex, 1L);
} else if (values.advanceExact(doc)) {
    final long value = values.longValue();
    bucketOrds.add(0, preparedRounding.round(value));
Contributor:

Missing incrementDocCount in this branch, let me update the commit.
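Presumably the fixed branch looks something like this (a sketch, not the committed change; the negative-ord handling follows OpenSearch's LongKeyedBucketOrds convention, where add returns a negative encoding for an already-existing bucket):

```java
} else if (values.advanceExact(doc)) {
    final long value = values.longValue();
    long bucketOrd = bucketOrds.add(0, preparedRounding.round(value));
    if (bucketOrd < 0) {
        bucketOrd = -1 - bucketOrd; // bucket already existed
    }
    incrementDocCount.accept(bucketOrd, 1L); // the missing increment
}
```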

* sub agg will not be null for single aggregation, but will be NO_OP

TODO: clean up code

Signed-off-by: Asim Mahmood <[email protected]>
@asimmahmood1 (Contributor)

asimmahmood1 commented Sep 19, 2025

After the fix, the doc count values are correct.

I've added one unit test to assert correctness, though it doesn't yet assert that the skip list is actually used; let me figure that out.

I'll clean up the code and have it ready for review.

Candidate

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-01-01 00:00:00",
          "key": 1420070400000,
          "doc_count": 33981
        },
        {
          "key_as_string": "2015-02-01 00:00:00",
          "key": 1422748800000,
          "doc_count": 36104
        },
        {
          "key_as_string": "2015-03-01 00:00:00",
          "key": 1425168000000,
          "doc_count": 41800
        },
        {
          "key_as_string": "2015-04-01 00:00:00",
          "key": 1427846400000,
          "doc_count": 40632
        },
        {
          "key_as_string": "2015-05-01 00:00:00",
          "key": 1430438400000,
          "doc_count": 41584
        },
        {
          "key_as_string": "2015-06-01 00:00:00",
          "key": 1433116800000,
          "doc_count": 518
        }
      ]
    }
  }
}

Baseline

{
  "took": 24,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dropoffs_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-01-01 00:00:00",
          "key": 1420070400000,
          "doc_count": 33981
        },
        {
          "key_as_string": "2015-02-01 00:00:00",
          "key": 1422748800000,
          "doc_count": 36104
        },
        {
          "key_as_string": "2015-03-01 00:00:00",
          "key": 1425168000000,
          "doc_count": 41800
        },
        {
          "key_as_string": "2015-04-01 00:00:00",
          "key": 1427846400000,
          "doc_count": 40632
        },
        {
          "key_as_string": "2015-05-01 00:00:00",
          "key": 1430438400000,
          "doc_count": 41584
        },
        {
          "key_as_string": "2015-06-01 00:00:00",
          "key": 1433116800000,
          "doc_count": 518
        }
      ]
    }
  }
}


❌ Gradle check result for 860dc3e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
