
Connection pipeline cache will grow without shrinking #4461

Open
kostasrim opened this issue Jan 15, 2025 · 1 comment
Labels: bug (Something isn't working)

kostasrim (Contributor) commented Jan 15, 2025

While investigating an RSS vs. used-memory gap in a datastore, we saw that Dragonfly's pipeline cache bytes were high while the dispatch queue was empty:

pipeline_cache_bytes:3756692020
dispatch_queue_bytes:0

@kostasrim's initial investigation leads:
We have a corner case that can be described as follows.

We call ShrinkPipelinePool(), which "gradually releases the pipeline messages in the cache", and the way we do that is:

if (free_req_release_weight > stats_->num_conns) {
  // Enough "weight" has accumulated across this thread's connections,
  // so release one cached message from the pipeline pool.
  pipeline_req_pool_.pop_back();
}

The problem is that each time one of the connections dispatches asynchronously, we reset free_req_release_weight to 0 (free_req_release_weight is a thread-local variable).

So a workload can dispatch a lot of commands asynchronously, making the cache grow large, and then we can end up in this weird corner-case loop:

Let's say we have n connections:

  1. n - 1 connections dispatch synchronously, and the call to ShrinkPipelinePool() does nothing (the weight must exceed the number of connections before we shrink).
  2. Only one connection dispatches asynchronously -> resets free_req_release_weight to 0 -> now the weight must again reach n before the cache can shrink.

Steps 1 and 2 can loop endlessly, depending on the workload. From my understanding, only one connection out of n needs to do this, and we won't shrink as long as it keeps happening.

With a large connection pool, a single "bad actor" is enough to cause this endless loop, and I guess the probability of hitting it increases with the number of connections, since it becomes more likely that at least one of them dispatches asynchronously. A small simulation of the scenario is sketched below.
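A minimal, self-contained simulation of this corner case (hypothetical code, not taken from Dragonfly; the names mirror the snippet above, and the dispatch pattern is the one described in steps 1-2):

// Hypothetical simulation of the scenario above -- not Dragonfly code.
#include <cstdio>
#include <deque>

int main() {
  const unsigned num_conns = 100;              // n connections on this thread
  std::deque<int> pipeline_req_pool(1000, 0);  // cache that we would like to shrink
  unsigned free_req_release_weight = 0;        // thread-local in the real code

  for (int round = 0; round < 10000; ++round) {
    // Step 1: n - 1 connections dispatch synchronously; each bumps the weight
    // and tries to shrink, but the threshold (> num_conns) is never reached.
    for (unsigned c = 0; c + 1 < num_conns; ++c) {
      ++free_req_release_weight;
      if (free_req_release_weight > num_conns && !pipeline_req_pool.empty())
        pipeline_req_pool.pop_back();
    }
    // Step 2: one connection dispatches asynchronously and resets the weight.
    free_req_release_weight = 0;
  }

  // Prints 1000: the pool never shrank, no matter how many rounds we run.
  std::printf("pool size after 10000 rounds: %zu\n", pipeline_req_pool.size());
  return 0;
}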

romange (Collaborator) commented Jan 23, 2025

if (free_req_release_weight > stats_->num_conns) is used to pace the shrinkage, but we never check how many items in the pool are actually used. Another approach could be to use a watermark: periodically measure the maximum number of pending items over some period, or equivalently the minimum number of items left in pipeline_req_pool_ during that period, and if that number is greater than 0, i.e. the pool was not fully utilised, pop an element from it.
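A rough sketch of that watermark idea (hypothetical names and structure, assuming some periodic housekeeping hook exists; this is an illustration of the approach, not a patch against connection.cc):

// Hypothetical sketch of the watermark-based shrink -- not an actual patch.
#include <algorithm>
#include <cstddef>
#include <deque>

struct PipelineCacheSketch {
  std::deque<int> pipeline_req_pool_;
  size_t min_pool_size_ = 0;  // low watermark within the current period

  // Called whenever a request object is taken from the pool.
  void Borrow() {
    if (!pipeline_req_pool_.empty())
      pipeline_req_pool_.pop_back();
    min_pool_size_ = std::min(min_pool_size_, pipeline_req_pool_.size());
  }

  // Called whenever a request object is returned to the pool.
  void Return(int req) {
    pipeline_req_pool_.push_back(req);
  }

  // Called periodically (e.g. once per second). If the pool was never fully
  // drained during the period (low watermark > 0), it is over-provisioned,
  // so drop one element; then start a new measurement period.
  void PeriodicShrink() {
    if (min_pool_size_ > 0 && !pipeline_req_pool_.empty())
      pipeline_req_pool_.pop_back();
    min_pool_size_ = pipeline_req_pool_.size();
  }
};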
