While investigating a gap between RSS and used memory in a datastore, we saw that Dragonfly's pipeline cache bytes were high; note that the dispatch queue was 0:
```
pipeline_cache_bytes:3756692020
dispatch_queue_bytes:0
```
@kostasrim's initial investigation leads:
We have a corner case that can be described as:
We call ShrinkPipelinePool(), which "gradually releases the pipeline messages in the cache", and the way we do that is:
```cpp
if (free_req_release_weight > stats_->num_conns) {
  // ...
  pipeline_req_pool_.pop_back();  // release one item from the pipeline cache
}
```
The problem is that each time one of the connections dispatches asynchronously, we reset free_req_release_weight to 0 (free_req_release_weight is a thread-local).
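For context, here is a minimal sketch of how the pacing counter behaves, based only on the description above; OnAsyncDispatch() is an illustrative name for whatever the asynchronous dispatch path does, not an actual Dragonfly function:

```cpp
// Minimal sketch (not the actual Dragonfly source) of the pacing behaviour
// described above: synchronous dispatches slowly build up the thread-local
// weight, while a single asynchronous dispatch resets it to zero.
thread_local uint32_t free_req_release_weight = 0;

void Connection::ShrinkPipelinePool() {  // called on the synchronous path
  if (pipeline_req_pool_.empty())
    return;

  ++free_req_release_weight;             // one step of "gradual" release pacing
  if (free_req_release_weight > stats_->num_conns) {
    free_req_release_weight = 0;
    pipeline_req_pool_.pop_back();       // release one cached pipeline message
  }
}

void Connection::OnAsyncDispatch() {     // illustrative name for the async path
  free_req_release_weight = 0;           // resets the pacing for the whole thread
}
```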
So it could be the case that a workload dispatches a lot of commands asynchronously, so the cache grows large enough, but then we end up in this weird corner-case loop:
Let's say we have n connections:

1. n - 1 connections dispatch synchronously, and the call to ShrinkPipelinePool() does nothing (we need the weight to exceed n before we shrink).
2. Only one connection dispatches asynchronously -> resets free_req_release_weight to 0 -> now we again need n connections' worth of synchronous dispatches before we can shrink the cache.

This is an endless loop between 1 and 2, depending on the workload. From my understanding, only one connection out of n has to do this, and we won't shrink as long as it keeps happening.
On a large connection pool, a single "bad actor" is enough to cause this endless loop, and the probability of hitting it increases with the number of connections, since it becomes more likely that at least one of them will dispatch asynchronously; the simulation sketch below illustrates this.
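To put numbers on it, here is a small, self-contained simulation of the scenario above. It only models the counter behaviour described in this issue (increment per synchronous dispatch, reset on an asynchronous one); the constants are arbitrary:

```cpp
#include <cstdint>
#include <iostream>

// Toy model of the pacing counter: with n connections, a single connection
// that dispatches asynchronously once per "round" of n - 1 sync dispatches
// keeps resetting the weight, so it never exceeds n and nothing is released.
int main() {
  const uint32_t num_conns = 1000;  // n connections in the thread
  uint32_t weight = 0;              // models free_req_release_weight
  uint32_t releases = 0;            // how many cached messages were released

  for (int round = 0; round < 10'000; ++round) {
    // n - 1 connections dispatch synchronously: each call bumps the weight
    // and checks whether it may release one cached pipeline message.
    for (uint32_t c = 0; c < num_conns - 1; ++c) {
      if (++weight > num_conns) {
        weight = 0;
        ++releases;
      }
    }
    // One connection dispatches asynchronously and resets the pacing counter.
    weight = 0;
  }

  std::cout << "releases after 10000 rounds: " << releases << '\n';  // prints 0
}
```

With the single asynchronous reset removed, the same loop releases roughly one cached message every n + 1 synchronous dispatches, which is the gradual shrinkage the code intends.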
if (free_req_release_weight > stats_->num_conns) is used to pace the shrinkage, but we never check how many items in the pool are actually in use. Another approach could be to use a watermark, i.e. periodically measure the maximum number of pending items over some period, or equivalently the minimum number of items left in pipeline_req_pool_ during that period, and if that number is greater than 0, i.e. the pool was not fully utilised, pop an element from it; a sketch of this idea follows.
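A rough sketch of that watermark idea, under the assumption that we can observe the pool size whenever it changes and run a periodic check; the type and member names (PipelinePoolWatermark, Observe, MaybeShrink) are illustrative, not existing code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Sketch of the watermark approach: min_pool_size records the smallest size
// pipeline_req_pool_ reached during the current measurement period. If it
// never reached 0, the pool was not fully utilised and we can release an item.
struct PipelinePoolWatermark {
  size_t min_pool_size = SIZE_MAX;  // SIZE_MAX == no observation yet

  // Call whenever the pool size changes (an item is taken or returned).
  void Observe(size_t pool_size) {
    min_pool_size = std::min(min_pool_size, pool_size);
  }

  // Call periodically (e.g. from a timer), independently of how the
  // individual connections happen to dispatch.
  template <typename Pool> void MaybeShrink(Pool& pipeline_req_pool) {
    if (min_pool_size != SIZE_MAX && min_pool_size > 0 &&
        !pipeline_req_pool.empty()) {
      pipeline_req_pool.pop_back();  // the pool was over-provisioned
    }
    min_pool_size = pipeline_req_pool.size();  // start a new period
  }
};
```

Because the decision is driven by actual pool utilisation rather than by which connections happened to dispatch asynchronously, a single "bad actor" connection can no longer prevent the cache from shrinking.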