Optimize zset memory usage by embedding element in skiplist #2508
base: unstable
Conversation
Signed-off-by: chzhoo <[email protected]>
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

```diff
@@            Coverage Diff            @@
##           unstable    #2508   +/-  ##
============================================
+ Coverage     72.01%   72.24%   +0.22%
============================================
  Files           126      126
  Lines         70448    70683     +235
============================================
+ Hits          50734    51063     +329
+ Misses        19714    19620      -94
```
I like the direction of this. I would like, however, to share some of my initial thoughts on it: Now, I like your solution as it is optimal in memory saving. I would suggest considering extending the entry as I initially did. @zuiderkwast / @madolson what are your thoughts?
@ranshid Is the idea to use
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: chzhoo <[email protected]>
Yes, that was my point as well, but I wanted to point out this option so that we can make a correct decision. I am fine with going with the suggested solution. I will find the time to review the details.
@vitarb You might be interested in reviewing this.
@chzhoo I agree with this approach. 👍
Regarding this (future optimization idea): I wouldn't worry too much about it, unless you have a smart idea that I haven't thought of yet(?). The height of the node is usually low: roughly 3/4 of the nodes have only one level, 3/4 of the remaining nodes have only two levels, and so forth. For nodes with only one level, the score, backward-pointer, forward-pointer and span make up 32 bytes. We have 32 bytes left for the element in this cache line. That means we can fit an element of length 29 (excluding the sds header and nul-terminator). For nodes with two levels, we can fit an element of length 13 bytes. I think that seems acceptable, unless you have another smart idea. :)
The memory layout of [diagram]. A memory layout with potential performance improvements, referencing RocksDB's inline-skiplist: [diagram]. The reason I believe the second layout holds potential for optimization lies in the process of accessing the
The above is merely my hypothesis. I will investigate and validate this further, and look forward to continuing the discussion in a separate PR. @zuiderkwast
I have skimmed through this change. I didn't do a very careful review of all the details, but I believe this PR looks great and can soon be ready to be merged.
> If this change looks acceptable, I will look into implementing some tests in the next round.
Yes, please go ahead. What tests do you have in mind?
I do believe we have existing tests for sorted sets that will cover this code.
I've looked into the test coverage, and you're correct that existing tests for sorted sets cover most of the updated code. The one exception is the updated code in
The unit tests have been submitted, and the code is now in a review-ready state.
```c
zslSetNodeHeight(zn, height);
if (ele) {
    char *data = ((char *)(zn + 1)) + height * sizeof(struct zskiplistLevel);
    *data++ = sdsHdrSize(ele_sds_type);
```
Q: why do we need to maintain the sds header size? Can't we just keep all pointers to the sds itself? From there we can always reach the skiplist data. I think the most reasonable layout is:

```
+------------------+---------+-----+---------+-------+-------------+
| backward-pointer | level-N | ... | level-0 | score | element-sds |
+------------------+---------+-----+---------+-------+-------------+
                                                     ^
                                                     |
                                                  Node ptr
```
@ranshid I agree this is probably the more optimal representation, but it requires larger code changes. We'll need to change all the direct accesses to the `zskiplistNode` structs, such as `node->level[i]`. There are many such accesses in the code. Thus, this can be the refinement to the "future optimization ideas" outlined in #2508 (comment).
For this simple PR, I think one byte for the header size is acceptable. We still get a lot of optimization without a large code change. I'm not even sure we will need the "future optimization" at all. It might not be worth the effort.
> @ranshid I agree this is probably the more optimal representation, but it requires larger code changes. We'll need to change all the direct accesses to the `zskiplistNode` structs such as `node->level[i]`. There are many such accesses in the code. Thus, this can be the refinement to the "future optimization ideas" outlined in #2508 (comment). For this simple PR, I think one byte for the header size is acceptable. We still get a lot of optimization without a large code change. I'm not even sure we will need the "future optimization" at all. It might not be worth the effort.
@zuiderkwast I agree this will require more changes. I also think what I suggested is slightly different than the suggestion in #2508 (comment), as my suggestion also saves the extra byte used to record the sds header size (which I agree is minor). But if we ARE targeting this for 9.1, I am not sure why we cannot put in the extra effort to build this the way we think is best.
That said, we can do this in 2 stages even for 9.1: first introduce the naïve change, and then follow up with a second PR to optimize it. I would still think we need performance benchmark results so we can evaluate the potential impact.
> I also think what I suggested is slightly different than the suggestion in #2508

Yes, that's right.
> if we ARE targeting this for 9.1 I am not sure why we cannot put the extra effort to build this as we think would be best

We can, but it's just a large effort for only one byte. I'm not sure it's worth it. If you want to review it, fine. I just think the first PR is small and achieves most of the benefit.

> I would still think we need to have performance benchmark results so we can evaluate the potential impact.

Sure. I can guess the outcome: no performance impact, except for insertion of very large elements, where we need to copy large strings.

I don't know if we care much about very large strings in sorted sets. Do we? For key names and hash fields, we don't; we already embed them. Using large key names is an anti-pattern.
> We can, but it's just a large effort for only one byte. I'm not sure if it's worth it. If you want to review it, fine. I just think first PR is small and achieves most of the benefit.
>
> Sure. I can guess the outcome: No performance impact, except for insertion of very large elements. In this case we need to copy large strings.
>
> I don't know if we care much about very large strings in sorted sets. Do we? For key names and hash fields, we don't. We already embed them. Using large key names is an anti-pattern.
I agree there is no expected impact, but some things (like the fact that the score drifts away from the sds pointer) might have some relevance, although that might even out with the embedded-sds improvement. I just think we need to take care about presenting potential regressions during development, since later it makes the bisecting job much more difficult.
Also, in this case there is no "keyname". The zset elements are practically values of something and might actually be large, for example very long URLs. In AWS I know it is sometimes used to store ARNs, which can reach 2K long.
> That said, we can do this in 2 stages even for 9.1. We can first introduce the naïve change and then follow up with a second PR to optimize it. I would still think we need to have performance benchmark results so we can evaluate the potential impact.

- Are you concerned about a significant performance drop in the `ZADD` command? I can run benchmark tests later to check.
- I'm not very familiar with the PR merging process yet. If this PR is targeting version 9.1, will it take a relatively long time to get merged?
> Are you concerned about a significant performance drop in the `ZADD` command? I can run benchmark tests later to check.
It's one of the concerns, yes, for large element sizes such as 2KB. Let's run a benchmark with this setup. It's possible to specify commands explicitly on the valkey-benchmark command line.
Additionally, run some benchmarks for various z-commands like `ZSCORE` and `ZRANGE` to verify that they don't get slower. @ranshid can you outline what specific benchmarks you would want to see? I think we can do it manually rather than waiting for the automated benchmarking to be in place. I don't like PRs hanging indefinitely.
> I'm not very familiar with the PR merging process yet. If this PR is targeting version 9.1, will it take a relatively long time to get merged?
We can merge this PR whenever we feel that it's ready. We don't have a feature freeze for 9.0, because we have already created the 9.0 branch.
Posted a few comments.
It's a bit late to get this into 9.0, but we can get it into 9.1 instead. Added it to the plan.
Regarding the benchmarks, I believe the memory saving is exactly 7 bytes per element, regardless of other factors, right? (We remove one pointer and add one byte for the sds header size.) The relative saving then depends very much on the size of the elements, not so much on the number of elements in a sorted set. I can see that in your Lua script you used 12-byte elements in all tests.
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: chzhoo <[email protected]>
Yes, you are right.
some small comments
Benchmark ran on this commit. Benchmark comparison by command:

- GET | cluster disabled | tls disabled
- GET | cluster enabled | tls disabled
- MGET | cluster disabled | tls disabled
- MSET | cluster disabled | tls disabled
- SET | cluster disabled | tls disabled
- SET | cluster enabled | tls disabled
Oh... no zset benchmark yet :(
@ranshid it's easier to add. It's only a few lines and we can re-run the test.
@madolson depends on what you mean by "few lines"... So I feel some things need to be done:
I will try to evaluate if I can simply do that, or I will open issues to follow up on it.
@madolson indeed, few line changes. Take a look at:
Background
By default, when the number of elements in a zset exceeds 128, the underlying data structure adopts a skiplist. We can reduce memory usage by embedding elements into the skiplist. Change the `zskiplistNode` memory layout as follows: [diagram]

Benchmark step
I generated the test data using the following Lua script and CLI command, and checked memory usage with the `INFO` command.

lua script

valkey-cli command

```shell
valkey-cli EVAL "$(cat create_zsets.lua)" 0 0 100000 ${ZSET_ELEMENT_NUM}
```

Benchmark result
Todo
PTAL. If this change looks acceptable, I will look into implementing some tests in the next round.