-
-
Notifications
You must be signed in to change notification settings - Fork 477
Way node index using shifted node ids #1275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Two things to consider in this PR:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. We need to have a discussion if we want it switched on by default. In my opinion, the significant disk saving more than outweighs the slower import. But lets defer this until after the next release.
@pnorman Any objection against the use of environment variables for configuring the feature? This is meant as a measure to keep the number of parameters in check.
7f4e9c6
to
5b05083
Compare
OSM ways have a locality property that we can use to reduce the size of the index looking up ways a node is in: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the --middle-way-node-index-id-shift option. This commit sets the default shift for the node ids to 0, i.e. no shift, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. Setting the shift to something like 4 or 5 can significantly reduce the disk space needed (saves something like 200 GB on a full planet), but it costs some performance on updates (they are about 30% slower). This is an improved version of osm2pgsql-dev#1058
Force pushed to rebase on master. Now uses command line option |
@joto , While the name of the command line option may be an accurate description of what the option does, it is rather unhelpful to the novice user of osm2pgsql... Wouldn't it be a better idea to call it something like: --save-node-index-space or something similar, and then describe in the Help pages what it actually does? Even for the current |
@mboeringa The ugly name is on purpose. Novice users should never ever set this, this is for experts only. |
@joto, another question regarding this: is the 30% performance hit you report only for true osm2pgsql guided updates, so this performance hit is not relevant to users who do not to do updates but instead a full re-import to update their database? If so, I would argue that it is more likely that novice users are in the second group, only doing full re-imports. Setting up osm2pgsql to do updates is likely the "advanced" option. If that is the case, this option may actually be desirable for novice users, as it would reduce temporary disk space usage at no to little cost, and a slightly more user-friendly name an option. Of course, if ordinary imports are affected as well with a 30% performance hit, then I agree this is more of an "advanced" option that should likely not be default. |
@joto , I now actually realized I mis-interpreted the '--middle-way-node-index-id-shift' command line option. I thought this was just an on/off switch for using the more space efficient index storage. However, I now see in the code it is actually directly used as a parameter for tweaking the option / shift. As such, the chosen name is suitable, and my comments off the mark. |
OSM ways have a locality property that we can use to reduce the size of
the index looking up ways a node is in: they are often made up of
sequential node ids. If node N is contained in the way, then there is a
good chance that N+1, N+2, ... are contained in it as well. Thus, if we
group nearby nodes and create an index from node groups to ways, the
index will be significantly smaller. The drawback is that a lookup in
such an index returns false positives, i.e. ways that do not contain the
node of interest. So the smaller index is paid for with a performance
loss for updates.
"Grouping" the ids happens by shifting the id a few bits to the right.
How many exactly can be configured with the
OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable.
This commit sets the default shift for the node ids to 0, i.e. no shift,
so it is completely backwards compatible. Users can set a different
shift using the environment. See docs/bucket-index.md for details.
Setting the shift to something like 4 or 5 can significantly reduce the
disk space needed (saves something like 200 GB on a full planet), but
it costs some performance on updates (they are about 30% slower).
This is an improved version of
#1058