Skip to content

Way node index using shifted node ids #1275

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 8, 2020

Conversation

joto
Copy link
Collaborator

@joto joto commented Sep 23, 2020

OSM ways have a locality property that we can use to reduce the size of
the index looking up ways a node is in: they are often made up of
sequential node ids. If node N is contained in the way, then there is a
good chance that N+1, N+2, ... are contained in it as well. Thus, if we
group nearby nodes and create an index from node groups to ways, the
index will be significantly smaller. The drawback is that a lookup in
such an index returns false positives, i.e. ways that do not contain the
node of interest. So the smaller index is paid for with a performance
loss for updates.

"Grouping" the ids happens by shifting the id a few bits to the right.
How many exactly can be configured with the
OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable.

This commit sets the default shift for the node ids to 0, i.e. no shift,
so it is completely backwards compatible. Users can set a different
shift using the environment. See docs/bucket-index.md for details.

Setting the shift to something like 4 or 5 can significantly reduce the
disk space needed (saves something like 200 GB on a full planet), but
it costs some performance on updates (they are about 30% slower).

This is an improved version of
#1058

@joto
Copy link
Collaborator Author

joto commented Sep 23, 2020

Two things to consider in this PR:

  • This introduces for the first time a configuration option through an environment variable (OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT) The idea is that this isn't something normal users will ever need, but a setting for specialists or developers who want to debug this. I would imagine that there will be other obscure settings like this in the future. Another way to handle this would be to add a command line option similar to the '-f' of compilers that allows setting these kinds of options without having more and command line options.
  • This PR adds the functionality for the new index, but does't use it by default. If this is sucessful a later PR will change the default setting from "disabled" to a good default (probably around 5). This way we can introduce the change in master and give people the chance to test it easily without forcing anybody who runs master to think about it before they are ready.

@joto joto mentioned this pull request Sep 23, 2020
Copy link
Collaborator

@lonvia lonvia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. We need to have a discussion if we want it switched on by default. In my opinion, the significant disk saving more than outweighs the slower import. But lets defer this until after the next release.

@pnorman Any objection against the use of environment variables for configuring the feature? This is meant as a measure to keep the number of parameters in check.

OSM ways have a locality property that we can use to reduce the size of
the index looking up ways a node is in: they are often made up of
sequential node ids. If node N is contained in the way, then there is a
good chance that N+1, N+2, ... are contained in it as well. Thus, if we
group nearby nodes and create an index from node groups to ways, the
index will be significantly smaller. The drawback is that a lookup in
such an index returns false positives, i.e. ways that do not contain the
node of interest. So the smaller index is paid for with a performance
loss for updates.

"Grouping" the ids happens by shifting the id a few bits to the right.
How many exactly can be configured with the
--middle-way-node-index-id-shift option.

This commit sets the default shift for the node ids to 0, i.e. no shift,
so it is completely backwards compatible. Users can set a different
shift using the environment. See docs/bucket-index.md for details.

Setting the shift to something like 4 or 5 can significantly reduce the
disk space needed (saves something like 200 GB on a full planet), but
it costs some performance on updates (they are about 30% slower).

This is an improved version of
osm2pgsql-dev#1058
@joto
Copy link
Collaborator Author

joto commented Oct 6, 2020

Force pushed to rebase on master.

Now uses command line option --middle-way-node-index-id-shift instead of the environment variable OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT.

@mboeringa
Copy link

mboeringa commented Oct 6, 2020

Force pushed to rebase on master.

Now uses command line option --middle-way-node-index-id-shift instead of the environment variable OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT.

@joto ,

While the name of the command line option may be an accurate description of what the option does, it is rather unhelpful to the novice user of osm2pgsql...

Wouldn't it be a better idea to call it something like:

--save-node-index-space
--reduced-node-index

or something similar, and then describe in the Help pages what it actually does? Even for the current
--middle-way-node-index-id-shift
it will need a decent explanation of what it does and why its useful, at least describing it reduces temporary disk space usage at the cost of slightly slower imports.

@joto
Copy link
Collaborator Author

joto commented Oct 6, 2020

@mboeringa The ugly name is on purpose. Novice users should never ever set this, this is for experts only.

@mboeringa
Copy link

mboeringa commented Oct 7, 2020

@mboeringa The ugly name is on purpose. Novice users should never ever set this, this is for experts only.

@joto, another question regarding this: is the 30% performance hit you report only for true osm2pgsql guided updates, so this performance hit is not relevant to users who do not to do updates but instead a full re-import to update their database?

If so, I would argue that it is more likely that novice users are in the second group, only doing full re-imports. Setting up osm2pgsql to do updates is likely the "advanced" option.

If that is the case, this option may actually be desirable for novice users, as it would reduce temporary disk space usage at no to little cost, and a slightly more user-friendly name an option.

Of course, if ordinary imports are affected as well with a 30% performance hit, then I agree this is more of an "advanced" option that should likely not be default.

@lonvia lonvia merged commit e3bd89b into osm2pgsql-dev:master Oct 8, 2020
@mboeringa
Copy link

@joto , I now actually realized I mis-interpreted the '--middle-way-node-index-id-shift' command line option. I thought this was just an on/off switch for using the more space efficient index storage. However, I now see in the code it is actually directly used as a parameter for tweaking the option / shift.

As such, the chosen name is suitable, and my comments off the mark.

@joto joto deleted the way-node-index branch October 11, 2020 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants