Way node index using shifted node ids #1275

joto · 2020-09-23T13:26:30Z

OSM ways have a locality property that we can use to reduce the size of
the index looking up ways a node is in: they are often made up of
sequential node ids. If node N is contained in the way, then there is a
good chance that N+1, N+2, ... are contained in it as well. Thus, if we
group nearby nodes and create an index from node groups to ways, the
index will be significantly smaller. The drawback is that a lookup in
such an index returns false positives, i.e. ways that do not contain the
node of interest. So the smaller index is paid for with a performance
loss for updates.

"Grouping" the ids happens by shifting the id a few bits to the right.
How many exactly can be configured with the
OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable.

This commit sets the default shift for the node ids to 0, i.e. no shift,
so it is completely backwards compatible. Users can set a different
shift using the environment. See docs/bucket-index.md for details.

Setting the shift to something like 4 or 5 can significantly reduce the
disk space needed (saves something like 200 GB on a full planet), but
it costs some performance on updates (they are about 30% slower).

This is an improved version of
#1058

joto · 2020-09-23T13:33:38Z

Two things to consider in this PR:

This introduces for the first time a configuration option through an environment variable (OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT) The idea is that this isn't something normal users will ever need, but a setting for specialists or developers who want to debug this. I would imagine that there will be other obscure settings like this in the future. Another way to handle this would be to add a command line option similar to the '-f' of compilers that allows setting these kinds of options without having more and command line options.
This PR adds the functionality for the new index, but does't use it by default. If this is sucessful a later PR will change the default setting from "disabled" to a good default (probably around 5). This way we can introduce the change in master and give people the chance to test it easily without forcing anybody who runs master to think about it before they are ready.

lonvia

Looks good to me. We need to have a discussion if we want it switched on by default. In my opinion, the significant disk saving more than outweighs the slower import. But lets defer this until after the next release.

@pnorman Any objection against the use of environment variables for configuring the feature? This is meant as a measure to keep the number of parameters in check.

OSM ways have a locality property that we can use to reduce the size of the index looking up ways a node is in: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the --middle-way-node-index-id-shift option. This commit sets the default shift for the node ids to 0, i.e. no shift, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. Setting the shift to something like 4 or 5 can significantly reduce the disk space needed (saves something like 200 GB on a full planet), but it costs some performance on updates (they are about 30% slower). This is an improved version of osm2pgsql-dev#1058

joto · 2020-10-06T15:48:16Z

Force pushed to rebase on master.

Now uses command line option --middle-way-node-index-id-shift instead of the environment variable OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT.

mboeringa · 2020-10-06T17:48:26Z

Force pushed to rebase on master.

Now uses command line option --middle-way-node-index-id-shift instead of the environment variable OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT.

@joto ,

While the name of the command line option may be an accurate description of what the option does, it is rather unhelpful to the novice user of osm2pgsql...

Wouldn't it be a better idea to call it something like:

--save-node-index-space
--reduced-node-index

or something similar, and then describe in the Help pages what it actually does? Even for the current
--middle-way-node-index-id-shift
it will need a decent explanation of what it does and why its useful, at least describing it reduces temporary disk space usage at the cost of slightly slower imports.

joto · 2020-10-06T21:28:37Z

@mboeringa The ugly name is on purpose. Novice users should never ever set this, this is for experts only.

mboeringa · 2020-10-07T06:55:14Z

@mboeringa The ugly name is on purpose. Novice users should never ever set this, this is for experts only.

@joto, another question regarding this: is the 30% performance hit you report only for true osm2pgsql guided updates, so this performance hit is not relevant to users who do not to do updates but instead a full re-import to update their database?

If so, I would argue that it is more likely that novice users are in the second group, only doing full re-imports. Setting up osm2pgsql to do updates is likely the "advanced" option.

If that is the case, this option may actually be desirable for novice users, as it would reduce temporary disk space usage at no to little cost, and a slightly more user-friendly name an option.

Of course, if ordinary imports are affected as well with a 30% performance hit, then I agree this is more of an "advanced" option that should likely not be default.

mboeringa · 2020-10-10T08:35:20Z

@joto , I now actually realized I mis-interpreted the '--middle-way-node-index-id-shift' command line option. I thought this was just an on/off switch for using the more space efficient index storage. However, I now see in the code it is actually directly used as a parameter for tweaking the option / shift.

As such, the chosen name is suitable, and my comments off the mark.

joto mentioned this pull request Sep 23, 2020

Draft bucket index #1271

Closed

lonvia approved these changes Sep 23, 2020

View reviewed changes

joto force-pushed the way-node-index branch from 7f4e9c6 to 5b05083 Compare September 23, 2020 13:50

joto force-pushed the way-node-index branch from 5b05083 to 79c1396 Compare October 5, 2020 13:40

joto force-pushed the way-node-index branch from 79c1396 to 2d8f5fa Compare October 6, 2020 15:46

lonvia merged commit e3bd89b into osm2pgsql-dev:master Oct 8, 2020

joto deleted the way-node-index branch October 11, 2020 09:02

lonvia mentioned this pull request Jun 26, 2022

Enable lossy way-node indexes by default? #1687

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Way node index using shifted node ids #1275

Way node index using shifted node ids #1275

Uh oh!

joto commented Sep 23, 2020

Uh oh!

joto commented Sep 23, 2020

Uh oh!

lonvia left a comment

Uh oh!

joto commented Oct 6, 2020

Uh oh!

mboeringa commented Oct 6, 2020 •

edited

Loading

Uh oh!

joto commented Oct 6, 2020

Uh oh!

mboeringa commented Oct 7, 2020 •

edited

Loading

Uh oh!

mboeringa commented Oct 10, 2020

Uh oh!

Uh oh!

Uh oh!

Way node index using shifted node ids #1275

Way node index using shifted node ids #1275

Uh oh!

Conversation

joto commented Sep 23, 2020

Uh oh!

joto commented Sep 23, 2020

Uh oh!

lonvia left a comment

Choose a reason for hiding this comment

Uh oh!

joto commented Oct 6, 2020

Uh oh!

mboeringa commented Oct 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joto commented Oct 6, 2020

Uh oh!

mboeringa commented Oct 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mboeringa commented Oct 10, 2020

Uh oh!

Uh oh!

mboeringa commented Oct 6, 2020 •

edited

Loading

mboeringa commented Oct 7, 2020 •

edited

Loading