-
-
Notifications
You must be signed in to change notification settings - Fork 474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EXPERIMENTAL] Reduced way node index #1058
Conversation
I maintain many |
@rory If you don't mind living on the edge (including risking data corruption and having to do a reimport), you could help by stress testing this in a real life scenario (aka a minutely update planet). This PR is in theory fully functional but it wold help to confirm that also in practice no update data gets lost or updates suddenly become slow because there are odd corner cases. |
|
Usually a simple |
VACUUM ANALYZE is needed to populate the FSM and some other table statistics. If we're only running ANALYZE, we should fix it. |
Improved version of osm2pgsql-dev#1058 Needs more testing.
Closing in favour of follow-up PR #1271. |
Improved version of osm2pgsql-dev#1058 Needs more testing.
OSM ways have a locality property that we can use to reduce the index size: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable. This commit still sets the default shift for the node ids to 0, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. This is an improved version of osm2pgsql-dev#1058
OSM ways have a locality property that we can use to reduce the size of the index looking up ways a node is in: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable. This commit sets the default shift for the node ids to 0, i.e. no shift, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. Setting the shift to something like 4 or 5 can significantly reduce the disk space needed (saves something like 200 GB on a full planet), but it costs some performance on updates (they are about 30% slower). This is an improved version of osm2pgsql-dev#1058
OSM ways have a locality property that we can use to reduce the size of the index looking up ways a node is in: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable. This commit sets the default shift for the node ids to 0, i.e. no shift, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. Setting the shift to something like 4 or 5 can significantly reduce the disk space needed (saves something like 200 GB on a full planet), but it costs some performance on updates (they are about 30% slower). This is an improved version of osm2pgsql-dev#1058
OSM ways have a locality property that we can use to reduce the size of the index looking up ways a node is in: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the --middle-way-node-index-id-shift option. This commit sets the default shift for the node ids to 0, i.e. no shift, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. Setting the shift to something like 4 or 5 can significantly reduce the disk space needed (saves something like 200 GB on a full planet), but it costs some performance on updates (they are about 30% slower). This is an improved version of osm2pgsql-dev#1058
OSM ways have a locality property that we can use to reduce the size of the index looking up ways a node is in: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the --middle-way-node-index-id-shift option. This commit sets the default shift for the node ids to 0, i.e. no shift, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. Setting the shift to something like 4 or 5 can significantly reduce the disk space needed (saves something like 200 GB on a full planet), but it costs some performance on updates (they are about 30% slower). This is an improved version of osm2pgsql-dev#1058
Following #1045, I've been experimenting with reducing the size of the way-node index by creating a lossy index.
OSM ways have a locality property that we can use to reduce the index size: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates.
How it works
Creating buckets from node IDs is simple. Just divide through the expected bucket size, i.e. for bucket size 32: node_bucket = node_id/32. To get all buckets that a way has nodes in, take all nodes, divide through bucket size and remove the duplicates. In SQL that works like this:
We could create an extra column in the place_osm_ways table to save our bucket array. As it turns out (thank you Stackoverflow), this is not really necessary. We can simply create a function that returns a semi-virtual column with our values:
As long as the function is immutable, we can even create an index over that function:
Now we can look up which ways a node is a member by looking up the bucket and the node:
This uses our shiny new index:
Performance
I've tested this with a recent planet against a database using the
null
output for bucket sizes 16.32 and 64."Average number of ways retrieved" is a rough estimate of how many ways need to be loaded from planet_osm_ways for each actual way result we expect.
The updates times are computed by running the same day update 4 times and throwing away the first result. So they are on a hot database and thus on the optimistic side. On a cold database I've seen up to 70% increase in the time. However, it should be noted that all this is done on a null output. Given that the update of the output tables is the dominating factor for updates, I'd expect a lower overall impact for a real world update of, say, an openstreetmap-carto database.
TODO
We need proper tests for the invalidation first. I got the query wrong at the first try and the tests still past happily.
The bucket indexes still have the same issue with the query planner as described in #1045. So we likely still run into jit/parallel worker slowness.
The queries only started to be reasonably fast after a full 'VACUUM ANALZE' on the planet_osm_ways table. This may be because I created the indexes on top of an existing database. It may also be that the vacuum is always necessary.