Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EXPERIMENTAL] Reduced way node index #1058

Closed
wants to merge 3 commits into from

Conversation

lonvia
Copy link
Collaborator

@lonvia lonvia commented Jan 11, 2020

Following #1045, I've been experimenting with reducing the size of the way-node index by creating a lossy index.

OSM ways have a locality property that we can use to reduce the index size: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates.

How it works

Creating buckets from node IDs is simple. Just divide through the expected bucket size, i.e. for bucket size 32: node_bucket = node_id/32. To get all buckets that a way has nodes in, take all nodes, divide through bucket size and remove the duplicates. In SQL that works like this:

ARRAY(SELECT DISTINCT unnest(nodes)/32)

We could create an extra column in the place_osm_ways table to save our bucket array. As it turns out (thank you Stackoverflow), this is not really necessary. We can simply create a function that returns a semi-virtual column with our values:

   CREATE FUNCTION bucket_32(planet_osm_ways) RETURNS bigint[] AS
   $func$
	   SELECT array(select distinct unnest(nodes)/32)
	   FROM   planet_osm_ways b
	   WHERE  b.id = $1.id
  $func$ LANGUAGE SQL IMMUTABLE;

As long as the function is immutable, we can even create an index over that function:

CREATE INDEX planet_osm_bucket_32 on planet_osm_ways using gin(bucket_32(planet_osm_ways));

Now we can look up which ways a node is a member by looking up the bucket and the node:

  SELECT id FROM planet_osm_ways w
  WHERE 32001039= any(w.nodes) and ARRAY[32001039::bigint/32]] && w.bucket_32;

This uses our shiny new index:

QUERY PLAN                                                             
--------------------

 Bitmap Heap Scan on planet_osm_ways w  (cost=13.03..402.48 rows=7 width=198) (actual time=0.045..0.063 rows=2 loops=1)
   Recheck Cond: ('{1000032}'::bigint[] && bucket_32(w.*))
   Filter: (32001039 = ANY (nodes))
   Rows Removed by Filter: 8
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on planet_osm_bucket_32  (cost=0.00..13.03 rows=137 width=0) (actual time=0.026..0.026 rows=10 loops=1)
		 Index Cond: ('{1000032}'::bigint[] && bucket_32(w.*))
 Planning Time: 0.579 ms
 Execution Time: 0.113 ms

Performance

I've tested this with a recent planet against a database using the null output for bucket sizes 16.32 and 64.

Bucket size none 16 32 64
Time to create index (hh:mm) 07:30 2:50 2:35 2:25
Index size (GB) 276 29 17 11
Average number of ways retrieved 1 6.3 10.6 18.3
time to update 1 day (s) 361 419 428 431

"Average number of ways retrieved" is a rough estimate of how many ways need to be loaded from planet_osm_ways for each actual way result we expect.

The updates times are computed by running the same day update 4 times and throwing away the first result. So they are on a hot database and thus on the optimistic side. On a cold database I've seen up to 70% increase in the time. However, it should be noted that all this is done on a null output. Given that the update of the output tables is the dominating factor for updates, I'd expect a lower overall impact for a real world update of, say, an openstreetmap-carto database.

TODO

We need proper tests for the invalidation first. I got the query wrong at the first try and the tests still past happily.

The bucket indexes still have the same issue with the query planner as described in #1045. So we likely still run into jit/parallel worker slowness.

The queries only started to be reasonably fast after a full 'VACUUM ANALZE' on the planet_osm_ways table. This may be because I created the indexes on top of an existing database. It may also be that the vacuum is always necessary.

@amandasaurus
Copy link
Contributor

I maintain many osm2pgsql installs, and the large size of the planet_osm_ways_nodes index is a continued, big annoyance for me. So this sort of space saving is fucking fantastic. 😁😁😁 This would be a game changer for me, and would solve so many headaches. Please god make this happen. I'll take the slight slow down if I can save 200GB (!!!!!) of disk space.

@lonvia
Copy link
Collaborator Author

lonvia commented Mar 24, 2020

@rory If you don't mind living on the edge (including risking data corruption and having to do a reimport), you could help by stress testing this in a real life scenario (aka a minutely update planet). This PR is in theory fully functional but it wold help to confirm that also in practice no update data gets lost or updates suddenly become slow because there are odd corner cases.

@pnorman
Copy link
Collaborator

pnorman commented Mar 24, 2020

This may be because I created the indexes on top of an existing database. It may also be that the vacuum is always necessary.

VACUUM ANALYZE should be run after osm2pgsql finishes creating the table and builds the indexes so all the stats are up to date

@lonvia
Copy link
Collaborator Author

lonvia commented Mar 24, 2020

Usually a simple ANALYSE will do for that purpose. VACUUM ANALYSE is rather on the expensive side.

@pnorman
Copy link
Collaborator

pnorman commented Mar 24, 2020

Usually a simple ANALYSE will do for that purpose. VACUUM ANALYSE is rather on the expensive side.

VACUUM ANALYZE is needed to populate the FSM and some other table statistics. If we're only running ANALYZE, we should fix it.

joto added a commit to joto/osm2pgsql that referenced this pull request Sep 3, 2020
Improved version of osm2pgsql-dev#1058

Needs more testing.
@joto joto mentioned this pull request Sep 3, 2020
@lonvia
Copy link
Collaborator Author

lonvia commented Sep 4, 2020

Closing in favour of follow-up PR #1271.

@lonvia lonvia closed this Sep 4, 2020
joto added a commit to joto/osm2pgsql that referenced this pull request Sep 4, 2020
Improved version of osm2pgsql-dev#1058

Needs more testing.
joto added a commit to joto/osm2pgsql that referenced this pull request Sep 22, 2020
OSM ways have a locality property that we can use to reduce the index
size: they are often made up of sequential node ids. If node N is
contained in the way, then there is a good chance that N+1, N+2, ... are
contained in it as well. Thus, if we group nearby nodes and create an
index from node groups to ways, the index will be significantly smaller.
The drawback is that a lookup in such an index returns false positives,
i.e. ways that do not contain the node of interest. So the smaller index
is paid for with a performance loss for updates.

"Grouping" the ids happens by shifting the id a few bits to the right.
How many exactly can be configured with the
OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable.

This commit still sets the default shift for the node ids to 0, so it is
completely backwards compatible. Users can set a different shift using
the environment. See docs/bucket-index.md for details.

This is an improved version of
osm2pgsql-dev#1058
joto added a commit to joto/osm2pgsql that referenced this pull request Sep 23, 2020
OSM ways have a locality property that we can use to reduce the size of
the index looking up ways a node is in: they are often made up of
sequential node ids. If node N is contained in the way, then there is a
good chance that N+1, N+2, ... are contained in it as well. Thus, if we
group nearby nodes and create an index from node groups to ways, the
index will be significantly smaller. The drawback is that a lookup in
such an index returns false positives, i.e. ways that do not contain the
node of interest. So the smaller index is paid for with a performance
loss for updates.

"Grouping" the ids happens by shifting the id a few bits to the right.
How many exactly can be configured with the
OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable.

This commit sets the default shift for the node ids to 0, i.e. no shift,
so it is completely backwards compatible. Users can set a different
shift using the environment. See docs/bucket-index.md for details.

Setting the shift to something like 4 or 5 can significantly reduce the
disk space needed (saves something like 200 GB on a full planet), but
it costs some performance on updates (they are about 30% slower).

This is an improved version of
osm2pgsql-dev#1058
joto added a commit to joto/osm2pgsql that referenced this pull request Sep 23, 2020
OSM ways have a locality property that we can use to reduce the size of
the index looking up ways a node is in: they are often made up of
sequential node ids. If node N is contained in the way, then there is a
good chance that N+1, N+2, ... are contained in it as well. Thus, if we
group nearby nodes and create an index from node groups to ways, the
index will be significantly smaller. The drawback is that a lookup in
such an index returns false positives, i.e. ways that do not contain the
node of interest. So the smaller index is paid for with a performance
loss for updates.

"Grouping" the ids happens by shifting the id a few bits to the right.
How many exactly can be configured with the
OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable.

This commit sets the default shift for the node ids to 0, i.e. no shift,
so it is completely backwards compatible. Users can set a different
shift using the environment. See docs/bucket-index.md for details.

Setting the shift to something like 4 or 5 can significantly reduce the
disk space needed (saves something like 200 GB on a full planet), but
it costs some performance on updates (they are about 30% slower).

This is an improved version of
osm2pgsql-dev#1058
joto added a commit to joto/osm2pgsql that referenced this pull request Oct 5, 2020
OSM ways have a locality property that we can use to reduce the size of
the index looking up ways a node is in: they are often made up of
sequential node ids. If node N is contained in the way, then there is a
good chance that N+1, N+2, ... are contained in it as well. Thus, if we
group nearby nodes and create an index from node groups to ways, the
index will be significantly smaller. The drawback is that a lookup in
such an index returns false positives, i.e. ways that do not contain the
node of interest. So the smaller index is paid for with a performance
loss for updates.

"Grouping" the ids happens by shifting the id a few bits to the right.
How many exactly can be configured with the
--middle-way-node-index-id-shift option.

This commit sets the default shift for the node ids to 0, i.e. no shift,
so it is completely backwards compatible. Users can set a different
shift using the environment. See docs/bucket-index.md for details.

Setting the shift to something like 4 or 5 can significantly reduce the
disk space needed (saves something like 200 GB on a full planet), but
it costs some performance on updates (they are about 30% slower).

This is an improved version of
osm2pgsql-dev#1058
joto added a commit to joto/osm2pgsql that referenced this pull request Oct 6, 2020
OSM ways have a locality property that we can use to reduce the size of
the index looking up ways a node is in: they are often made up of
sequential node ids. If node N is contained in the way, then there is a
good chance that N+1, N+2, ... are contained in it as well. Thus, if we
group nearby nodes and create an index from node groups to ways, the
index will be significantly smaller. The drawback is that a lookup in
such an index returns false positives, i.e. ways that do not contain the
node of interest. So the smaller index is paid for with a performance
loss for updates.

"Grouping" the ids happens by shifting the id a few bits to the right.
How many exactly can be configured with the
--middle-way-node-index-id-shift option.

This commit sets the default shift for the node ids to 0, i.e. no shift,
so it is completely backwards compatible. Users can set a different
shift using the environment. See docs/bucket-index.md for details.

Setting the shift to something like 4 or 5 can significantly reduce the
disk space needed (saves something like 200 GB on a full planet), but
it costs some performance on updates (they are about 30% slower).

This is an improved version of
osm2pgsql-dev#1058
@lonvia lonvia deleted the reduced-way-node-index branch December 7, 2020 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants