[EXPERIMENTAL] Reduced way node index #1058

lonvia · 2020-01-11T08:49:19Z

Following #1045, I've been experimenting with reducing the size of the way-node index by creating a lossy index.

OSM ways have a locality property that we can use to reduce the index size: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates.

How it works

Creating buckets from node IDs is simple. Just divide through the expected bucket size, i.e. for bucket size 32: node_bucket = node_id/32. To get all buckets that a way has nodes in, take all nodes, divide through bucket size and remove the duplicates. In SQL that works like this:

ARRAY(SELECT DISTINCT unnest(nodes)/32)

We could create an extra column in the place_osm_ways table to save our bucket array. As it turns out (thank you Stackoverflow), this is not really necessary. We can simply create a function that returns a semi-virtual column with our values:

   CREATE FUNCTION bucket_32(planet_osm_ways) RETURNS bigint[] AS
   $func$
	   SELECT array(select distinct unnest(nodes)/32)
	   FROM   planet_osm_ways b
	   WHERE  b.id = $1.id
  $func$ LANGUAGE SQL IMMUTABLE;

As long as the function is immutable, we can even create an index over that function:

CREATE INDEX planet_osm_bucket_32 on planet_osm_ways using gin(bucket_32(planet_osm_ways));

Now we can look up which ways a node is a member by looking up the bucket and the node:

  SELECT id FROM planet_osm_ways w
  WHERE 32001039= any(w.nodes) and ARRAY[32001039::bigint/32]] && w.bucket_32;

This uses our shiny new index:

QUERY PLAN                                                             
--------------------

 Bitmap Heap Scan on planet_osm_ways w  (cost=13.03..402.48 rows=7 width=198) (actual time=0.045..0.063 rows=2 loops=1)
   Recheck Cond: ('{1000032}'::bigint[] && bucket_32(w.*))
   Filter: (32001039 = ANY (nodes))
   Rows Removed by Filter: 8
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on planet_osm_bucket_32  (cost=0.00..13.03 rows=137 width=0) (actual time=0.026..0.026 rows=10 loops=1)
		 Index Cond: ('{1000032}'::bigint[] && bucket_32(w.*))
 Planning Time: 0.579 ms
 Execution Time: 0.113 ms

Performance

I've tested this with a recent planet against a database using the null output for bucket sizes 16.32 and 64.

Bucket size	none	16	32	64
Time to create index (hh:mm)	07:30	2:50	2:35	2:25
Index size (GB)	276	29	17	11
Average number of ways retrieved	1	6.3	10.6	18.3
time to update 1 day (s)	361	419	428	431

"Average number of ways retrieved" is a rough estimate of how many ways need to be loaded from planet_osm_ways for each actual way result we expect.

The updates times are computed by running the same day update 4 times and throwing away the first result. So they are on a hot database and thus on the optimistic side. On a cold database I've seen up to 70% increase in the time. However, it should be noted that all this is done on a null output. Given that the update of the output tables is the dominating factor for updates, I'd expect a lower overall impact for a real world update of, say, an openstreetmap-carto database.

TODO

We need proper tests for the invalidation first. I got the query wrong at the first try and the tests still past happily.

The bucket indexes still have the same issue with the query planner as described in #1045. So we likely still run into jit/parallel worker slowness.

The queries only started to be reasonably fast after a full 'VACUUM ANALZE' on the planet_osm_ways table. This may be because I created the indexes on top of an existing database. It may also be that the vacuum is always necessary.

amandasaurus · 2020-03-24T08:09:21Z

I maintain many osm2pgsql installs, and the large size of the planet_osm_ways_nodes index is a continued, big annoyance for me. So this sort of space saving is fucking fantastic. 😁😁😁 This would be a game changer for me, and would solve so many headaches. Please god make this happen. I'll take the slight slow down if I can save 200GB (!!!!!) of disk space.

lonvia · 2020-03-24T08:52:23Z

@rory If you don't mind living on the edge (including risking data corruption and having to do a reimport), you could help by stress testing this in a real life scenario (aka a minutely update planet). This PR is in theory fully functional but it wold help to confirm that also in practice no update data gets lost or updates suddenly become slow because there are odd corner cases.

pnorman · 2020-03-24T14:50:03Z

This may be because I created the indexes on top of an existing database. It may also be that the vacuum is always necessary.

VACUUM ANALYZE should be run after osm2pgsql finishes creating the table and builds the indexes so all the stats are up to date

lonvia · 2020-03-24T15:27:46Z

Usually a simple ANALYSE will do for that purpose. VACUUM ANALYSE is rather on the expensive side.

pnorman · 2020-03-24T15:40:17Z

Usually a simple ANALYSE will do for that purpose. VACUUM ANALYSE is rather on the expensive side.

VACUUM ANALYZE is needed to populate the FSM and some other table statistics. If we're only running ANALYZE, we should fix it.

Improved version of osm2pgsql-dev#1058 Needs more testing.

lonvia · 2020-09-04T07:06:36Z

Closing in favour of follow-up PR #1271.

Improved version of osm2pgsql-dev#1058 Needs more testing.

OSM ways have a locality property that we can use to reduce the index size: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable. This commit still sets the default shift for the node ids to 0, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. This is an improved version of osm2pgsql-dev#1058

OSM ways have a locality property that we can use to reduce the size of the index looking up ways a node is in: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable. This commit sets the default shift for the node ids to 0, i.e. no shift, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. Setting the shift to something like 4 or 5 can significantly reduce the disk space needed (saves something like 200 GB on a full planet), but it costs some performance on updates (they are about 30% slower). This is an improved version of osm2pgsql-dev#1058

OSM ways have a locality property that we can use to reduce the size of the index looking up ways a node is in: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates. "Grouping" the ids happens by shifting the id a few bits to the right. How many exactly can be configured with the --middle-way-node-index-id-shift option. This commit sets the default shift for the node ids to 0, i.e. no shift, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details. Setting the shift to something like 4 or 5 can significantly reduce the disk space needed (saves something like 200 GB on a full planet), but it costs some performance on updates (they are about 30% slower). This is an improved version of osm2pgsql-dev#1058

lonvia added 3 commits January 3, 2020 23:24

use a bucketed way-node index

7f140a1

fix prepared statement

89eaec4

fix index lookup id

504cbf5

joto added a commit to joto/osm2pgsql that referenced this pull request Sep 3, 2020

Draft: Reduced way node index

e8132b5

Improved version of osm2pgsql-dev#1058 Needs more testing.

joto mentioned this pull request Sep 3, 2020

Draft bucket index #1271

Closed

lonvia closed this Sep 4, 2020

joto added a commit to joto/osm2pgsql that referenced this pull request Sep 4, 2020

Draft: Reduced way node index

5080d39

Improved version of osm2pgsql-dev#1058 Needs more testing.

joto mentioned this pull request Sep 23, 2020

Way node index using shifted node ids #1275

Merged

lonvia deleted the reduced-way-node-index branch December 7, 2020 13:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[EXPERIMENTAL] Reduced way node index #1058

[EXPERIMENTAL] Reduced way node index #1058

Uh oh!

lonvia commented Jan 11, 2020

Uh oh!

amandasaurus commented Mar 24, 2020

Uh oh!

lonvia commented Mar 24, 2020

Uh oh!

pnorman commented Mar 24, 2020

Uh oh!

lonvia commented Mar 24, 2020

Uh oh!

pnorman commented Mar 24, 2020

Uh oh!

lonvia commented Sep 4, 2020

Uh oh!

Uh oh!

Uh oh!

[EXPERIMENTAL] Reduced way node index #1058

[EXPERIMENTAL] Reduced way node index #1058

Uh oh!

Conversation

lonvia commented Jan 11, 2020

How it works

Performance

TODO

Uh oh!

amandasaurus commented Mar 24, 2020

Uh oh!

lonvia commented Mar 24, 2020

Uh oh!

pnorman commented Mar 24, 2020

Uh oh!

lonvia commented Mar 24, 2020

Uh oh!

pnorman commented Mar 24, 2020

Uh oh!

lonvia commented Sep 4, 2020

Uh oh!

Uh oh!