Commit 5b05083

Way node index using shifted node ids
OSM ways have a locality property that we can use to reduce the size of the index used to look up the ways a node is in: ways are often made up of sequential node ids. If node N is contained in a way, there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller.

The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates.

"Grouping" the ids happens by shifting the id a few bits to the right. How many bits exactly can be configured with the OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT environment variable.

This commit sets the default shift for the node ids to 0, i.e. no shift, so it is completely backwards compatible. Users can set a different shift using the environment. See docs/bucket-index.md for details.

Setting the shift to something like 4 or 5 can significantly reduce the disk space needed (saves something like 200 GB on a full planet), but it costs some performance on updates (they are about 30% slower).

This is an improved version of #1058.
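The grouping idea from the commit message can be sketched in a few lines of C++. This is an illustration only, not code from this commit: the name `index_bucket` mirrors the SQL function the commit creates, while `sequential_way`, the sample node ids, and the use of `std::set` are assumptions made for the example. It counts how many distinct index keys one way produces for a given shift.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: the distinct bucket ids for the nodes of one way.
// A shift of 0 keeps every node id as its own bucket.
// Assumes non-negative node ids.
std::set<std::int64_t> index_bucket(std::vector<std::int64_t> const &nodes,
                                    unsigned shift)
{
    std::set<std::int64_t> buckets;
    for (auto const id : nodes) {
        buckets.insert(id >> shift);
    }
    return buckets;
}

// Hypothetical helper: a way made of `count` sequential node ids
// starting at `first`.
std::vector<std::int64_t> sequential_way(std::int64_t first, int count)
{
    std::vector<std::int64_t> nodes;
    for (int i = 0; i < count; ++i) {
        nodes.push_back(first + i);
    }
    return nodes;
}
```

Under these assumptions, a way covering node ids 1024 to 1055 needs 32 index keys at shift 0 but only one at shift 5. A way that straddles a bucket boundary, for example ids 1000 to 1031, needs two.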
1 parent a424919 commit 5b05083

4 files changed: +190 -9 lines changed

docs/bucket-index.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+
+NOTE: This is only available from osm2pgsql version 1.4.0!
+
+NOTE: The default is still to create the old index for now.
+
+# Bucket index for slim mode
+
+Osm2pgsql can use an index for way node lookups in slim mode that needs a lot
+less disk space than earlier versions did. For a planet the savings can be
+about 200 GB! Lookup times are slightly slower, but this shouldn't be an issue
+for most people.
+
+*If you are not using slim mode and/or not doing updates of your database, this
+does not apply to you.*
+
+For backwards compatibility osm2pgsql will never update an existing database
+to the new index. It will keep using the old index. So you do not have to do
+anything when upgrading osm2pgsql.
+
+If you want to use the new index, there are two ways of doing this: The "safe"
+way for most users and the "do-it-yourself" way for expert users. Note that
+once you have switched to the new index, older versions of osm2pgsql will not
+work correctly any more.
+
+## Update for most users
+
+NOTE: This does not work yet. Currently the default is still to create the
+old type of index.
+
+If your database was created with an older version of osm2pgsql you might want
+to start again from an empty database. Just do a reimport and osm2pgsql will
+use the new space-saving index.
+
+## Update for expert users
+
+This is only for users who are very familiar with osm2pgsql and PostgreSQL
+operation. You can break your osm2pgsql database beyond repair if something
+goes wrong here and you might not even notice.
+
+You can create the index yourself by following these steps:
+
+Drop the existing index. Replace `{prefix}` by the prefix you are using.
+Usually this is `planet_osm`:
+
+```
+DROP INDEX {prefix}_ways_nodes_idx;
+```
+
+Create the `index_bucket` function needed for the index. Replace
+`{way_node_index_id_shift}` by the number of bits you want the id to be
+shifted. If you don't have a reason to use something else, use `5`:
+
+```
+CREATE FUNCTION {prefix}_index_bucket(int8[]) RETURNS int8[] AS $$
+  SELECT ARRAY(SELECT DISTINCT unnest($1) >> {way_node_index_id_shift})
+$$ LANGUAGE SQL IMMUTABLE;
+```
+
+Now you can create the new index. Again, replace `{prefix}` by the prefix
+you are using:
+
+```
+CREATE INDEX {prefix}_ways_nodes_bucket_idx ON {prefix}_ways
+  USING GIN ({prefix}_index_bucket(nodes))
+  WITH (fastupdate = off);
+```
+
+If you want to create the index in a specific tablespace you can do this:
+
+```
+CREATE INDEX {prefix}_ways_nodes_bucket_idx ON {prefix}_ways
+  USING GIN ({prefix}_index_bucket(nodes))
+  WITH (fastupdate = off) TABLESPACE {tablespace};
+```
+
+## Id shift (for experts)
+
+When creating a new database (when used in create mode with slim option),
+osm2pgsql can create a bucket index using a configurable id shift.
+
+You can set the environment variable `OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT` to the
+shift you want. Values between about 3 and 6 might make sense.
+
+To completely disable the bucket index and create an index compatible with
+earlier versions of osm2pgsql, set `OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT` to `0`.
+(This is currently still the default.)
+
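To get a feel for what a given shift value means, the following sketch computes the range of node ids that share a bucket with a given id. It is an illustration under the assumption of non-negative ids; `bucket_range` is a hypothetical helper and not part of osm2pgsql.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: first and last node id that land in the same bucket
// as `id` for the given shift. Assumes id >= 0 and shift < 63.
std::pair<std::int64_t, std::int64_t> bucket_range(std::int64_t id,
                                                   unsigned shift)
{
    // Clearing the lowest `shift` bits gives the first id of the bucket.
    std::int64_t const first = (id >> shift) << shift;
    // A bucket spans 2^shift consecutive ids.
    std::int64_t const last = first + ((std::int64_t{1} << shift) - 1);
    return {first, last};
}
```

A shift of s puts 2^s consecutive ids into one bucket: 8 ids for a shift of 3, 32 for a shift of 5, and 64 for a shift of 6. Larger buckets mean fewer index entries but more false positives per lookup.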

src/middle-pgsql.cpp

Lines changed: 51 additions & 9 deletions
@@ -55,8 +55,8 @@ static std::string build_sql(options_t const &options, char const *templ)
         fmt::arg("unlogged", options.droptemp ? "UNLOGGED" : ""),
         fmt::arg("using_tablespace", using_tablespace),
         fmt::arg("data_tablespace", tablespace_clause(options.tblsslim_data)),
-        fmt::arg("index_tablespace",
-                 tablespace_clause(options.tblsslim_index)));
+        fmt::arg("index_tablespace", tablespace_clause(options.tblsslim_index)),
+        fmt::arg("way_node_index_id_shift", options.way_node_index_id_shift));
 }
 
 middle_pgsql_t::table_desc::table_desc(options_t const &options,
@@ -634,7 +634,8 @@ static table_sql sql_for_nodes() noexcept
     return sql;
 }
 
-static table_sql sql_for_ways() noexcept
+static table_sql sql_for_ways(bool has_bucket_index,
+                              uint8_t way_node_index_id_shift) noexcept
 {
     table_sql sql{};
 
@@ -653,12 +654,33 @@ static table_sql sql_for_ways() noexcept
                        " SELECT id, nodes, tags"
                        " FROM {prefix}_ways WHERE id = ANY($1::int8[]);\n";
 
-    sql.prepare_mark = "PREPARE mark_ways_by_node(int8) AS"
-                       " SELECT id FROM {prefix}_ways"
-                       " WHERE nodes && ARRAY[$1];\n";
+    if (has_bucket_index) {
+        sql.prepare_mark = "PREPARE mark_ways_by_node(int8) AS"
+                           " SELECT id FROM {prefix}_ways w"
+                           " WHERE $1 = ANY(nodes)"
+                           " AND {prefix}_index_bucket(w.nodes)"
+                           " && {prefix}_index_bucket(ARRAY[$1]);\n";
+    } else {
+        sql.prepare_mark = "PREPARE mark_ways_by_node(int8) AS"
+                           " SELECT id FROM {prefix}_ways"
+                           " WHERE nodes && ARRAY[$1];\n";
+    }
 
-    sql.create_index = "CREATE INDEX ON {prefix}_ways USING GIN (nodes)"
-                       " WITH (fastupdate = off) {index_tablespace};\n";
+    if (way_node_index_id_shift == 0) {
+        sql.create_index = "CREATE INDEX ON {prefix}_ways USING GIN (nodes)"
+                           " WITH (fastupdate = off) {index_tablespace};\n";
+    } else {
+        sql.create_index = "CREATE OR REPLACE FUNCTION"
+                           " {prefix}_index_bucket(int8[])"
+                           " RETURNS int8[] AS $$\n"
+                           " SELECT ARRAY(SELECT DISTINCT"
+                           " unnest($1) >> {way_node_index_id_shift})\n"
+                           "$$ LANGUAGE SQL IMMUTABLE;\n"
+                           "CREATE INDEX {prefix}_ways_nodes_bucket_idx"
+                           " ON {prefix}_ways"
+                           " USING GIN ({prefix}_index_bucket(nodes))"
+                           " WITH (fastupdate = off) {index_tablespace};\n";
+    }
 
     return sql;
 }
@@ -697,6 +719,16 @@ static table_sql sql_for_relations() noexcept
     return sql;
 }
 
+static bool check_bucket_index(pg_conn_t *db_connection,
+                               std::string const &prefix)
+{
+    auto const res = db_connection->query(
+        PGRES_TUPLES_OK,
+        "SELECT relname FROM pg_class WHERE relkind='i' AND"
+        " relname = '{}_ways_nodes_bucket_idx';"_format(prefix));
+    return res.num_tuples() > 0;
+}
+
 middle_pgsql_t::middle_pgsql_t(options_t const *options)
 : m_append(options->append), m_out_options(options),
   m_cache(new node_ram_cache{options->alloc_chunkwise | ALLOC_LOSSY,
@@ -712,8 +744,18 @@ middle_pgsql_t::middle_pgsql_t(options_t const *options)
 
     fmt::print(stderr, "Mid: pgsql, cache={}\n", options->cache);
 
+    bool const has_bucket_index =
+        check_bucket_index(&m_db_connection, options->prefix);
+
+    if (!has_bucket_index && options->append) {
+        fmt::print(stderr, "You don't have a bucket index. See"
+                           " docs/bucket-index.md for details.\n");
+    }
+
     m_tables[NODE_TABLE] = table_desc{*options, sql_for_nodes()};
-    m_tables[WAY_TABLE] = table_desc{*options, sql_for_ways()};
+    m_tables[WAY_TABLE] =
+        table_desc{*options, sql_for_ways(has_bucket_index,
+                                          options->way_node_index_id_shift)};
     m_tables[REL_TABLE] = table_desc{*options, sql_for_relations()};
 }
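The bucket-index version of `mark_ways_by_node` combines two conditions. The bucket overlap is the part a GIN index on `{prefix}_index_bucket(nodes)` can answer, but it may match ways that merely share a bucket with the node. The exact `$1 = ANY(nodes)` test then discards those false positives. A rough in-memory analogue in C++ might look like the sketch below. The `way_t` struct, the `std::multimap` standing in for the index, and all function names are assumptions made for illustration; none of this is osm2pgsql code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical in-memory stand-in for a row of the ways table.
struct way_t
{
    std::int64_t id;
    std::vector<std::int64_t> nodes;
};

// Stand-in for the index: bucket id -> position of the way in `ways`.
using bucket_index_t = std::multimap<std::int64_t, std::size_t>;

bucket_index_t build_index(std::vector<way_t> const &ways, unsigned shift)
{
    bucket_index_t index;
    for (std::size_t n = 0; n < ways.size(); ++n) {
        std::set<std::int64_t> buckets; // one entry per distinct bucket
        for (auto const node : ways[n].nodes) {
            buckets.insert(node >> shift);
        }
        for (auto const bucket : buckets) {
            index.emplace(bucket, n);
        }
    }
    return index;
}

// Step 1: all ways sharing a bucket with the node (may be false positives).
std::vector<std::int64_t> candidate_ways(std::vector<way_t> const &ways,
                                         bucket_index_t const &index,
                                         std::int64_t node, unsigned shift)
{
    std::vector<std::int64_t> ids;
    auto const range = index.equal_range(node >> shift);
    for (auto it = range.first; it != range.second; ++it) {
        ids.push_back(ways[it->second].id);
    }
    return ids;
}

// Step 2: keep only the candidates that really contain the node.
std::vector<std::int64_t> ways_containing(std::vector<way_t> const &ways,
                                          bucket_index_t const &index,
                                          std::int64_t node, unsigned shift)
{
    std::vector<std::int64_t> ids;
    auto const range = index.equal_range(node >> shift);
    for (auto it = range.first; it != range.second; ++it) {
        auto const &nodes = ways[it->second].nodes;
        if (std::find(nodes.begin(), nodes.end(), node) != nodes.end()) {
            ids.push_back(ways[it->second].id);
        }
    }
    return ids;
}
```

For example, with way 1 holding nodes {1024, 1025}, way 2 holding {1030, 1031}, and a shift of 5, a lookup for node 1025 finds both ways as candidates because all four ids fall into bucket 32. Only way 1 survives the exact check. The extra filtering work is the update-time cost the commit message mentions.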

src/options.cpp

Lines changed: 37 additions & 0 deletions
@@ -324,8 +324,45 @@ static osmium::Box parse_bbox(char const *bbox)
     return osmium::Box{minx, miny, maxx, maxy};
 }
 
+template <typename T>
+T get_env_unsigned(char const *var, T default_value)
+{
+    char const *const str = std::getenv(var);
+
+    if (!str) {
+        return default_value;
+    }
+
+    char *end = nullptr;
+    auto const val = std::strtoull(str, &end, 10);
+
+    if (*end != '\0') {
+        fmt::print("Warning! Could not parse value of env variable {}.\n"
+                   " Using default value of {}.\n\n",
+                   var, default_value);
+        return default_value;
+    }
+
+    if (val > std::numeric_limits<T>::max()) {
+        fmt::print("Warning! Value of env variable {} out of range.\n"
+                   " Using default value of {}.\n\n",
+                   var, default_value);
+        return default_value;
+    }
+
+    return static_cast<T>(val);
+}
+
+void options_t::get_options_from_env()
+{
+    way_node_index_id_shift = get_env_unsigned<uint8_t>(
+        "OSM2PGSQL_WAY_NODE_INDEX_ID_SHIFT", default_way_node_index_id_shift);
+}
+
 options_t::options_t(int argc, char *argv[]) : options_t()
 {
+    get_options_from_env();
+
     bool help_verbose = false; // Will be set when -v/--verbose is set
 
     int c;
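One property of `get_env_unsigned` worth knowing: `std::strtoull` performs no conversion on an empty string, returns 0, and leaves `end` pointing at the terminating `'\0'`, so an empty variable is silently treated as 0 and does not trigger the parse warning. It also skips leading whitespace and accepts a sign. With a default of 0 this is harmless, but a stricter parser might be preferable if the default ever changes. Below is a minimal sketch of one, assuming C++17 and `std::from_chars`. `parse_unsigned` is a hypothetical name, and this is not what the commit does.

```cpp
#include <charconv>
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical stricter parser (assumes C++17): accepts only a non-empty run
// of decimal digits that fits into T, otherwise returns std::nullopt.
template <typename T>
std::optional<T> parse_unsigned(std::string_view str)
{
    unsigned long long val = 0;
    char const *const first = str.data();
    char const *const last = first + str.size();

    auto const result = std::from_chars(first, last, val, 10);

    // Fails on empty input, signs, whitespace, trailing characters,
    // and values that overflow unsigned long long.
    if (result.ec != std::errc{} || result.ptr != last) {
        return std::nullopt;
    }

    // Reject values that do not fit into the target type.
    if (val > static_cast<unsigned long long>(std::numeric_limits<T>::max())) {
        return std::nullopt;
    }

    return static_cast<T>(val);
}
```

A caller could fall back to the default and print a warning whenever the result is empty, which keeps the behaviour of the committed function for valid input while rejecting empty or signed values.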

src/options.hpp

Lines changed: 15 additions & 0 deletions
@@ -36,6 +36,12 @@ struct database_options_t
     std::string conninfo() const;
 };
 
+/**
+ * Default value for the way node index id shift. Currently the default is 0,
+ * making osm2pgsql backwards compatible to earlier versions.
+ */
+constexpr uint8_t const default_way_node_index_id_shift = 0;
+
 /**
  * Structure for storing command-line and other options
  */
@@ -130,7 +136,16 @@ class options_t
 
     std::vector<std::string> input_files;
 
+    /**
+     * How many bits should the node id be shifted for the way node index?
+     * Use 0 to disable for backwards compatibility.
+     */
+    uint8_t way_node_index_id_shift = 0;
+
 private:
+    /// Set advanced options from environment
+    void get_options_from_env();
+
     /**
      * Check input options for sanity
      */
