Commit b77bfbc

madtibo authored and musically-ut committed
Allows downloading/loading a complete StackExchange project (#9)
Using the '-s' switch, download the compressed file from _https://ia800107.us.archive.org/27/items/stackexchange/_, then, uncompress it and load all the files in the database. Add a '-n' switch to move the tables to a given schema.
1 parent 6911201 commit b77bfbc

3 files changed: +240 -80 lines changed

README.md

Lines changed: 29 additions & 10 deletions
@@ -7,7 +7,8 @@ Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexch
 ## Dependencies

 - [`lxml`](http://lxml.de/installation.html)
-- [`psychopg2`](http://initd.org/psycopg/docs/install.html)
+- [`psycopg2`](http://initd.org/psycopg/docs/install.html)
+- [`libarchive-c`](https://pypi.org/project/libarchive-c/)

 ## Usage

@@ -18,14 +19,14 @@ Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexch
 `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`.
 - In some old dumps, the cases in the filenames are different.
 - Execute in the current folder (in parallel, if desired):
-- `python load_into_pg.py Badges`
-- `python load_into_pg.py Posts`
-- `python load_into_pg.py Tags` (not present in earliest dumps)
-- `python load_into_pg.py Users`
-- `python load_into_pg.py Votes`
-- `python load_into_pg.py PostLinks`
-- `python load_into_pg.py PostHistory`
-- `python load_into_pg.py Comments`
+- `python load_into_pg.py -t Badges`
+- `python load_into_pg.py -t Posts`
+- `python load_into_pg.py -t Tags` (not present in earliest dumps)
+- `python load_into_pg.py -t Users`
+- `python load_into_pg.py -t Votes`
+- `python load_into_pg.py -t PostLinks`
+- `python load_into_pg.py -t PostHistory`
+- `python load_into_pg.py -t Comments`
 - Finally, after all the initial tables have been created:
 - `psql stackoverflow < ./sql/final_post.sql`
 - If you used a different database name, make sure to use that instead of
@@ -34,7 +35,25 @@ Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexch
 - `psql stackoverflow < ./sql/optional_post.sql`
 - Again, remember to user the correct database name here, if not `stackoverflow`.

-## Caveats
+## Loading a complete stackexchange project
+
+You can use the script to download a given stackexchange compressed file from
+[archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load
+all the tables at once, using the `-s` switch.
+
+You will need the `urllib` and `libarchive` modules.
+
+If you give a schema name using the `-n` switch, all the tables will be moved
+to the given schema. This schema will be created in the script.
+
+To load the _dba.stackexchange.com_ project in the `dba` schema, you would execute:
+`./load_into_pg.py -s dba -n dba`
+
+The paths are not changed in the final scripts `sql/final_post.sql` and
+`sql/optional_post.sql`. To run them, first set the _search_path_ to your
+schema name: `SET search_path TO <myschema>;`
+
+## Caveats and TODOs

 - It prepares some indexes and views which may not be necessary for your analysis.
 - The `Body` field in `Posts` table is NOT populated by default. You have to use `--with-post-body` argument to include it.
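
The `-s` workflow documented in this diff (download the compressed file, uncompress it, then load the files) could be sketched roughly as follows. This is not the code from the commit: the helper names `archive_name`, `archive_url` and `download_and_extract` are hypothetical, and the `<project>.stackexchange.com.7z` naming scheme is an assumption about how the dump is laid out on archive.org. The only library calls used are `urllib.request.urlretrieve` from the standard library and `libarchive.extract_file` from `libarchive-c`.

```python
import os
import urllib.request

# Base URL as linked in the README; the archive naming below is an assumption.
BASE_URL = "https://ia800107.us.archive.org/27/items/stackexchange/"


def archive_name(project):
    """Return the assumed archive file name for a project such as 'dba'."""
    return "{0}.stackexchange.com.7z".format(project)


def archive_url(project):
    """Return the assumed download URL for the project's archive."""
    return BASE_URL + archive_name(project)


def download_and_extract(project, workdir):
    """Download the archive into workdir and unpack it there.

    Needs network access and the libarchive-c package. Returns the sorted
    list of XML file names found in workdir after extraction.
    """
    import libarchive  # imported lazily so the helpers above work without it

    workdir = os.path.abspath(workdir)
    os.makedirs(workdir, exist_ok=True)
    target = os.path.join(workdir, archive_name(project))
    urllib.request.urlretrieve(archive_url(project), target)

    # extract_file unpacks into the current directory, so switch temporarily.
    previous = os.getcwd()
    os.chdir(workdir)
    try:
        libarchive.extract_file(target)
    finally:
        os.chdir(previous)
    return sorted(f for f in os.listdir(workdir) if f.lower().endswith(".xml"))
```

Calling `download_and_extract("dba", "./dba_dump")` would, under these assumptions, leave the XML files in `./dba_dump`, ready to be loaded table by table.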

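The effect of the `-n` switch (create the schema, move the tables into it) and the `search_path` step needed before running the final scripts can be illustrated with the SQL such a step might issue. The sketch below only builds the statements as strings; the function names are hypothetical, and it assumes the tables were created with unquoted names matching the table list in the Usage section, which may not match what the script actually does.

```python
# Table names as listed in the README's Usage section.
TABLES = ("Badges", "Posts", "Tags", "Users", "Votes",
          "PostLinks", "PostHistory", "Comments")


def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def move_to_schema_sql(schema, tables=TABLES):
    """Return statements that create the schema and move each table into it."""
    statements = ["CREATE SCHEMA IF NOT EXISTS {0};".format(quote_ident(schema))]
    for table in tables:
        # Table names are left unquoted (assumption: they were created
        # unquoted), so only plain identifiers are accepted here.
        if not table.isidentifier():
            raise ValueError("unexpected table name: {0!r}".format(table))
        statements.append(
            "ALTER TABLE {0} SET SCHEMA {1};".format(table, quote_ident(schema))
        )
    return statements


def search_path_sql(schema):
    """Return the statement to run before final_post.sql / optional_post.sql."""
    return "SET search_path TO {0};".format(quote_ident(schema))


if __name__ == "__main__":
    for statement in move_to_schema_sql("dba"):
        print(statement)
    print(search_path_sql("dba"))
```

The statements could then be executed through a `psycopg2` cursor or pasted into `psql`; keeping them as plain strings here avoids any dependency on a running database.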