download and load a complete stackexchange project #9

Merged on May 2, 2019 (9 commits)
39 changes: 29 additions & 10 deletions README.md
@@ -7,7 +7,8 @@ Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexch
## Dependencies

- [`lxml`](http://lxml.de/installation.html)
- [`psychopg2`](http://initd.org/psycopg/docs/install.html)
- [`psycopg2`](http://initd.org/psycopg/docs/install.html)
- [`libarchive-c`](https://pypi.org/project/libarchive-c/)

## Usage

@@ -18,14 +19,14 @@ Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexch
`Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`.
- In some old dumps, the filenames use different letter case.
- Execute in the current folder (in parallel, if desired):
- `python load_into_pg.py Badges`
- `python load_into_pg.py Posts`
- `python load_into_pg.py Tags` (not present in earliest dumps)
- `python load_into_pg.py Users`
- `python load_into_pg.py Votes`
- `python load_into_pg.py PostLinks`
- `python load_into_pg.py PostHistory`
- `python load_into_pg.py Comments`
- `python load_into_pg.py -t Badges`
- `python load_into_pg.py -t Posts`
- `python load_into_pg.py -t Tags` (not present in earliest dumps)
- `python load_into_pg.py -t Users`
- `python load_into_pg.py -t Votes`
- `python load_into_pg.py -t PostLinks`
- `python load_into_pg.py -t PostHistory`
- `python load_into_pg.py -t Comments`
- Finally, after all the initial tables have been created:
- `psql stackoverflow < ./sql/final_post.sql`
- If you used a different database name, make sure to use that instead of
@@ -34,7 +35,25 @@ Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexch
- `psql stackoverflow < ./sql/optional_post.sql`
- Again, remember to use the correct database name here, if not `stackoverflow`.
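
The per-table commands above can also be launched from a small wrapper. The following is only a sketch, not part of the repository: it assumes `load_into_pg.py` sits in the current folder and accepts the `-t` switch as shown in the list, and the helper names `build_command` and `load_all` are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Table names taken from the command list above.
TABLES = ["Badges", "Posts", "Tags", "Users", "Votes",
          "PostLinks", "PostHistory", "Comments"]


def build_command(table, script="load_into_pg.py"):
    """Return the argv list for loading one table with the -t switch."""
    return [sys.executable, script, "-t", table]


def load_all(tables=TABLES, max_workers=4):
    """Run one loader process per table, up to max_workers at a time.

    Returns a dict mapping each table name to the loader's exit code.
    """
    def run(table):
        return table, subprocess.run(build_command(table)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, tables))
```

Calling `load_all()` returns the exit code for each table, so a failed load (for example `Tags` in the earliest dumps) can be spotted before running the final SQL scripts.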

## Caveats
## Loading a complete stackexchange project

With the `-s` switch, the script downloads the compressed dump of a given
Stack Exchange project from
[archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and
loads all of its tables at once.

You will need the `urllib` module (part of the Python standard library) and the
`libarchive` module (provided by the `libarchive-c` package).

If you pass a schema name with the `-n` switch, all the tables are moved into
that schema, which the script creates for you.

To load the _dba.stackexchange.com_ project in the `dba` schema, you would execute:
`./load_into_pg.py -s dba -n dba`
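
For readers curious about what the download step involves, here is a rough sketch using the two modules named above. It is not the script's actual code: the archive naming scheme `<project>.stackexchange.com.7z` is an assumption, and the helpers `archive_url` and `download_and_extract` are hypothetical.

```python
import urllib.request

# Base URL quoted in the text above.
BASE_URL = "https://ia800107.us.archive.org/27/items/stackexchange/"


def archive_url(project, base_url=BASE_URL):
    """Build the download URL for a project such as 'dba'.

    Assumption: archives are named '<project>.stackexchange.com.7z'.
    """
    return "{}{}.stackexchange.com.7z".format(base_url, project)


def download_and_extract(project, dest=None):
    """Download the archive, then unpack it into the current directory."""
    import libarchive  # provided by libarchive-c; imported lazily

    dest = dest or "{}.stackexchange.com.7z".format(project)
    urllib.request.urlretrieve(archive_url(project), dest)
    libarchive.extract_file(dest)
    return dest
```

Under these assumptions, `download_and_extract("dba")` would leave the project's XML files in the current folder, ready for the per-table loads.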

The final scripts `sql/final_post.sql` and `sql/optional_post.sql` are not
adjusted for the schema. To run them, first set the `search_path` to your
schema name: `SET search_path TO <myschema>;`
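
One way to apply this without editing the SQL files is to prepend the `SET search_path` statement before piping a script to `psql`. This is a minimal sketch under stated assumptions: `psql` is on the `PATH`, the database is named `stackoverflow` unless overridden, and the helpers `with_search_path` and `run_sql_file` are hypothetical.

```python
import re
import subprocess


def with_search_path(sql_text, schema):
    """Prefix SQL text with a SET search_path statement for the schema."""
    # Accept only plain identifiers so the name can be embedded safely.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", schema):
        raise ValueError("unexpected schema name: %r" % schema)
    return "SET search_path TO {};\n{}".format(schema, sql_text)


def run_sql_file(path, schema, database="stackoverflow"):
    """Feed the prefixed script to psql on standard input."""
    with open(path) as handle:
        sql = with_search_path(handle.read(), schema)
    return subprocess.run(["psql", database], input=sql, text=True).returncode
```

For the `dba` example, `run_sql_file("./sql/final_post.sql", "dba")` would run the final script against the tables in the `dba` schema.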

## Caveats and TODOs

- It prepares some indexes and views which may not be necessary for your analysis.
- The `Body` field in the `Posts` table is NOT populated by default. Pass the `--with-post-body` argument to include it.