
Commit b77bfbc

madtibo authored and musically-ut committed May 2, 2019
Allows downloading/loading a complete StackExchange project (#9)
Using the '-s' switch, the script downloads the compressed dump from _https://ia800107.us.archive.org/27/items/stackexchange/_, uncompresses it, and loads all the files into the database. A new '-n' switch moves the tables to a given schema.
1 parent 6911201 · commit b77bfbc
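
Note: in practice the new mode is driven entirely from the command line. As the updated README below shows, `./load_into_pg.py -s dba -n dba` fetches `dba.stackexchange.com.7z` from the archive URL, extracts it, loads every table, and then moves the tables into a `dba` schema (the project name is just an example). The only new third-party dependency is `libarchive-c`, the PyPI package linked in the README, which in turn needs the system `libarchive` library.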

File tree: 3 files changed (+240, -80 lines)


README.md (+29, -10)
@@ -7,7 +7,8 @@ Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexch
 ## Dependencies
 
 - [`lxml`](http://lxml.de/installation.html)
-- [`psychopg2`](http://initd.org/psycopg/docs/install.html)
+- [`psycopg2`](http://initd.org/psycopg/docs/install.html)
+- [`libarchive-c`](https://pypi.org/project/libarchive-c/)
 
 ## Usage
 
@@ -18,14 +19,14 @@ Schema hints are taken from [a post on Meta.StackExch
   `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`.
 - In some old dumps, the cases in the filenames are different.
 - Execute in the current folder (in parallel, if desired):
-  - `python load_into_pg.py Badges`
-  - `python load_into_pg.py Posts`
-  - `python load_into_pg.py Tags` (not present in earliest dumps)
-  - `python load_into_pg.py Users`
-  - `python load_into_pg.py Votes`
-  - `python load_into_pg.py PostLinks`
-  - `python load_into_pg.py PostHistory`
-  - `python load_into_pg.py Comments`
+  - `python load_into_pg.py -t Badges`
+  - `python load_into_pg.py -t Posts`
+  - `python load_into_pg.py -t Tags` (not present in earliest dumps)
+  - `python load_into_pg.py -t Users`
+  - `python load_into_pg.py -t Votes`
+  - `python load_into_pg.py -t PostLinks`
+  - `python load_into_pg.py -t PostHistory`
+  - `python load_into_pg.py -t Comments`
 - Finally, after all the initial tables have been created:
   - `psql stackoverflow < ./sql/final_post.sql`
 - If you used a different database name, make sure to use that instead of
@@ -34,7 +35,25 @@ Schema hints are taken from [a post on Meta.StackExch
   - `psql stackoverflow < ./sql/optional_post.sql`
 - Again, remember to use the correct database name here, if not `stackoverflow`.
 
-## Caveats
+## Loading a complete stackexchange project
+
+You can use the script to download a given stackexchange compressed file from
+[archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load
+all the tables at once, using the `-s` switch.
+
+You will need the `urllib` and `libarchive` modules.
+
+If you give a schema name using the `-n` switch, all the tables will be moved
+to the given schema. This schema will be created by the script.
+
+To load the _dba.stackexchange.com_ project into the `dba` schema, you would execute:
+`./load_into_pg.py -s dba -n dba`
+
+The paths are not changed in the final scripts `sql/final_post.sql` and
+`sql/optional_post.sql`. To run them, first set the _search_path_ to your
+schema name: `SET search_path TO <myschema>;`
+
+## Caveats and TODOs
 
 - It prepares some indexes and views which may not be necessary for your analysis.
 - The `Body` field in `Posts` table is NOT populated by default. You have to use `--with-post-body` argument to include it.
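
Once a project has been loaded into its own schema this way, a quick sanity check is to query a table by its schema-qualified name, for example `SELECT COUNT(*) FROM dba.posts;`, or to run `SET search_path TO dba;` first and query the tables unqualified, exactly as the new README section describes for the final SQL scripts (the `dba` name is only an example).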

load_into_pg.py (+210, -69)
@@ -1,8 +1,10 @@
 #!/usr/bin/env python
+
 import sys
 import time
 import argparse
 import psycopg2 as pg
+import os
 import row_processor as Processor
 import six
 import json
@@ -12,20 +14,71 @@
     ('Posts', 'ViewCount'): "NULLIF(%(ViewCount)s, '')::int"
 }
 
+# part of the file already downloaded
+file_part = None
+
+
+def show_progress(block_num, block_size, total_size):
+    """Display the total size of the file to download and the progress in percent"""
+    global file_part
+    if file_part is None:
+        suffixes = ['B', 'KB', 'MB', 'GB', 'TB']
+        suffixIndex = 0
+        pp_size = total_size
+        while pp_size > 1024:
+            suffixIndex += 1  # Increment the index of the suffix
+            pp_size = pp_size / 1024.0  # Apply the division
+        six.print_('Total file size is: {0:.1f} {1}'
+                   .format(pp_size, suffixes[suffixIndex]))
+        six.print_("0 % of the file downloaded ...\r", end="", flush=True)
+        file_part = 0
+
+    downloaded = block_num * block_size
+    if downloaded < total_size:
+        percent = 100 * downloaded / total_size
+        if percent - file_part > 1:
+            file_part = percent
+            six.print_("{0} % of the file downloaded ...\r".format(int(percent)), end="", flush=True)
+    else:
+        file_part = None
+        six.print_("")
+
+
+def buildConnectionString(dbname, mbHost, mbPort, mbUsername, mbPassword):
+    dbConnectionParam = "dbname={}".format(dbname)
+
+    if mbPort is not None:
+        dbConnectionParam += ' port={}'.format(mbPort)
+
+    if mbHost is not None:
+        dbConnectionParam += ' host={}'.format(mbHost)
+
+    # TODO Is the escaping done here correct?
+    if mbUsername is not None:
+        dbConnectionParam += ' user={}'.format(mbUsername)
+
+    # TODO Is the escaping done here correct?
+    if mbPassword is not None:
+        dbConnectionParam += ' password={}'.format(mbPassword)
+    return dbConnectionParam
+
+
 def _makeDefValues(keys):
     """Returns a dictionary containing None for all keys."""
-    return dict(( (k, None) for k in keys ))
+    return dict(((k, None) for k in keys))
+
 
 def _createMogrificationTemplate(table, keys, insertJson):
     """Return the template string for mogrification for the given keys."""
-    table_keys = ', '.join( [ '%(' + k + ')s' if (table, k) not in specialRules
-                                else specialRules[table, k]
-                              for k in keys ])
+    table_keys = ', '.join(['%(' + k + ')s' if (table, k) not in specialRules
+                            else specialRules[table, k]
+                            for k in keys])
     if insertJson:
         return ('(' + table_keys + ', %(jsonfield)s' + ')')
     else:
         return ('(' + table_keys + ')')
 
+
 def _createCmdTuple(cursor, keys, templ, attribs, insertJson):
     """Use the cursor to mogrify a tuple of data.
     The passed data in `attribs` is augmented with default data (NULLs) and the
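
A note on the two helpers introduced in this hunk: `show_progress` follows the standard `urlretrieve` reporthook contract (it is called after each block with `block_num`, `block_size` and `total_size`), and `buildConnectionString` centralises the libpq keyword string that was previously assembled inline in `handleTable` (removed further down). A minimal sketch of how the hook ends up being driven, assuming the definitions above and an example project name:

```python
from six.moves.urllib.request import urlretrieve

# 'dba' is only an illustrative project; any dump on the archive works the same way.
url = 'https://ia800107.us.archive.org/27/items/stackexchange/dba.stackexchange.com.7z'

# urlretrieve invokes show_progress(block_num, block_size, total_size) after every
# block it writes, which produces the total-size line and the percentage updates.
urlretrieve(url, 'dba.stackexchange.com.7z', show_progress)
```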
@@ -37,14 +90,14 @@ def _createCmdTuple(cursor, keys, templ, attribs, insertJson):
     defs.update(attribs)
 
     if insertJson:
-        dict_attribs = { }
+        dict_attribs = {}
         for name, value in attribs.items():
             dict_attribs[name] = value
         defs['jsonfield'] = json.dumps(dict_attribs)
 
-    values_to_insert = cursor.mogrify(templ, defs)
     return cursor.mogrify(templ, defs)
 
+
 def _getTableKeys(table):
     """Return an array of the keys for a given table"""
     keys = None
@@ -131,26 +184,27 @@ def _getTableKeys(table):
         ]
     elif table == 'PostHistory':
         keys = [
-            'Id',
-            'PostHistoryTypeId',
-            'PostId',
-            'RevisionGUID',
-            'CreationDate',
-            'UserId',
-            'Text'
+            'Id'
+            , 'PostHistoryTypeId'
+            , 'PostId'
+            , 'RevisionGUID'
+            , 'CreationDate'
+            , 'UserId'
+            , 'Text'
         ]
     elif table == 'Comments':
         keys = [
-            'Id',
-            'PostId',
-            'Score',
-            'Text',
-            'CreationDate',
-            'UserId',
+            'Id'
+            , 'PostId'
+            , 'Score'
+            , 'Text'
+            , 'CreationDate'
+            , 'UserId'
        ]
     return keys
 
-def handleTable(table, insertJson, createFk, dbname, mbDbFile, mbHost, mbPort, mbUsername, mbPassword):
+
+def handleTable(table, insertJson, createFk, mbDbFile, dbConnectionParam):
     """Handle the table including the post/pre processing."""
     keys = _getTableKeys(table)
     dbFile = mbDbFile if mbDbFile is not None else table + '.xml'
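
With the new signature, callers hand `handleTable` a ready-made connection string instead of the individual connection parameters; the call sites added at the bottom of this file look like `handleTable(table, args.insert_json, args.foreign_keys, args.file, dbConnectionParam)`, where `dbConnectionParam` is produced once by `buildConnectionString`.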
@@ -165,23 +219,6 @@ def handleTable(table, insertJson, createFk, dbname, mbDbFile, mbHost, mbPort, m
         six.print_("Could not load pre/post/fk sql. Are you running from the correct path?", file=sys.stderr)
         sys.exit(-1)
 
-    dbConnectionParam = "dbname={}".format(dbname)
-
-    if mbPort is not None:
-        dbConnectionParam += ' port={}'.format(mbPort)
-
-    if mbHost is not None:
-        dbConnectionParam += ' host={}'.format(mbHost)
-
-    # TODO Is the escaping done here correct?
-    if mbUsername is not None:
-        dbConnectionParam += ' user={}'.format(mbUsername)
-
-    # TODO Is the escaping done here correct?
-    if mbPassword is not None:
-        dbConnectionParam += ' password={}'.format(mbPassword)
-
-
     try:
         with pg.connect(dbConnectionParam) as conn:
             with conn.cursor() as cur:
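
The block removed in this hunk is exactly what the new `buildConnectionString` helper now returns, so the same string can be shared by `handleTable`, `moveTableToSchema` and the project-loading path. For instance (illustrative values), `buildConnectionString('stackoverflow', 'localhost', None, 'postgres', None)` yields `dbname=stackoverflow host=localhost user=postgres`.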
@@ -199,32 +236,32 @@ def handleTable(table, insertJson, createFk, dbname, mbDbFile, mbHost, mbPort, m
                 six.print_('Processing data ...')
                 for rows in Processor.batch(Processor.parse(xml), 500):
                     valuesStr = ',\n'.join(
-                        [ _createCmdTuple(cur, keys, tmpl, row_attribs, insertJson).decode('utf-8')
-                          for row_attribs in rows
-                        ]
-                    )
+                        [_createCmdTuple(cur, keys, tmpl, row_attribs, insertJson).decode('utf-8')
+                         for row_attribs in rows
+                         ]
+                    )
                     if len(valuesStr) > 0:
                         cmd = 'INSERT INTO ' + table + \
                               ' VALUES\n' + valuesStr + ';'
                         cur.execute(cmd)
                         conn.commit()
-                six.print_('Table {0} processing took {1:.1f} seconds'.format(table, time.time() - start_time))
+                six.print_('Table \'{0}\' processing took {1:.1f} seconds'.format(table, time.time() - start_time))
 
                 # Post-processing (creation of indexes)
                 start_time = time.time()
                 six.print_('Post processing ...')
                 if post != '':
                     cur.execute(post)
                     conn.commit()
-                six.print_('Post processing took {} seconds'.format(time.time() - start_time))
+                six.print_('Post processing took {0:.1f} seconds'.format(time.time() - start_time))
                 if createFk:
                     # fk-processing (creation of foreign keys)
                     start_time = time.time()
-                    six.print_('fk processing ...')
+                    six.print_('Foreign Key processing ...')
                     if post != '':
                         cur.execute(fk)
                         conn.commit()
-                    six.print_('fk processing took {} seconds'.format(time.time() - start_time))
+                    six.print_('Foreign Key processing took {0:.1f} seconds'.format(time.time() - start_time))
 
     except IOError as e:
         six.print_("Could not read from file {}.".format(dbFile), file=sys.stderr)
@@ -237,80 +274,184 @@ def handleTable(table, insertJson, createFk, dbname, mbDbFile, mbHost, mbPort, m
         six.print_("Warning from the database.", file=sys.stderr)
         six.print_("pg.Warning: {0}".format(str(w)), file=sys.stderr)
 
+
+def moveTableToSchema(table, schemaName, dbConnectionParam):
+    try:
+        with pg.connect(dbConnectionParam) as conn:
+            with conn.cursor() as cur:
+                # create the schema
+                cur.execute('CREATE SCHEMA IF NOT EXISTS ' + schemaName + ';')
+                conn.commit()
+                # move the table to the right schema
+                cur.execute('ALTER TABLE ' + table + ' SET SCHEMA ' + schemaName + ';')
+                conn.commit()
+    except pg.Error as e:
+        six.print_("Error in dealing with the database.", file=sys.stderr)
+        six.print_("pg.Error ({0}): {1}".format(e.pgcode, e.pgerror), file=sys.stderr)
+        six.print_(str(e), file=sys.stderr)
+    except pg.Warning as w:
+        six.print_("Warning from the database.", file=sys.stderr)
+        six.print_("pg.Warning: {0}".format(str(w)), file=sys.stderr)
+
 #############################################################
 
+
 parser = argparse.ArgumentParser()
-parser.add_argument( 'table'
+parser.add_argument('-t', '--table'
                    , help = 'The table to work on.'
                    , choices = ['Users', 'Badges', 'Posts', 'Tags', 'Votes', 'PostLinks', 'PostHistory', 'Comments']
+                   , default = None
                    )
 
-parser.add_argument( '-d', '--dbname'
+parser.add_argument('-d', '--dbname'
                    , help = 'Name of database to create the table in. The database must exist.'
                    , default = 'stackoverflow'
                    )
 
-parser.add_argument( '-f', '--file'
+parser.add_argument('-f', '--file'
                    , help = 'Name of the file to extract data from.'
                    , default = None
                    )
 
-parser.add_argument( '-u', '--username'
+parser.add_argument('-s', '--so-project'
+                   , help = 'StackExchange project to load.'
+                   , default = None
+                   )
+
+parser.add_argument('--archive-url'
+                   , help = 'URL of the archive directory to retrieve.'
+                   , default = 'https://ia800107.us.archive.org/27/items/stackexchange'
+                   )
+
+parser.add_argument('-k', '--keep-archive'
+                   , help = 'Will preserve the downloaded archive instead of deleting it.'
+                   , action = 'store_true'
+                   , default = False
+                   )
+
+parser.add_argument('-u', '--username'
                    , help = 'Username for the database.'
                    , default = None
                    )
 
-parser.add_argument( '-p', '--password'
+parser.add_argument('-p', '--password'
                    , help = 'Password for the database.'
                    , default = None
                    )
 
-parser.add_argument( '-P', '--port'
+parser.add_argument('-P', '--port'
                    , help = 'Port to connect with the database on.'
                    , default = None
                    )
 
-parser.add_argument( '-H', '--host'
+parser.add_argument('-H', '--host'
                    , help = 'Hostname for the database.'
                    , default = None
                    )
 
-parser.add_argument( '--with-post-body'
-                   , help = 'Import the posts with the post body. Only used if importing Posts.xml'
-                   , action = 'store_true'
+parser.add_argument('--with-post-body'
+                   , help = 'Import the posts with the post body. Only used if importing Posts.xml'
+                   , action = 'store_true'
                    , default = False
                    )
 
-parser.add_argument( '-j', '--insert-json'
+parser.add_argument('-j', '--insert-json'
                    , help = 'Insert raw data as JSON.'
-                   , action = 'store_true'
+                   , action = 'store_true'
                    , default = False
                    )
 
-parser.add_argument( '--foreign-keys'
+parser.add_argument('-n', '--schema-name'
+                   , help = 'Use specific schema.'
+                   , default = 'public'
+                   )
+
+parser.add_argument('--foreign-keys'
                    , help = 'Create foreign keys.'
-                   , action = 'store_true'
+                   , action = 'store_true'
                    , default = False
                    )
 
 args = parser.parse_args()
 
-table = args.table
-
 try:
     # Python 2/3 compatibility
     input = raw_input
 except NameError:
     pass
 
+dbConnectionParam = buildConnectionString(args.dbname, args.host, args.port, args.username, args.password)
+
+# load given file in table
+if args.file and args.table:
+    table = args.table
 
-if table == 'Posts':
-    # If the user has not explicitly asked for loading the body, we replace it with NULL
-    if not args.with_post_body:
-        specialRules[('Posts', 'Body')] = 'NULL'
+    if table == 'Posts':
+        # If the user has not explicitly asked for loading the body, we replace it with NULL
+        if not args.with_post_body:
+            specialRules[('Posts', 'Body')] = 'NULL'
+
+    choice = input('This will drop the {} table. Are you sure [y/n]?'.format(table))
+    if len(choice) > 0 and choice[0].lower() == 'y':
+        handleTable(table, args.insert_json, args.foreign_keys, args.file, dbConnectionParam)
+    else:
+        six.print_("Cancelled.")
+    if args.schema_name != 'public':
+        moveTableToSchema(table, args.schema_name, dbConnectionParam)
+    exit(0)
+
+# load a project
+elif args.so_project:
+    import libarchive
+    import tempfile
+
+    filepath = None
+    temp_dir = None
+    if args.file:
+        filepath = args.file
+        url = filepath
+    else:
+        # download the 7z archive in tempdir
+        file_name = args.so_project + '.stackexchange.com.7z'
+        url = '{0}/{1}'.format(args.archive_url, file_name)
+        temp_dir = tempfile.mkdtemp(prefix='so_')
+        filepath = os.path.join(temp_dir, file_name)
+        six.print_('Downloading the archive in {0}'.format(filepath))
+        six.print_('please be patient ...')
+        try:
+            six.moves.urllib.request.urlretrieve(url, filepath, show_progress)
+        except Exception as e:
+            six.print_('Error: impossible to download the {0} archive ({1})'.format(url, e))
+            exit(1)
+
+    try:
+        libarchive.extract_file(filepath)
+    except Exception as e:
+        six.print_('Error: impossible to extract the {0} archive ({1})'.format(url, e))
+        exit(1)
+
+    tables = ['Tags', 'Users', 'Badges', 'Posts', 'Comments',
+              'Votes', 'PostLinks', 'PostHistory']
+
+    for table in tables:
+        six.print_('Load {0}.xml file'.format(table))
+        handleTable(table, args.insert_json, args.foreign_keys, None, dbConnectionParam)
+        # remove file
+        os.remove(table + '.xml')
+
+    if not args.keep_archive:
+        os.remove(filepath)
+        if temp_dir:
+            # remove the archive and the temporary directory
+            os.rmdir(temp_dir)
+        else:
+            six.print_("Archive '{0}' deleted".format(filepath))
+
+    if args.schema_name != 'public':
+        for table in tables:
+            moveTableToSchema(table, args.schema_name, dbConnectionParam)
+    exit(0)
 
-choice = input('This will drop the {} table. Are you sure [y/n]? '.format(table))
-if len(choice) > 0 and choice[0].lower() == 'y':
-    handleTable(table, args.insert_json, args.foreign_keys, args.dbname, args.file, args.host, args.port, args.username, args.password)
 else:
-    six.print_("Cancelled.")
+    six.print_("Error: you must either use '-f' and '-t' arguments or the '-s' argument.")
+    parser.print_help()
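
Two details of the project-loading branch worth noting: `libarchive.extract_file` unpacks the archive into the current working directory, which is why the loop can read each `<Table>.xml` in place and `os.remove` it afterwards, and the optional schema move is plain string-built DDL, essentially `CREATE SCHEMA IF NOT EXISTS dba;` followed by `ALTER TABLE Posts SET SCHEMA dba;` per table (schema and table names illustrative).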

sql/Votes_pre.sql (+1, -1)
@@ -1,7 +1,7 @@
 DROP TABLE IF EXISTS Votes CASCADE;
 CREATE TABLE Votes (
     Id int PRIMARY KEY ,
-    PostId int not NULL ,
+    PostId int , -- not NULL ,
     VoteTypeId int not NULL ,
     UserId int ,
     CreationDate timestamp not NULL ,
