Windows support for decompressing dumps #13

Open
he7d3r opened this issue Oct 14, 2014 · 1 comment

@he7d3r
Contributor

he7d3r commented Oct 14, 2014

When running this

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# Example:
# python TEST.py ocwikibooks-20140928-pages-meta-history.xml.bz2
from mw import xml_dump
import sys

def rev_info(dump, path):
    for page in dump:
        yield page.title

def run(dump):
    for title in xml_dump.map(dump, rev_info):
        print(title)
    print('Done.')

if __name__ == "__main__":
    run([sys.argv[1]])

on Windows, the result is the following:

C:\Users\Diego\Desktop\TEST> python aaa.py ocwikibooks-20140928-pages-articles.xml.bz2
Traceback (most recent call last):
  File "c:\Program Files\Python32\lib\pickle.py", line 683, in save_global
    klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'dec'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "aaa.py", line 18, in <module>
    run([sys.argv[1]])
  File "aaa.py", line 13, in run
    for title in xml_dump.map(dump, rev_info):
  File "c:\Program Files\Python32\lib\site-packages\mediawiki_utilities-0.4.1-py3.2.egg\mw\xml_dump\map.py", line 72, in map
    processor.start()
  File "c:\Program Files\Python32\lib\multiprocessing\process.py", line 132, in start
    self._popen = Popen(self)
  File "c:\Program Files\Python32\lib\multiprocessing\forking.py", line 266, in __init__
    dump(process_obj, to_child, HIGHEST_PROTOCOL)
  File "c:\Program Files\Python32\lib\multiprocessing\forking.py", line 188, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "c:\Program Files\Python32\lib\pickle.py", line 237, in dump
    self.save(obj)
  File "c:\Program Files\Python32\lib\pickle.py", line 344, in save
    self.save_reduce(obj=obj, *rv)
  File "c:\Program Files\Python32\lib\pickle.py", line 432, in save_reduce
    save(state)
  File "c:\Program Files\Python32\lib\pickle.py", line 299, in save
    f(self, obj) # Call unbound method with explicit self
  File "c:\Program Files\Python32\lib\pickle.py", line 627, in save_dict
    self._batch_setitems(obj.items())
  File "c:\Program Files\Python32\lib\pickle.py", line 660, in _batch_setitems
    save(v)
  File "c:\Program Files\Python32\lib\pickle.py", line 299, in save
    f(self, obj) # Call unbound method with explicit self
  File "c:\Program Files\Python32\lib\pickle.py", line 687, in save_global
    (obj, module, name))
_pickle.PicklingError: Can't pickle <function dec at 0x0000000002812F48>: it's not found as mw.xml_dump.map.dec

C:\Users\Diego\Desktop\TEST>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "c:\Program Files\Python32\lib\multiprocessing\forking.py", line 369, in main
    self = load(from_parent)
EOFError
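
The PicklingError above fits how multiprocessing behaves on Windows: with no fork available, the child process is started fresh and the Process object is pickled and sent to it. pickle stores functions by their importable name, so a function created inside another function cannot be serialized. The traceback suggests `dec` is such a function, though that is an inference from the error message and not something checked against the mw source. A minimal sketch of the behaviour, independent of mw (all names below are hypothetical):

```python
import pickle


def top_level(x):
    # Defined at module level, so pickle can refer to it by name.
    return x + 1


def make_nested():
    # Hypothetical stand-in for a helper that builds a function at call time.
    def dec(x):
        return x + 1
    return dec


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Which exception is raised for local functions depends on the
        # Python version, so both are caught here.
        return False


if __name__ == "__main__":
    print("top-level function picklable:", can_pickle(top_level))
    print("nested function picklable:", can_pickle(make_nested()))
```

If this is what is happening, one possible fix would be to pass only module-level callables as the process target, so the same code works whether processes are forked or spawned.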
@halfak
Member

halfak commented Dec 6, 2014

I spent some time working on this today. Unfortunately, 7z support in Python libraries is poor. pylzma ships a module, py7zlib, that provides a way to examine a 7z archive. I've been trying to use it to decompress an XML dump, but it looks like it can't handle the massive file size (the limit appears to be 2GB).

Traceback (most recent call last):
  File "", line 4, in
  File "/home/halfak/.pyenv/versions/2.7/lib/python2.7/site-packages/py7zlib.py", line 576, in read
    data = getattr(self, decoder)(coder, data)
  File "/home/halfak/.pyenv/versions/2.7/lib/python2.7/site-packages/py7zlib.py", line 632, in _read_lzma
    dec = pylzma.decompressobj(maxlength=self._start+self.size)
OverflowError: signed integer is greater than maximum

This happens in both Python 2 and Python 3. There are some suggestions that you manage the file length yourself, e.g. fancycode/pylzma#3.
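
For illustration, the general idea of managing the length yourself (feeding the decompressor bounded chunks instead of handing it one huge size) can be sketched with the standard library's bz2 module. This is not the pylzma or py7zlib API and it does not read 7z archives; `iter_decompressed` and `chunk_size` are hypothetical names, and the sketch assumes a single-stream bz2 file.

```python
import bz2


def iter_decompressed(fileobj, chunk_size=64 * 1024):
    """Yield decompressed bytes from a bz2 file object, one chunk at a time.

    Assumes a single bz2 stream; anything after the first stream is ignored.
    """
    decompressor = bz2.BZ2Decompressor()
    while not decompressor.eof:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            # Input ended before the stream was complete.
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data


if __name__ == "__main__":
    import sys

    # Count decompressed bytes without holding the whole dump in memory.
    with open(sys.argv[1], "rb") as f:
        total = sum(len(block) for block in iter_decompressed(f))
    print(total)
```

Because only `chunk_size` bytes of compressed input are read per step, no single call has to be told the full size of the file.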

I'm going to drop this for today.
