download and load a complete stackexchange project #9

Merged
merged 9 commits into from
May 2, 2019

@madtibo madtibo commented Aug 16, 2018

This commit makes it possible, using the -s switch, to download the compressed file from archive.org, uncompress it, and load all the files into the database.
It also adds a '-n' switch to move the tables to a given schema.

WARNING: because this uses the urllib.request module, the script is now set to run with python3!
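
As a rough illustration of what moving tables to a given schema can look like in PostgreSQL, the sketch below only builds the SQL statements. The helper names (`quote_ident`, `move_tables_sql`) and the table names are assumptions made for this example; they are not taken from the PR's code.

```python
def quote_ident(name):
    """Double-quote a PostgreSQL identifier, escaping embedded double quotes.

    Note: quoted identifiers are case-sensitive in PostgreSQL.
    """
    return '"' + name.replace('"', '""') + '"'


def move_tables_sql(tables, schema):
    """Return statements that create `schema` and move `tables` into it.

    Hypothetical helper for illustration only.
    """
    statements = ["CREATE SCHEMA IF NOT EXISTS {};".format(quote_ident(schema))]
    for table in tables:
        statements.append(
            "ALTER TABLE {} SET SCHEMA {};".format(
                quote_ident(table), quote_ident(schema)
            )
        )
    return statements


if __name__ == "__main__":
    # Table names here are placeholders, not the script's actual table list.
    for statement in move_tables_sql(["posts", "users"], "emacs"):
        print(statement)
```

The statements would then be executed through whatever database connection the script already holds.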

Collaborator

@musically-ut musically-ut left a comment


Thanks for the PR!

This is a rather big change, so I'll have to run it to verify that it works. However, I don't quite see why we have to make Python 3 a requirement, especially since you are already using libarchive instead of the built-in lzma, and urllib is available via six.moves, which offers the same interface as urllib from Python 3.

I can make these minor changes myself when I do the merging.

Thanks again!
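
For readers unfamiliar with six, a minimal sketch of the suggestion follows. It imports `urlretrieve` through `six.moves` so the same line works on Python 2 and 3. The function names and the `<project>.stackexchange.com.7z` naming scheme are assumptions inferred from this thread, and the standard-library fallback exists only so the sketch runs where six is not installed.

```python
try:
    # six is a third-party package; six.moves.urllib mirrors Python 3's layout.
    from six.moves.urllib.request import urlretrieve
except ImportError:
    # Fallback so this sketch still runs on Python 3 without six installed.
    from urllib.request import urlretrieve

# Base URL quoted in this PR's commit message.
ARCHIVE_BASE = "https://ia800107.us.archive.org/27/items/stackexchange/"


def archive_url(project, base=ARCHIVE_BASE):
    """Build the download URL (assumed naming: <project>.stackexchange.com.7z)."""
    return "{}{}.stackexchange.com.7z".format(base, project)


def download_archive(project, destination):
    """Download the project archive to `destination` and return the local path."""
    path, _headers = urlretrieve(archive_url(project), destination)
    return path
```

Calling `download_archive("emacs", "/tmp/emacs.stackexchange.com.7z")` would perform the actual network request.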

@madtibo
Contributor Author

madtibo commented Aug 16, 2018

Great! I did not know that six could be used for that.
It would be splendid if you could make the change :-)

@madtibo
Contributor Author

madtibo commented Aug 16, 2018

Sorry, I was lazy and did not create a distinct PR for this feature.
We could work on it once PR #8 about foreign keys is done.

using the '-s' switch, download the compressed file from _https://ia800107.us.archive.org/27/items/stackexchange/_, then, uncompress it and load all the files in the database. Add a '-n' switch to move the tables to a given schema

WARNING: since using the urllib.request module, set the script to use python3
@madtibo
Contributor Author

madtibo commented Apr 1, 2019

Hello @musically-ut,

The "load complete project" MR is ready. I added a few options:

  • '-t' for the table name
  • '--archive-url' to specify a given archive directory
  • '-s' for the SO project name
  • '-k' to keep the downloaded project archive
  • '-f' can then be used to specify the archive file name
  • '-n' to specify a database schema

I tested several cases and found no problem:

./load_into_pg.py -k -s emacs
./load_into_pg.py -k -f /tmp/emacs.stackexchange.com.7z -d emacs
./load_into_pg.py -k -f /tmp/emacs.stackexchange.com.7z -s emacs -d emacs
time ./load_into_pg.py -k -f /tmp/emacs.stackexchange.com.7z -s emacs -d emacs -n emacs
./load_into_pg.py -k -f /tmp//emacs.stackexchange.com.7z -s emacs -d emacs -n json -j
./load_into_pg.py -k -f /tmp/emacs.stackexchange.com.7z -s emacs -d emacs -n foreign_keys --foreign-keys

Tell me what you think of it
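
The options listed in this comment could be declared with argparse roughly as follows. This is a sketch, not the PR's parser: apart from `--archive-url`, `--foreign-keys`, and the `so_project` attribute (which appears in the reviewed code), every long option name and destination is an assumption. The `-d`, `-j`, and `--foreign-keys` switches are taken from the tested commands, where `-d` takes a value and the other two do not.

```python
import argparse


def build_parser():
    """Build a parser covering the switches mentioned in this thread."""
    parser = argparse.ArgumentParser(description="Sketch of the listed options")
    parser.add_argument("-t", "--table", help="table name")
    parser.add_argument("--archive-url", help="archive directory to download from")
    parser.add_argument("-s", "--so-project", help="SO project name, e.g. emacs")
    parser.add_argument("-k", "--keep-archive", action="store_true",
                        help="keep the downloaded project archive")
    parser.add_argument("-f", "--file", help="archive file name")
    parser.add_argument("-n", "--schema-name", help="database schema")
    # The switches below appear only in the tested commands; their exact
    # meaning is not described in this thread.
    parser.add_argument("-d", "--dbname", help="assumed: database name")
    parser.add_argument("-j", dest="json_flag", action="store_true")
    parser.add_argument("--foreign-keys", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Parsing the argument list of the fourth tested command with `build_parser().parse_args([...])` yields a namespace whose `so_project`, `dbname`, and `schema_name` are all `emacs`.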

@musically-ut
Collaborator

Thank you for submitting this!

The code looks good and I don't see any immediate problems with it, but I still have to sit down and test all the options once (essentially the commands you gave in your last comment, thanks for that!).

I'll merge it soon.

@madtibo
Contributor Author

madtibo commented Apr 17, 2019

@musically-ut here is a commit using the tempfile library. I just get the temporary directory and store the file in it. Does that suit you?
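
A minimal sketch of the approach described here, assuming the archive is stored under the system temporary directory with a `<project>.stackexchange.com.7z` file name. The helper name and naming scheme are illustrative and not taken from the commit.

```python
import os
import tempfile


def temp_archive_path(project):
    """Return a path for the project archive inside the system temp directory."""
    # Assumed naming scheme, based on the file names used in this thread.
    filename = "{}.stackexchange.com.7z".format(project)
    # gettempdir() honours TMPDIR/TEMP/TMP, so nothing is hard-coded to /tmp.
    return os.path.join(tempfile.gettempdir(), filename)


if __name__ == "__main__":
    print(temp_archive_path("emacs"))
```

Using `tempfile.gettempdir()` rather than a literal `/tmp` keeps the script portable to systems where the temporary directory lives elsewhere.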


# load a project
elif args.so_project:
    import libarchive
Collaborator


Can you verify that you are using libarchive-c library instead of libarchive?

I will add this to the README.

Contributor Author


I am indeed using libarchive-c (in version 2.8).

@musically-ut musically-ut merged commit b77bfbc into Networks-Learning:master May 2, 2019
@musically-ut
Collaborator

Thanks for all the hard work! Merged! \o/

@madtibo
Contributor Author

madtibo commented May 3, 2019

It was really nice to work on this project.
Thank you for the help and the follow-up!

@madtibo madtibo deleted the load_full_project branch May 3, 2019 07:39