PolySpider - New Android Crawler
This project uses Python (v2.7) and Scrapy (v0.20.2). It is mainly designed to synchronize and categorize Android apps by crawling data from Android markets.
- Python v2.7+
- Scrapy v0.20+
- redis-py v2.9+
- Supervisor v3.0+
- Dependencies are listed in Installation
- Access key (ak), secret key (sk), and bucket name for BaiduYun or UpYun, used for file uploads
- Tested on Windows (both 32-bit and 64-bit) and CentOS 6.4 (64-bit)
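
As a quick sanity check before running the crawler, the Python packages listed above can be probed with a short script. This is only a sketch: the helper `missing_modules` is hypothetical and not part of PolySpider, and `scrapy`, `redis`, and `supervisor` are assumed to be the import names of Scrapy, redis-py, and Supervisor. It checks that the modules can be imported, not that their versions meet the minimums.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for Scrapy, redis-py, and Supervisor.
REQUIRED = ["scrapy", "redis", "supervisor"]
print("Missing modules: %s" % missing_modules(REQUIRED))
```

An empty list means every module was importable in the current environment.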
- master branch
  The latest stable version, which can currently be used in the Poly Project. NEVER commit code directly to the MASTER branch.
- develop branch
  The unstable development version. Team contributors should make code improvements in this branch. Once the project in the develop branch reaches a stable level and meets the product requirements, the MASTER branch will merge the pull request from the develop branch and a stable branch version will be released.
- stable branch
  We use numbers like `0.1`, `0.3`, and `1.0` to label stable releases of this project.
- gh-pages branch
  Static resources and the website page.
###Run Single Spider
- Step into the `PolySpider/src/` directory.
- Use the `scrapy list` command to find the spiders this project provides.
- Use the `scrapy crawl spidername` command to start a crawler. It crawls the target app market, records the crawled app information in a SQLite database, downloads the apk file, and parses it to get the info_list, including the package name, app name, etc. If needed, it uploads the apk file to cloud storage such as BaiduYun or UpYun.
- Since the app info is stored in a SQLite database, you can use the `python check_sql_data.py` command to check what the database contains, or use a SQLite browser tool.
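
For a quick look at the stored data without extra tools, a generic inspection along the following lines may help. This is a minimal sketch using Python's standard `sqlite3` module, not a copy of `check_sql_data.py`. The function `table_row_counts` and the file name `apps.db` are hypothetical, and nothing is assumed about the project's actual schema: the script just lists every table it finds and counts its rows.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        tables = [row[0] for row in cursor.fetchall()]
        counts = {}
        for table in tables:
            # Table names cannot be bound as parameters, so quote them instead.
            query = 'SELECT COUNT(*) FROM "%s"' % table.replace('"', '""')
            counts[table] = conn.execute(query).fetchone()[0]
        return counts
    finally:
        conn.close()


# Example usage ("apps.db" is a placeholder, not the project's real file name):
# for name, count in sorted(table_row_counts("apps.db").items()):
#     print("%s: %d rows" % (name, count))
```

Reading the table list from `sqlite_master` keeps the sketch independent of whatever tables the spiders create.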
###Run supervisor
Supervisor is a client/server system that allows its users to control a number of processes on UNIX-like operating systems.
- All Supervisor configuration settings are in `PolySpider/src/supervisor.conf`.
- Step into the `PolySpider/src/` directory.
- Use `supervisord -c supervisor.conf` to start the Supervisor process. All Python processes managed by Supervisor will start automatically. Supervisor also monitors the processes and restarts them if they are interrupted or quit unexpectedly.
- Admins can watch the Supervisor status in a browser at `localhost:9001` by default, and can perform operations on the Python processes such as `refresh`, `stop`, and `restart`.
- The `PolySpider/src/tmp/` directory contains the log files of Supervisor itself and of the other processes. Feel free to check it out!
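
The same status information shown in the browser can also be read from a script through Supervisor's XML-RPC interface. The sketch below rests on two assumptions the text above does not confirm: that the HTTP server at `localhost:9001` is reachable without authentication, and that the XML-RPC interface is enabled in `supervisor.conf`. The helper names `summarize` and `fetch_status` are hypothetical.

```python
try:
    from xmlrpc.client import ServerProxy  # Python 3
except ImportError:
    from xmlrpclib import ServerProxy  # Python 2.7


def summarize(process_infos):
    """Turn Supervisor process info dicts into 'name: STATE' strings."""
    return ["%s: %s" % (info["name"], info["statename"]) for info in process_infos]


def fetch_status(url="http://localhost:9001/RPC2"):
    """Query a running supervisord for the state of all managed processes."""
    server = ServerProxy(url)
    return summarize(server.supervisor.getAllProcessInfo())


# Example usage (requires a running supervisord that matches the assumptions):
# for line in fetch_status():
#     print(line)
```

Keeping the formatting in `summarize` separate from the network call makes it possible to try the output format without a running Supervisor instance.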