experimental - PLEASE BE CAREFUL. Intended for research purposes.
This is a fork of the tor-browser-crawler, updated to run correctly with newer libraries and a newer Tor Browser Bundle (TBB) version.
This project may be run natively on the host system or in a docker container.
If running natively, the required libraries and Python 3.x modules must be installed.
See the Dockerfile and requirements.txt for the list of requirements.
Running through docker is easier and more reproducible.
As such, this section will focus on the docker container setup.
- Install Docker
  - follow their documentation
  - don't forget to add your user to the `docker` group after install
- Build the docker container
  - install the `make` utility if it is not native on your system
  - run `make build` to compile the docker image
- Set up your crawl configuration files
  - replace `sites.txt` with the list of websites you wish to crawl
  - edit `Makefile` to use the correct network interface of your host
  - adjust the `--timeout` value in the `Makefile` to higher values if needed
  - make any desired changes to `config.ini`
- Start the crawl
  - run `make run` to launch a container
  - the logs and packet captures should appear in the newly created `results` directory
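
Before launching a crawl, it can help to sanity-check the inputs from the configuration step. The sketch below is a minimal Python helper and is not part of the crawler. It assumes `sites.txt` holds one site per line and that lines beginning with `#` are comments (neither is specified above), and it lists the host's network interface names so you can pick the one to set in the `Makefile`. The function names `clean_site_list` and `list_interfaces` are hypothetical.

```python
import socket


def clean_site_list(lines):
    """Return unique, non-empty entries in their original order.

    Assumes one site per line; blank lines and lines starting with '#'
    are skipped (an assumption, not a documented sites.txt rule).
    """
    seen = set()
    sites = []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue
        if entry not in seen:
            seen.add(entry)
            sites.append(entry)
    return sites


def list_interfaces():
    """Return the names of the network interfaces known to the host OS."""
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    try:
        with open("sites.txt", encoding="utf-8") as handle:
            sites = clean_site_list(handle)
        print(f"{len(sites)} site(s) to crawl")
    except FileNotFoundError:
        print("sites.txt not found in the current directory")
    print("available interfaces:", ", ".join(list_interfaces()))
```

Run it from the repository root on the host (not inside the container), since the interface in the `Makefile` refers to the host's network interface.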
- Library Versions
  - versions of some components are important as different version combinations may be incompatible
  - this project has been frozen to v10.0.10 of the TBB
  - to use the latest TBB version, remove the version number from the `Dockerfile`
  - newer versions of TBB may, however, require different versions of selenium and geckodriver
- Crawler has been modified to use the `tcpdump` utility in place of `dumpcap` to capture traffic.
  - This avoids runtime issues that exist when using `dumpcap` on some system configurations.
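
Because version combinations matter, it may be useful to record which Python package versions are installed when running natively. The following is a minimal sketch using the standard-library `importlib.metadata` module (Python 3.8+). `installed_version` is a hypothetical helper, not part of this project, and geckodriver is a separate binary, so it is not covered by this check.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # selenium is named in the notes above; add other entries from
    # requirements.txt as needed.
    for name in ("selenium",):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Comparing this output against `requirements.txt` is one way to spot a mismatch before changing the pinned TBB version.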