
Commit 93a7325

doc: Add more details to readme files

Committed Apr 7, 2021 · 1 parent 07e8f31

File tree: 2 files changed, +53 -33 lines
 

README.md (+50 -26)
@@ -15,34 +15,58 @@ may fail if the `generate_shared_intermediates.py` module has not been run.

 ## Running Computations

-Before running other modules, the `generate_shared_intermediates.py` module must be
-run to generate datasets derived from the archival data used by other modules.
+Analysis files which generate graphs or statistics from the data are located in
+the root directory of the project, and can be run individually after the
+environment is set up.

-```shell script
-poetry run python generate_shared_intermediates.py
-```
+### Prerequisites and environment setup

-Analysis files which generate graphs or statistics from the data are located in
-the root directory of the project, and can be run individually after the shared
-intermediates have been generated.
-
-The project has a concept of a "platform", which allows splitting data-intensive
-computation from visualization. Create a new `platform-conf.toml` file in the
-project root directory from the `example-platform-conf.toml` to customize the
-platform behavior. This can be useful when using a different server or cloud
-resources to prepare data with a local machine to generate visualizations.
-Details of the specific entries in `platform-conf.toml` are included as comments
-in the example file.
-
-### Dependencies
-
-Dependencies for the project are managed with poetry, an external tool you may
-need to install. After checking out the repository, run `poetry shell` in the
-root directory to get a shell in a virtual environment. Then run `poetry install`
-to install all the dependencies from the lockfile. If this command gives you
-errors, you may need to install some supporting libraries (libsnappy) for your
-platform. You will need to be in the poetry shell or run commands with poetry
-run to actually run scripts with the appropriate dependencies.
+1. Create the renders directory for final graph exports and the scratch
+   directory for checkpointed work in progress.
+
+   ```
+   mkdir renders
+   mkdir scratch
+   ```
+
+2. Download and extract the data (see [data](#data)).
+
+3. Install dependencies.
+
+   Dependencies for the project are managed with poetry, an external tool you
+   may need to install. After checking out the repository, run `poetry shell`
+   in the root directory to get a shell in a virtual environment. Then run
+   `poetry install` to install all the dependencies from the lockfile. If this
+   command gives you errors, you may need to install some supporting libraries
+   (libsnappy) for your platform. You will need to be in the poetry shell, or
+   run commands with `poetry run`, to actually run scripts with the
+   appropriate dependencies.
+
+4. Create a platform file.
+
+   The project has a concept of a "platform", which allows splitting
+   data-intensive computation (which you probably want to do on a server) from
+   visualization (which you might want to do on your own machine). Create a
+   new `platform-conf.toml` file in the project root directory from the
+   `example-platform-conf.toml` to customize the platform behavior. This can
+   be useful when using a server or cloud resources to prepare data and a
+   local machine to generate visualizations.
+
+   The scripts are set up to generate intermediate reduced datasets in the
+   `./scratch` directory when extensive pre-computation is needed. This
+   directory can then be copied from the server to your local machine to run
+   the actual visualizations. Details of the specific entries in
+   `platform-conf.toml` are included as comments in the example file.
+
+5. Generate the shared intermediate files.
+
+   Before running other modules, the `generate_shared_intermediates.py` module
+   must be run to generate datasets derived from the archival data used by
+   other modules.
+
+   ```shell script
+   poetry run python generate_shared_intermediates.py
+   ```

 ### Dask
 The uncompressed dataset size as of March 2020 is too large to fit on
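Taken together, the new setup steps in this README amount to roughly the following shell session. This is a sketch for orientation only: the `cp` line is an assumption (the README just says to create `platform-conf.toml` from `example-platform-conf.toml`), while the other commands are quoted from the diff above.

```shell script
# Working directories for final graph exports and checkpointed intermediates
mkdir renders
mkdir scratch

# (download and extract the data at this point; see data/README.md)

# Enter a poetry-managed virtual environment and install locked dependencies
poetry shell
poetry install   # may require libsnappy on your platform first

# Assumption: start the platform config from the committed example, then edit it
cp example-platform-conf.toml platform-conf.toml

# Generate the datasets every other analysis module depends on
poetry run python generate_shared_intermediates.py
```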

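When the platform split is in use, the `./scratch` directory is the hand-off point between the compute server and the visualization machine. A hypothetical copy step, with host and path as placeholders rather than anything taken from the diff:

```shell script
# Hypothetical: pull checkpointed intermediates from the compute server so
# the visualization scripts can run locally; adjust host and path as needed.
rsync -av compute-server:/path/to/project/scratch/ ./scratch/
```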
data/README.md (+3 -7)
@@ -17,22 +17,18 @@ project root.
 3. Move the dataset exported files to their expected locations

    ```
-   mv export/flows ./
-   mv export/log_gaps_TM.parquet ./
-   mv export/user_active_deltas.parquet ./
-   mv export/transactions_TM.parquet ./
+   mv export clean
    ```

-4. Remove the download tar.gz file and extra readme left in the export diretory
+4. Remove the download tar.gz file
    ```
-   rm -r export
    rm dataset.tar.gz
    ```

 4. Validate the structure of your data directory. Your data directory should look like this:

    ```
-   ./
+   clean
    ├── flows
    │   ├── p2p_TM_DIV_none_INDEX_start
    │   │   ├── _common_metadata
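For orientation, the revised data steps reduce to roughly this sequence, assuming the archive was downloaded as `dataset.tar.gz` and unpacks into an `export/` directory (the `tar` line is an assumption; the `mv` and `rm` commands are quoted from the diff):

```shell script
# Assumption: unpack the downloaded archive into an export/ directory
tar -xzf dataset.tar.gz

# Rename it to the clean/ layout the analysis scripts now expect
mv export clean

# Remove the downloaded archive
rm dataset.tar.gz
```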
