Scripts in the curation/ folder gather data and apply basic preprocessing
so that it is ready for the downstream feature engineering and encoding
steps. The walk-through here builds a dataset from two separate sources:
NVD records, which provide the features, and Exploit DB exploits, which
provide the positive labels.
All steps assume that jq is installed, and that the repo is cloned locally and is the working directory for all shell commands.
The script curation/fetch-nvd.sh pulls compressed JSON files from the NVD repository and unpacks them into a single file of JSON records, one record per line. It takes a destination path prefix as an optional argument.
$ mkdir ~/cve-data
$ ./curation/fetch-nvd.sh ~/cve-data
$ wc -l ~/cve-data/nvd-records.jsonl
112294 /Users/.../nvd-records.jsonl
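To illustrate what "a single file of JSON records" means here, the following is a minimal Python sketch of the unpacking step. It is not the logic of fetch-nvd.sh: the helper name `feeds_to_jsonl` is hypothetical, and it assumes each compressed feed is one JSON object that keeps its records in an array under a key such as `CVE_Items`.

```python
import gzip
import json
from typing import Iterable


def feeds_to_jsonl(feed_paths: Iterable[str], out_path: str,
                   items_key: str = "CVE_Items") -> int:
    """Write every record from gzip-compressed JSON feeds to out_path, one per line.

    Hypothetical helper: `items_key` is an assumed name for the array that
    holds the records inside each feed.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in feed_paths:
            with gzip.open(path, "rt", encoding="utf-8") as feed:
                document = json.load(feed)
            for record in document.get(items_key, []):
                # json.dumps escapes embedded newlines, so each record
                # occupies exactly one line of the output file.
                out.write(json.dumps(record, separators=(",", ":")) + "\n")
                count += 1
    return count
```

With one record per line, `wc -l` on the output file doubles as a record count, which is why the check above works.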
Inspecting the first few records of the file reveals how deeply nested the JSON schema is:
$ head ~/cve-data/nvd-records.jsonl | jq
The first transform selects the attributes of interest and promotes them to
top-level keys. It relies on a predefined jq script in the config/baseline/
folder:
$ cat ~/cve-data/nvd-records.jsonl \
| jq -c -f config/baseline/prune-nvd.js \
> ~/cve-data/nvd-pruned.jsonl
$ head ~/cve-data/nvd-pruned.jsonl | jq
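The idea behind the pruning step can be sketched in Python as follows. This is an illustration only, not a port of prune-nvd.js: the field paths in `FIELD_PATHS` and the output key names other than `cveid` are assumptions about the record layout, and `dig` and `prune` are hypothetical helper names.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical mapping from a top-level output key to a path inside the
# nested record. These paths are assumptions for illustration, not the
# contents of prune-nvd.js.
FIELD_PATHS: Dict[str, List[str]] = {
    "cveid": ["cve", "CVE_data_meta", "ID"],
    "published": ["publishedDate"],
    "cvss3_score": ["impact", "baseMetricV3", "cvssV3", "baseScore"],
}


def dig(record: Any, path: List[str]) -> Optional[Any]:
    """Follow `path` through nested dicts; return None if any step is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def prune(record: dict) -> dict:
    """Build a flat record holding only the selected attributes."""
    return {name: dig(record, path) for name, path in FIELD_PATHS.items()}


if __name__ == "__main__":
    # Made-up record used only to exercise the sketch.
    nested = {
        "cve": {"CVE_data_meta": {"ID": "CVE-2099-0001"}},
        "publishedDate": "2099-01-01T00:00Z",
        "impact": {},
    }
    print(json.dumps(prune(nested)))
```

Returning `None` for a missing path keeps every pruned record on the same set of keys, which tends to simplify the later encoding steps.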
Exploit DB maintains a version of its database as a simple file tree that is synced as a GitHub repository. To extract labels from its entries, a script is provided that (naively) associates exploits with CVEs by string matching on the data:
$ git clone https://github.com/offensive-security/exploit-database.git ~/cve-data/exploit-db
$ ./curation/parse-cve.py ~/cve-data/exploit-db.jsonl --exploit-db ~/cve-data/exploit-db
The --exploit-db argument should point to the local path of the Exploit DB
repository. The positional argument is the target file. Each line of that
file is a JSON object that represents a one-to-many relationship between a
single CVE ID and its associated exploits.
$ head ~/cve-data/exploit-db.jsonl | jq
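The naive string-matching approach can be sketched roughly as below. This is not the code of parse-cve.py: the function name `group_exploits_by_cve`, the `(exploit_id, text)` input shape, and the regular expression are assumptions chosen to show how a one-to-many mapping from CVE ID to exploits might be built.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# CVE IDs have the form CVE-<4-digit year>-<4 or more digits>.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def group_exploits_by_cve(exploits: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each CVE ID found in an exploit's text to the exploits mentioning it.

    `exploits` yields (exploit_id, text) pairs; both names are hypothetical.
    """
    by_cve: Dict[str, Set[str]] = defaultdict(set)
    for exploit_id, text in exploits:
        for match in CVE_PATTERN.findall(text):
            # Normalise case so "cve-2099-0001" and "CVE-2099-0001" coincide.
            by_cve[match.upper()].add(exploit_id)
    # Sort for a stable, reproducible result.
    return {cve: sorted(ids) for cve, ids in sorted(by_cve.items())}
```

Matching of this kind is naive in the sense the text describes: any mention of a CVE ID counts as an association, including IDs quoted only for comparison or background.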
The curation/merge-json.py script is similar to the Linux join utility, except that it operates on JSON records and joins on a key name instead of a fixed
column:
$ ./curation/merge-json.py \
~/cve-data/nvd-pruned.jsonl \
~/cve-data/exploit-db.jsonl \
~/cve-data/nvd-edb-merged.jsonl --keyname cveid
The first and second arguments are files whose records are all expected to
share a common key, specified by the --keyname argument. The third argument
is the output, effectively a "left outer join": each record from the second
file is merged into the corresponding record from the first wherever their
values for that key coincide.
The --keyname field must serve as a primary key in both input files. If the
same value is repeated across multiple records, later records in the file
overwrite earlier ones.
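The join semantics described above can be illustrated with a short Python sketch. It is an approximation written from this description, not the source of the merge script: the function name `left_outer_join` is hypothetical, and it assumes that a repeated key simply replaces (first file) or updates (second file) whatever was seen earlier.

```python
from typing import Dict, Hashable, Iterable, List


def left_outer_join(left: Iterable[dict], right: Iterable[dict],
                    keyname: str) -> List[dict]:
    """Merge records from `right` into records from `left` that share a key value.

    Every left record is kept; right records without a left counterpart
    are dropped. Assumed behaviour: repeated keys overwrite earlier data.
    """
    merged: Dict[Hashable, dict] = {}
    for record in left:
        # A later left record with the same key replaces the earlier one.
        merged[record[keyname]] = dict(record)
    for record in right:
        key = record[keyname]
        if key in merged:
            # Fields from the right record are added to the left record;
            # a later right record with the same key overwrites them again.
            merged[key].update(record)
    return list(merged.values())
```

Under these assumptions, an NVD record with no matching exploit entry passes through unchanged, which is what lets the absence of exploit fields act as a negative label downstream.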