Label engineering: NVD and Exploit-DB

Scripts in the curation/ folder gather data and apply basic preprocessing so that it is ready for the downstream feature engineering and encoding steps. The walk-through here builds a dataset from two separate sources: NVD records, which supply the features, and Exploit DB exploits, which supply the positive labels.

All steps assume that jq is installed, and that the repo is cloned locally and is the working directory for all shell commands.

Step 1. Collect and clean raw features

The script curation/fetch-nvd.sh pulls compressed JSON files from the NVD repository and unpacks these as a single file of JSON records. It takes a destination path prefix as an optional argument.

$ mkdir ~/cve-data
$ ./curation/fetch-nvd.sh ~/cve-data
$ wc -l ~/cve-data/nvd-records.jsonl
   112294 /Users/.../nvd-records.jsonl
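To make the "single file of JSON records" idea concrete, here is a minimal Python sketch of the unpacking step. It is not the contents of fetch-nvd.sh. It assumes each downloaded feed is a gzipped JSON document whose records sit in a top-level array named CVE_Items (the layout of the legacy NVD JSON feeds); the function name unpack_feed is hypothetical.

```python
import gzip
import json


def unpack_feed(gz_path, out_handle, items_key="CVE_Items"):
    """Write every record of one gzipped feed to out_handle, one JSON object per line.

    items_key is an assumption about the feed layout, not something
    taken from fetch-nvd.sh.
    """
    with gzip.open(gz_path, "rt", encoding="utf-8") as fh:
        feed = json.load(fh)
    count = 0
    for record in feed.get(items_key, []):
        # json.dumps emits no raw newlines, so each record stays on one line.
        out_handle.write(json.dumps(record) + "\n")
        count += 1
    return count
```

Calling this once per downloaded file with the same open output handle would give a line-delimited file on which wc -l counts records.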

Inspecting the first few records of the file reveals how deeply nested the JSON schema is:

$ head ~/cve-data/nvd-records.jsonl | jq

The first transform selects the attributes of interest and promotes them to top-level keys. It relies on a predefined jq script in the config/baseline/ folder:

$ cat ~/cve-data/nvd-records.jsonl \
    | jq  -c -f config/baseline/prune-nvd.js \
    > ~/cve-data/nvd-pruned.jsonl
$ head ~/cve-data/nvd-pruned.jsonl | jq
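For readers less familiar with jq, the same kind of flattening can be sketched in Python. The field paths below are assumptions based on the legacy NVD JSON feed schema and are not taken from prune-nvd.js. The output names other than cveid, as well as the helpers dig and prune, are hypothetical.

```python
import json


def dig(record, *path, default=None):
    """Follow keys and list indexes into nested JSON; return default if a step is missing."""
    current = record
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


def prune(record):
    """Promote a few nested attributes to top-level keys (paths are assumed)."""
    return {
        "cveid": dig(record, "cve", "CVE_data_meta", "ID"),
        "description": dig(record, "cve", "description", "description_data", 0, "value"),
        "published": dig(record, "publishedDate"),
        "cvss3_score": dig(record, "impact", "baseMetricV3", "cvssV3", "baseScore"),
    }


def prune_lines(lines):
    """Apply prune to each non-empty JSONL line and yield compact JSON strings."""
    for line in lines:
        if line.strip():
            yield json.dumps(prune(json.loads(line)))
```

Missing attributes become null rather than raising an error, which keeps every input record in the output.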

Step 2. Collect and clean raw labels

Exploit DB maintains a version of its database as a simple file tree that is synced to a GitHub repository. To extract labels from its entries, a script is provided that (naively) associates exploits with CVEs by string matching on the data:

$ git clone https://github.com/offensive-security/exploit-database.git ~/cve-data/exploit-db
$ ./curation/parse-cve.py ~/cve-data/exploit-db.jsonl --exploit-db ~/cve-data/exploit-db

The --exploit-db argument should point to the local path prefix of the Exploit DB repository. The positional argument is the target file. Each line of the target file is a JSON object that represents a one-to-many relationship between a single CVE ID and its associated exploits.

$ head ~/cve-data/exploit-db.jsonl | jq
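The naive string matching described above can be illustrated with a short Python sketch. This is not the code of parse-cve.py. It assumes exploits arrive as (exploit id, text) pairs and that CVE IDs follow the usual CVE-YYYY-NNNN pattern; the output field name exploits and both function names are hypothetical.

```python
import json
import re
from collections import defaultdict

# CVE IDs are "CVE-", a four-digit year, then four or more digits.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def link_exploits(exploits):
    """Map each CVE ID found in the exploit texts to a sorted list of exploit ids."""
    by_cve = defaultdict(set)
    for exploit_id, text in exploits:
        for match in CVE_PATTERN.findall(text):
            by_cve[match.upper()].add(exploit_id)
    return {cve: sorted(ids) for cve, ids in by_cve.items()}


def to_jsonl(links, keyname="cveid"):
    """Serialize the one-to-many mapping as one JSON object per line."""
    return "\n".join(
        json.dumps({keyname: cve, "exploits": ids})
        for cve, ids in sorted(links.items())
    )
```

Matching on raw text is cheap but imprecise: an exploit that merely mentions a CVE in passing is linked to it just the same, which is why the association is described as naive.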

Step 3. Merge the features and labels

The curation/merge-json.py script is similar to the Linux join utility, except that it operates on JSON records and joins on a key name instead of a fixed column:

$ ./curation/merge-json.py \
    ~/cve-data/nvd-pruned.jsonl \
    ~/cve-data/exploit-db.jsonl \
    ~/cve-data/nvd-edb-merged.jsonl --keyname cveid

The first and second arguments are input files in which every record is expected to contain the key named by the --keyname argument. The third argument is the output file. The result is effectively a "left outer join": each record from the second file is merged into the record from the first file that has the same key value, and records from the first file with no match are kept as they are.

The --keyname field must serve as a primary key across both input files. If the same value is repeated across multiple records, later records in the file will overwrite the earlier ones.
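The join and its overwrite behavior can be sketched in a few lines of Python. This is an illustration of the semantics described above, not the source of the merge script; the function names are hypothetical.

```python
import json


def index_by_key(lines, keyname):
    """Index JSONL records by keyname; a repeated key value overwrites the earlier record."""
    index = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            index[record[keyname]] = record
    return index


def left_outer_join(left_lines, right_lines, keyname):
    """Keep every left record and merge in the right record that shares its key, if any."""
    left = index_by_key(left_lines, keyname)
    right = index_by_key(right_lines, keyname)
    # Right-hand fields are layered on top of the left record;
    # left records without a match pass through unchanged.
    return [{**record, **right.get(key, {})} for key, record in left.items()]
```

Because both inputs are indexed in dictionaries, a duplicated key silently keeps only the last record, which is why the key must be unique within each file.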