Annotated Dataset
Christopher Helm edited this page Jan 1, 2019 · 3 revisions
The annotated dataset JSON format is as follows:
- the name of the dataset;
- a description of the dataset;
- a regexTarget: this optional field may contain one solution for the current dataset (when a solution exists and is already known);
- a list of examples. Each example must contain:
  - the "string", the raw example text;
  - two arrays, match and unmatch: match contains the target extractions and unmatch contains the text that must not be extracted. Both are stored as character ranges in the format { "start": 119, "end": 136 }. In the example below, start is a zero-based offset and end is exclusive.
The only mandatory field is the examples list. In each element, all fields are mandatory, i.e., each example must contain the string, match and unmatch fields.
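The rules above can be checked mechanically. The following is a minimal sketch in Python, not code from the project: the function name `validate_dataset` and the message wording are hypothetical. It checks only the constraints stated on this page, plus one assumption of its own, namely that every range is an object with integer `start` and `end` where `start <= end`.

```python
import json


def validate_dataset(dataset):
    """Return a list of problems found in an annotated dataset (empty if none).

    Only the rules stated on this page are checked; the function name and
    the messages are illustrative, not part of the tool.
    """
    examples = dataset.get("examples")
    if not isinstance(examples, list):
        # "examples" is the only mandatory top-level field.
        return ["missing mandatory field: examples"]

    problems = []
    for i, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append(f"example {i}: not a JSON object")
            continue
        # Every example must carry all three fields.
        for field in ("string", "match", "unmatch"):
            if field not in example:
                problems.append(f"example {i}: missing field {field}")
        for field in ("match", "unmatch"):
            for r in example.get(field, []):
                # Assumption: a range has integer start and end, start <= end.
                ok = (
                    isinstance(r, dict)
                    and isinstance(r.get("start"), int)
                    and isinstance(r.get("end"), int)
                    and r["start"] <= r["end"]
                )
                if not ok:
                    problems.append(f"example {i}: bad range in {field}: {r!r}")
    return problems


if __name__ == "__main__":
    text = (
        '{"examples": [{"string": "ab",'
        ' "match": [{"start": 0, "end": 1}],'
        ' "unmatch": [{"start": 1, "end": 2}]}]}'
    )
    print(validate_dataset(json.loads(text)))  # prints []
```

Returning a list of messages rather than raising on the first error makes it easier to report every malformed example in a large dataset at once.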
An example of an annotated dataset in JSON is as follows (repetitive parts have been replaced with …):
{
"name": "Log/MAC",
"description": "",
"regexTarget": "",
"examples": [
{
"string": "Jan 12 06:26:19: ACCEPT service http from 119.63.193.196 to firewall(pub-nic), prefix: \"none\" (in: eth0 119.63.193.196(5c:0a:5b:63:4a:82):4399 -> 140.105.63.164(50:06:04:92:53:44):80 TCP flags: ****S* len:60 ttl:32)",
"match": [
{ "start": 119, "end": 136 },
{ "start": 161, "end": 178 }
],
"unmatch": [
{ "start": 0, "end": 119 },
{ "start": 136, "end": 161 },
{ "start": 178, "end": 215 }
]
},
{
"string": "Jan 12 06:26:20: ACCEPT service dns from 140.105.48.16 to firewall(pub-nic-dns), prefix: \"none\" (in: eth0 140.105.48.16(00:21:dd:bc:95:44):4263 -> 140.105.63.158(00:14:31:83:c6:8d):53 UDP len:76 ttl:62)",
"match": [
…
],
"unmatch": [
…
]
}
]
}
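To see how the ranges relate to the text, the first example above can be sliced directly. The sketch below is illustrative Python rather than code from the project, and the helper names (`slices`, `tiles_string`, `regex_spans`) are hypothetical. It treats `start` as a zero-based offset and `end` as exclusive, which is what the numbers in the example imply. It also checks that the match and unmatch ranges together cover the string with no gaps or overlaps; this page does not state that as a requirement, it simply holds for this example. `MAC_PATTERN` is an assumed candidate regular expression written for this illustration, not the dataset's regexTarget (which is empty in the example).

```python
import re

# First example from this page (the JSON escapes \" become plain quotes).
EXAMPLE = {
    "string": (
        'Jan 12 06:26:19: ACCEPT service http from 119.63.193.196 to '
        'firewall(pub-nic), prefix: "none" (in: eth0 '
        '119.63.193.196(5c:0a:5b:63:4a:82):4399 -> '
        '140.105.63.164(50:06:04:92:53:44):80 TCP flags: ****S* len:60 ttl:32)'
    ),
    "match": [{"start": 119, "end": 136}, {"start": 161, "end": 178}],
    "unmatch": [
        {"start": 0, "end": 119},
        {"start": 136, "end": 161},
        {"start": 178, "end": 215},
    ],
}

# Hypothetical candidate solution: six lowercase hex pairs separated by colons.
MAC_PATTERN = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"


def slices(example, field):
    """Return the substrings selected by example[field] (end is exclusive)."""
    text = example["string"]
    return [text[r["start"]:r["end"]] for r in example[field]]


def tiles_string(example):
    """True if match and unmatch ranges cover the string exactly once."""
    ranges = sorted(
        (r["start"], r["end"]) for r in example["match"] + example["unmatch"]
    )
    position = 0
    for start, end in ranges:
        if start != position:  # a gap or an overlap
            return False
        position = end
    return position == len(example["string"])


def regex_spans(pattern, example):
    """Return the ranges a candidate regex extracts, in the dataset's format."""
    return [
        {"start": m.start(), "end": m.end()}
        for m in re.finditer(pattern, example["string"])
    ]


if __name__ == "__main__":
    print(slices(EXAMPLE, "match"))  # ['5c:0a:5b:63:4a:82', '50:06:04:92:53:44']
    print(tiles_string(EXAMPLE))  # True
    print(regex_spans(MAC_PATTERN, EXAMPLE) == EXAMPLE["match"])  # True
```

Comparing the spans produced by a candidate regex with the match array is one simple way to test whether that regex solves a given example.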