Naming conventions: "EvidenceForge" is the product name, evidenceforge is the Python package name, and eforge is the CLI command name.
EvidenceForge is a system for generating realistic synthetic security logs for cybersecurity threat hunting training and research. The system uses a two-phase architecture:
Phase 1 - Scenario Creation (Skill-assisted): Claude Code skills (/eforge scenario) guide users through an interactive interview to build structured scenario YAML files. The skill uses a hybrid interview flow -- structured questions first for core requirements, then free-form conversation to fill gaps and refine details. Users can also hand-author or edit scenario YAML directly.
Phase 2 - Log Generation (Deterministic): The eforge generate CLI command executes the scenario plan without any LLM calls, producing large-scale, temporally consistent datasets across multiple log formats (Windows Event Logs, Zeek, ECAR, Syslog, Bash History, Snort, web logs) with coordinated cross-references (matching LogonIDs, PIDs, session data, connection IDs, etc.).
This architecture combines the flexibility and domain expertise of LLM-assisted authoring with the speed, cost-efficiency, and reproducibility of deterministic generation.
Unlike existing tools that focus solely on attack simulation or use purely programmatic generation, this system:
- Generates coordinated multi-source logs (not single format)
- Supports both baseline "normal" activity and injected attack scenarios
- Maintains realistic temporal patterns and behavioral variation via persona-based activity distribution
- Provides ground truth about malicious activities for threat hunting exercises
- Models network topology and sensor placement for realistic traffic visibility
The tool addresses the need for realistic, large-volume training datasets without the privacy/security concerns of production data.
- Generate realistic synthetic logs for 7 formats: Windows Event Security, Zeek conn, ECAR, Syslog, Bash History, Snort alerts, W3C web access
- Claude Code skills for scenario creation (/eforge scenario) and generation troubleshooting (/eforge generate)
- Skill installer command (eforge install-skills) for project-level or global installation
- Pre-built persona library for common organizational roles
- Maintain cross-log consistency (events reference same LogonIDs, PIDs, timestamps, connection IDs)
- Support arbitrary time windows from hours to weeks
- Handle datasets from small (classroom exercises) to large (multi-day, 500+ users)
- Parallel generation at emitter level (different log formats simultaneously) with incremental writing
- Progress reporting during generation with per-hour and per-storyline-event tracking
- Schema validation for scenario files (Pydantic-based)
- Cross-reference validation (users, systems, personas, groups referenced correctly)
- Evaluation framework with concrete metrics (format compliance, consistency, statistical properties)
- Ground truth documentation (GROUND_TRUTH.md) for every generated scenario
- Network topology and sensor placement modeling for traffic visibility
- Persona-based temporal activity distribution with configurable work hours, intensity, and risk profiles
- Comprehensive test coverage (95%+) with pytest
- Flexible timezone handling (UTC internal, configurable per-system/format for output)
Out of scope for the MVP:
- Bit-perfect reproducibility via seed (save scenario file for reuse instead)
- Subjective "does this feel real?" evaluation beyond concrete metrics
- Config file inheritance/templating
- Built-in LLM client for semantic validation (deferred; use Claude Code skills for now)
- Checkpointing and resume for long-running generation jobs
- Support for LLM backends beyond Claude Code skills (Bedrock client, OpenAI, Ollama)
- PyPI package distribution (MVP is git clone + local install)
- Pre-built binaries or container images
- Streaming output to SIEM/data lakes
- OT/ICS environment simulation
- Mobile device logs
- Cloud provider logs (CloudTrail, Azure Activity, GCP Audit)
- Time-slice or user-level parallelization (MVP parallelizes at emitter level only)
- Large dataset optimization (100M+ events, memory-mapped writes)
Primary Users:
- Security Researchers: Need realistic datasets for developing detection algorithms and threat hunting techniques
- Threat Hunters: Require practice datasets with known ground truth for training and skill development
- Security Educators: Must create reproducible scenarios for classroom exercises and labs
- SOC Trainers: Need varied, realistic datasets for analyst training programs
- Detection Engineers: Require test data for validating detection rules and SIEM configurations
User Context:
- All users are expected to have Claude Code installed (skills are the primary scenario authoring interface)
- Mix of technical proficiency (from educators who may not code to researchers who do)
- Need for both quick scenario generation via skills and detailed customization via YAML editing
- Often simulating specific real-world environments or generic representative environments
- May need same scenario run multiple times with variations
- Can use the /eforge scenario skill for guided creation or hand-author YAML for precise control
eforge init [--force]
- System copies config.example.yaml to config.yaml in the current directory
- Includes documented parameters for output paths and logging
- User can customize for their environment
/eforge scenario
- Claude Code skill starts a hybrid interview flow
- Structured phase -- asks targeted questions about:
  - Environment (size, type of organization, systems, users)
  - Network topology and sensor placement
  - Baseline activity patterns (reference pre-built personas or define custom)
  - Specific attack scenarios or activities to inject
  - Time windows and output requirements
- Free-form phase -- identifies gaps in the scenario and asks open-ended questions:
  - Refine persona behaviors
  - Add detail to attack storylines
  - Clarify network visibility requirements
- User can specify at any level of detail:
  - High-level: "50-person financial services company" (skill fills in details)
  - Mixed: 10 specific users, generate 40 more with personas
  - Detailed: Exact usernames, hostnames, IPs, file paths, timezones
- Skill generates complete scenario YAML file conforming to the schema in Section 4.2
- Saves to disk for review/editing/reuse
- No LLM calls needed during generation phase
eforge install-skills [--project | --global]
- Copies skill files from the repo's commands/eforge/ directory
- --project (default): Installs to .claude/commands/ in the current project
- --global: Installs to ~/.claude/commands/
- Reports which skills were installed and their slash-command triggers
eforge validate SCENARIO_FILE
- Schema validation: Check YAML structure, data types, required fields via Pydantic models
- Cross-reference validation: Verify internal consistency
  - Referenced users exist in environment
  - Referenced systems exist in environment
  - Referenced personas are defined
  - Group members reference valid users
  - Storyline actors reference valid users
  - Time sequences are within the defined window
- Report all validation issues with field paths, descriptions, and suggestions
- Return exit code 0 on success, 1 on YAML parse error, 2 on schema or cross-reference failure
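The cross-reference checks listed above could be sketched roughly as follows. This is a minimal illustration over plain dictionaries, not the project's Pydantic-based implementation. The function name `find_cross_reference_issues` is hypothetical, and the layout (users, systems, and groups nested under `environment`; personas and storyline at the top level) is an assumption made for the example. Storyline actors are not checked here because they may be external actors rather than environment users.

```python
def find_cross_reference_issues(scenario: dict) -> list[str]:
    """Return human-readable issues prefixed with field paths (hypothetical helper)."""
    env = scenario.get("environment", {})
    users = {u["username"] for u in env.get("users", [])}
    systems = {s["hostname"] for s in env.get("systems", [])}
    personas = {p["name"] for p in scenario.get("personas", [])}
    issues: list[str] = []

    # Users may omit a persona, but a named persona must be defined.
    for i, user in enumerate(env.get("users", [])):
        persona = user.get("persona")
        if persona is not None and persona not in personas:
            issues.append(f"environment.users[{i}].persona: unknown persona '{persona}'")
        primary = user.get("primary_system")
        if primary not in systems:
            issues.append(f"environment.users[{i}].primary_system: unknown system '{primary}'")

    # Every group member must be a known username.
    for i, group in enumerate(env.get("groups", [])):
        for member in group.get("members", []):
            if member not in users:
                issues.append(f"environment.groups[{i}].members: unknown user '{member}'")

    # Every storyline step must target a known system.
    for i, step in enumerate(scenario.get("storyline", [])):
        target = step.get("system")
        if target not in systems:
            issues.append(f"storyline[{i}].system: unknown system '{target}'")
    return issues
```

Collecting every issue before returning, rather than raising on the first one, matches the requirement to report all validation issues in a single pass.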
eforge generate SCENARIO_FILE [--output DIR] [--verbose] [--debug]
- Load and validate scenario file (schema + cross-reference validation)
- Load format definitions for requested log types
- Initialize generation state (users, systems, sessions, processes, connections)
- Start parallel emitters (one per log format, shared read-only state access)
- Generate baseline activity:
  - Execute persona-based activity patterns for all users
  - Apply realistic temporal distributions throughout time window
  - StateManager tracks all sessions, processes, connections
- Layer storyline activities on top of baseline:
  - Execute detailed event sequences at specified times
  - Suppress baseline for affected users during storyline (+/-5 min window)
- Each emitter writes coordinated logs with consistent cross-references
- Convert timestamps from UTC to system/format-specific timezones as configured
- Write to organized directory structure with incremental flushing (10K event buffer)
- Show progress with Rich progress bars (per-hour baseline, per-event storyline)
- Log details to generation.log in output directory
- Generate GROUND_TRUTH.md and OBSERVATION_MANIFEST.json sidecars
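The baseline suppression step in the workflow above can be illustrated with a short sketch. It assumes baseline and storyline events are simple (username, timestamp) pairs and drops any baseline event that falls within five minutes of a storyline event for the same user. The name `suppress_baseline`, the tuple layout, and the inclusive window boundary are assumptions for illustration only.

```python
from datetime import datetime, timedelta

SUPPRESSION_WINDOW = timedelta(minutes=5)  # the +/-5 min window from the workflow


def suppress_baseline(baseline, storyline, window=SUPPRESSION_WINDOW):
    """Drop baseline (username, time) events close to a storyline event for that user."""
    # Index storyline times by user so each baseline event only checks its own user.
    storyline_times: dict[str, list[datetime]] = {}
    for username, when in storyline:
        storyline_times.setdefault(username, []).append(when)

    kept = []
    for username, when in baseline:
        nearby = any(abs(when - t) <= window for t in storyline_times.get(username, []))
        if not nearby:
            kept.append((username, when))
    return kept
```

Users who do not appear in the storyline are unaffected, so their baseline activity continues through the attack window.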
eforge evaluate OUTPUT_DIR [--report REPORT_FILE] [--verbose]
- Load generated logs
- Run validation checks:
  - Format compliance (syntactically valid against format definitions)
  - Consistency (cross-references resolve correctly)
  - Statistical properties (distributions, timing patterns)
  - Completeness (no orphaned references)
- If GROUND_TRUTH.md exists, validate that all documented IOCs are present in logs
- Generate report with scores and specific findings
- Optional: Save report for comparison across runs
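As a rough illustration of the consistency and completeness checks, the sketch below looks for logon IDs that are referenced by later events but were never introduced by a logon event. The event layout (dictionaries with `event_type` and `logon_id` keys) and the name `find_orphaned_logon_ids` are hypothetical. The real evaluator would work on parsed log records in each format.

```python
def find_orphaned_logon_ids(events: list[dict]) -> set[str]:
    """Return logon IDs used by non-logon events but never opened by a logon event."""
    opened = {e["logon_id"] for e in events if e.get("event_type") == "logon"}
    referenced = {
        e["logon_id"]
        for e in events
        if e.get("event_type") != "logon" and "logon_id" in e
    }
    return referenced - opened


events = [
    {"event_type": "logon", "logon_id": "0x3e7a1"},
    {"event_type": "process_create", "logon_id": "0x3e7a1"},
    {"event_type": "process_create", "logon_id": "0x9f001"},  # never opened
    {"event_type": "dns_query"},  # carries no logon reference
]
orphans = find_orphaned_logon_ids(events)
```

The same set-difference pattern would apply to PIDs and connection IDs.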
Primary file: scenario-name.yaml
version: string # Schema version, e.g., "1.0"
# If schema version is not "1.0", reject with error:
# "Unsupported schema version. This tool supports version 1.0."
name: string # Human-readable scenario name
description: string # Multi-line natural language description
environment:
description: string # Natural language environment description
timezone:
default: string # Default timezone for all systems (e.g., "UTC", "America/New_York")
systems: # Per-system overrides (optional)
pattern: string # e.g., "WS-NYC-*": "America/New_York"
# Note: The /eforge scenario skill handles auto-generating users/systems from
# high-level descriptions (e.g., "50-person financial services company").
# The final scenario YAML only contains explicit users and systems lists.
users:
- username: string
full_name: string
email: string
persona: string # Optional: Reference to persona definition; if omitted, user generates no activity
primary_system: string # Required in current implementation: reference to system hostname
groups: list[string] # List of group names
enabled: boolean # If false, user exists in environment but generates no activity
systems:
- hostname: string
ip: string # Single IP address (multi-NIC out of scope for MVP)
os: string
type: string # workstation|server|domain_controller
assigned_user: string # Optional, for workstations
services: list[string] # Optional: Service names like "IIS", "SSH", "SQL Server" (not ports)
roles: list[string] # Optional but strongly recommended for servers/proxies (drives world-model host capabilities)
# If omitted, auto-populated from OS type:
# Windows: ["dns-client", "ntp-client", "smb", "windows-update"]
# Linux: ["dns-client", "ntp-client", "syslog"]
# Roles and services feed the compiled world model used for realistic session routing,
# baseline lateral movement, and infrastructure selection
# Explicit values override auto-population entirely (no merge)
groups:
- name: string
description: string
members: list[string] # Usernames
permissions: list[string]
network:
segments:
- name: string # Segment identifier (e.g., "workstations", "servers", "dmz")
cidr: string # CIDR notation (e.g., "10.0.10.0/24")
description: string # Human-readable description
systems: list[string] # Optional: Hostnames in this segment (inferred from system IPs if omitted)
sensors:
- type: string # network|ids|firewall (determines which log formats this sensor generates)
name: string # Sensor identifier
monitoring_segments: list[string] # Segment names this sensor monitors
direction: string # inbound|outbound|bidirectional (what traffic is visible)
log_formats: list[string] # Which formats this sensor generates (e.g., ["zeek_conn", "snort_alert"])
description: string # Optional description
# Note: Network topology defines which connections are observable by sensors.
# Only traffic visible to configured sensors generates network log entries.
personas:
- name: string
description: string # Natural language behavior description
typical_activities: list[string] # High-level activities
work_hours: string # e.g., "8am-6pm with variation"
application_usage: list[string]
risk_profile: string # low|medium|high (affects activity intensity/variation)
expanded_activities: # Detailed activity patterns with frequencies
- activity: string # Concrete activity
frequency: float # Events per hour
processes: list[string]
network_targets: list[string]
file_patterns: list[string]
time_window:
start: datetime # ISO 8601 format in UTC (YYYY-MM-DDTHH:MM:SSZ or +00:00)
end: datetime # Either end (ISO 8601 UTC)...
# OR
duration: string # ...or duration (exact time span: "10h", "3d", "2h30m")
# Exactly one of end or duration must be specified
baseline_activity:
description: string # Natural language description
intensity: string # low|medium|high -> events/user/hour: low=5, medium=15, high=40
variation: string # low|medium|high -> timing stddev: low=+/-10%, medium=+/-25%, high=+/-50%
# Note: Persona risk_profile modifies intensity (low=-5, high=+10 events/hour)
storyline:
- time: string # Time formats:
# - ISO 8601 timestamp (must be within window)
# - Relative offset: "+2h30m" or "+2h" or "+150m"
# - Offset in seconds: "+7200"
actor: string # Actor specification:
# - Specific username: "bwilliams"
# - Threat actor: "APT29", "SCATTERED SPIDER", "Red Team Alpha"
# - Generic: "attacker"
# - Note: Multiple distinct actors supported in same scenario
system: string # Target system hostname
activity: string # Natural language activity description
details: dict # Flexible activity-specific details:
# Common examples (not exhaustive):
source_ip: string # For external attackers (e.g., details.source_ip)
url: string # For web activities
file: string # For file operations
binary: string # For process execution
command: string # For command execution
target_system: string # For lateral movement
stolen_creds: string # For credential usage
event_sequence: list # Detailed event sequence with specific log artifacts
- event_type: string
log_sources: list[string] # Which log formats show this event
fields: dict # Specific field values for each log source
output:
logs:
- format: string # windows_event_security|zeek_conn|ecar|syslog|bash_history|snort_alert|web_access
variant: string # Optional: Security|System|conn|http|auth|access
timezone: string # "system" (use system's timezone) or explicit "UTC"/"America/New_York"
options: dict # Format-specific options
destination: string # Output directory path
compression: boolean # Compress output files (gzip)
# Format-specific options (output format, headers, etc.) are future enhancements.
Format Definitions (src/evidenceforge/formats/definitions/{format_name}.yaml)
format:
name: string
description: string
category: string # windows|linux|network|web|application
common_fields:
- name: string
type: string # See Type System below
required: boolean
range: list[integer] # For numeric types (min, max)
enum: list[any] # Allowed values (mutually exclusive with range)
pattern: string # Regex pattern for validation
default: any
# Type System for Format Definitions:
# - datetime: ISO 8601 timestamp, rendered per format (epoch, ISO, custom)
# - integer: 64-bit signed integer
# - float: IEEE 754 double-precision floating point
# - string: UTF-8 string
# - ip_address: IPv4 or IPv6 address
# - ipv4: IPv4 address specifically
# - ipv6: IPv6 address specifically
# - hex_string: Hexadecimal string (e.g., "0xC000006D")
# - boolean: true/false
# - port: Integer 1-65535
# - mac_address: MAC address (colon or hyphen separated)
# - hostname: DNS hostname (RFC 1123)
# - fqdn: Fully qualified domain name
# - email: Email address
# - url: URL (http/https)
# - uuid: UUID v4
# - base64: Base64-encoded string
# Format-Specific Precision Requirements:
# - Zeek timestamps: Epoch float with exactly 6 decimal places (microsecond precision)
# Format: f"{timestamp:.6f}" to preserve trailing zeros during JSON serialization
# - Windows Event timestamps: ISO 8601 with millisecond precision (YYYY-MM-DDTHH:MM:SS.sssZ)
# - Syslog timestamps: RFC 3339 format with timezone offset
variants: # For formats with subtypes (channels, log types)
- name: string
description: string
fields: list[field] # Same structure as common_fields
validators:
- rule: object # JSON Logic expression (see http://jsonlogic.com)
error: string # Error message if validation fails
output_template: string # Jinja2 template for rendering final log format (most emitters)
# eCAR builds JSON directly in Python (no template)
# Available context: all field values as variables, timestamp(), hex(), escape()
# Validator examples using JSON Logic:
# Success status can't have failure reason:
# {"and": [{"==": [{"var": "Status"}, "0x0"]}, {"!=": [{"var": "FailureReason"}, null]}]}
#
# Network logon requires IP address:
# {"and": [{"in": [{"var": "LogonType"}, [3, 10]]}, {"==": [{"var": "IpAddress"}, "-"]}]}
The generator maintains these state structures during execution:
from __future__ import annotations  # defer annotation evaluation; UserState is defined elsewhere

from dataclasses import dataclass
from datetime import datetime


@dataclass
class ActiveSession:
    logon_id: str
    username: str
    system: str
    logon_type: int
    start_time: datetime
    source_ip: str


@dataclass
class RunningProcess:
    pid: int
    parent_pid: int
    image: str
    command_line: str
    username: str
    system: str
    start_time: datetime
    integrity_level: str


@dataclass
class OpenConnection:
    conn_id: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    state: str
    start_time: datetime
    bytes_sent: int
    bytes_received: int


@dataclass
class GeneratorState:
    active_sessions: dict[str, ActiveSession]
    running_processes: dict[int, RunningProcess]
    open_connections: dict[str, OpenConnection]
    dns_cache: dict[str, str]
    current_time: datetime
    user_states: dict[str, UserState]  # Current activity per user

Directory Structure
Generated logs are written to a timestamped output directory:
output/
    scenario-name-YYYYMMDD-HHMMSS/
        generation.log  # Detailed generation log
        GROUND_TRUTH.md  # Ground truth sidecar (empty for baseline-only scenarios)
        OBSERVATION_MANIFEST.json  # Source-observation sidecar
        OUTPUT_TARGET.txt  # default or sof-elk output target
        windows_events.xml  # Windows Event Logs
        zeek_conn.log  # Zeek connection logs
        ecar.json  # ECAR events
        <linux-host>/syslog.log  # Linux syslogs (default target)
        bash_history.log  # Bash history entries
        snort_alerts.log  # Snort/Suricata alerts
        <firewall>/cisco_asa.log  # Cisco ASA firewall syslogs (default target)
        web_access.log  # Web/proxy logs
GROUND_TRUTH.md Format
Every successful generation creates a GROUND_TRUTH.md file. Attack/red-herring scenarios document the narrative, timeline, and IOCs for training and evaluation; baseline-only scenarios explicitly state that no malicious events were generated.
# Ground Truth: [Scenario Name]
Generated: YYYY-MM-DD HH:MM:SS UTC
Time Window: [start] to [end]
## Attack Summary
[Narrative description of the malicious/suspicious activities. Excludes benign baseline
activity. Describes the attack from initial access through objectives, including
techniques used, systems compromised, data accessed, etc.]
## Timeline
Chronological sequence of key malicious events. Each entry includes:
- Timestamp (ISO 8601 format)
- Optional record ID (EventRecordID, UID, line number) if applicable
- Human-readable description with relevant context
Format:
YYYY-MM-DDTHH:MM:SS.ssssssZ [RecordID: 12345] - Description with IOCs
Example:
2024-01-15T10:23:45.123456Z [EventRecordID: 12345] - Initial access: Threat actor logged in to WIN-TEST-01 as CORP\jdoe from source IP 104.248.71.33
2024-01-15T10:24:12.789012Z - C2 communication: Outbound connection from 192.168.1.100 to C2 server 45.83.221.45:443
2024-01-15T10:25:03.456789Z [EventRecordID: 12389] - Credential dumping: Process mimikatz.exe (PID 4532) executed by CORP\jdoe
## Indicators of Compromise (IOCs)
Atomic indicators that can be searched for in the logs to identify malicious activity.
Grouped by type for easy reference.
### Network Indicators
- Attacker IP addresses: 104.248.71.33, 45.83.221.45
- C2 domains: evil-c2.example.com, malware-download.net
- C2 IP:Port combinations: 45.83.221.45:443, 45.83.221.45:8080
### User Accounts
- Compromised accounts: CORP\jdoe, CORP\admin-backup
- Created accounts: CORP\backdoor-admin
### Host Indicators
- Compromised systems: WIN-TEST-01, WIN-TEST-05, DC-01
- Malicious processes: mimikatz.exe, nc.exe, evil-payload.exe
- Process IDs: 4532 (mimikatz.exe), 5123 (nc.exe)
- File paths: C:\Temp\mimikatz.exe, C:\Users\jdoe\Downloads\payload.exe
- Command lines: "mimikatz.exe privilege::debug sekurlsa::logonpasswords"
### Other Indicators
- [Additional categories as relevant: registry keys, scheduled tasks, services, etc.]
Purpose:
- Provides ground truth for threat hunting training exercises
- Enables validation that detection rules capture the malicious activity
- Documents the attack narrative for educational purposes
- Lists atomic IOCs for direct searching in SIEM/analysis tools
Generation:
- Created automatically on every successful log generation
- For baseline-only scenarios (no malicious activity), the file states explicitly that no malicious events were generated
- IOCs extracted from actual generated events (guaranteed to be present in logs)
- Timeline includes only key events (not every single malicious log entry)
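A timeline entry in the format shown in the GROUND_TRUTH.md template could be rendered along these lines. This is only a sketch: `format_timeline_entry` and its parameters are hypothetical names, and it assumes timestamps are timezone-aware datetimes.

```python
from datetime import datetime, timezone


def format_timeline_entry(when, description, record_label=None, record_id=None):
    """Render 'YYYY-MM-DDTHH:MM:SS.ssssssZ [Label: id] - description'."""
    # %f always yields six digits, which preserves microsecond precision.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    record = f" [{record_label}: {record_id}]" if record_id is not None else ""
    return f"{stamp}{record} - {description}"


entry = format_timeline_entry(
    datetime(2024, 1, 15, 10, 23, 45, 123456, tzinfo=timezone.utc),
    "Initial access: Threat actor logged in to WIN-TEST-01",
    record_label="EventRecordID",
    record_id=12345,
)
```

The record ID is optional, matching timeline entries such as network events that have no EventRecordID.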
Command: init
eforge init [--force]
Options:
--force Overwrite existing config.yaml if it exists
Creates config.yaml from config.example.yaml in the current directory.
Non-interactive: Simply copies the example config with all options documented.
Command: install-skills
eforge install-skills [--project | --global]
Options:
--project Install skills to .claude/commands/ in the current project (default)
--global Install skills to ~/.claude/commands/
Copies EvidenceForge skill files to the appropriate Claude Code skills location.
Skill files are bundled as package data and loaded via importlib.resources at runtime.
Skills installed:
/eforge scenario - Guided scenario creation
/eforge generate - Generation with troubleshooting
Command: validate
eforge validate SCENARIO_FILE
Arguments:
SCENARIO_FILE Path to scenario YAML file
Validates scenario file for schema correctness and cross-reference integrity.
Exit codes: 0 = success, 1 = YAML parse error, 2 = schema/cross-reference error.
Checks performed:
- YAML parsing and Pydantic schema validation
- All referenced users exist in environment.users
- All referenced systems exist in environment.systems
- All referenced personas are defined in personas section
- Group members reference valid users
- Storyline actors reference valid users or external actors
- Storyline times fall within the defined time window
- Network segment and sensor references are valid
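The time-window check depends on resolving the storyline time formats accepted by the scenario schema: ISO 8601 timestamps, relative offsets such as "+2h30m", and bare offsets in seconds such as "+7200". The sketch below shows one possible resolution routine. The names `parse_offset` and `resolve_storyline_time` are hypothetical, only the d/h/m units seen in the schema examples are handled, and absolute timestamps are assumed to carry a UTC designator.

```python
import re
from datetime import datetime, timedelta, timezone

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60}
_PART = re.compile(r"(\d+)([dhm])")


def parse_offset(text: str) -> timedelta:
    """Parse '+2h30m', '+150m', or '+7200' (bare seconds) into a timedelta."""
    body = text.lstrip("+")
    if body.isdigit():
        return timedelta(seconds=int(body))
    parts = _PART.findall(body)
    # Reject strings with leftover characters the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != body:
        raise ValueError(f"unrecognized offset: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in parts))


def resolve_storyline_time(value: str, start: datetime, end: datetime) -> datetime:
    """Resolve an absolute or relative storyline time and check it is in the window."""
    if value.startswith("+"):
        when = start + parse_offset(value)
    else:
        when = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if not (start <= when <= end):
        raise ValueError(f"storyline time {value!r} falls outside the time window")
    return when


start = datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
end = start + parse_offset("10h")
resolved = resolve_storyline_time("+2h30m", start, end)
```

Relative offsets are anchored to the window start, so the same scenario can be rerun with a different start time without editing each storyline entry.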
Command: generate
eforge generate SCENARIO_FILE [--output DIR] [--verbose] [--debug]
Arguments:
SCENARIO_FILE Path to scenario YAML file
Options:
--output, -o Override output directory from scenario file
--verbose, -v Enable INFO level logging
--debug, -d Enable DEBUG level logging
Generates logs according to scenario specification.
No LLM calls during generation (purely deterministic).
Shows progress bars and writes detailed logs to output directory.
Performs schema + cross-reference validation before generation starts.
Command: evaluate
eforge evaluate OUTPUT_DIR [--report REPORT_FILE] [--verbose]
Arguments:
OUTPUT_DIR Path to generated log directory
Options:
--report Path to write evaluation report (default: OUTPUT_DIR/evaluation.json)
--verbose Include detailed findings in report
Evaluates generated logs for concrete metrics:
- Format compliance: Events parse successfully against format definitions
- Consistency: Cross-references resolve (LogonIDs, PIDs, connection IDs)
- Statistical properties: Event type distributions, timing patterns
- Completeness: No orphaned references
- Ground truth validation: If GROUND_TRUTH.md exists, verify all documented IOCs are present
Report is informational only (no pass/fail thresholds for MVP).
Outputs JSON report with scores and specific findings.
Evaluation Report Schema (minimal)
The evaluation report is a JSON file with the following top-level structure. The exact sub-structure of each section will be refined during implementation.
{
"format_compliance": { "...": "per-format parse/validation results" },
"cross_ref_consistency": { "...": "orphaned references, mismatched IDs" },
"ground_truth": { "...": "IOC presence verification (if GROUND_TRUTH.md exists)" },
"summary": { "total_checks": 0, "passed": 0, "failed": 0, "warnings": 0 }
}
Command: version
eforge version
Shows version information.
EvidenceForge uses Claude Code skills as the primary scenario authoring interface. Skills are Markdown files that provide Claude Code with domain-specific instructions, enabling it to guide users through complex scenario creation without requiring a built-in LLM client.
Skills live in commands/eforge/ in the repository and are installed via eforge install-skills.
/eforge scenario -- Guided scenario creation skill
Responsibilities:
- Interview users about their scenario requirements using a hybrid flow
  - Structured phase: targeted questions about environment, users, systems, network, personas, storyline, time window, output formats
  - Free-form phase: identify gaps, refine details, ask follow-up questions
- Reference the pre-built persona library and suggest appropriate personas
- Generate valid scenario YAML conforming to the schema
- Validate the generated YAML against known constraints before saving
- Save the file and suggest next steps (eforge validate, eforge generate)
/eforge generate -- Generation with troubleshooting skill
Responsibilities:
- Run eforge generate on a scenario file
- If generation fails, analyze the error output
- Suggest fixes for common issues (schema errors, missing references, invalid time windows)
- Optionally edit the scenario file to fix issues and retry
- Report summary of generated output on success
- Hybrid interview flow: Start with structured questions to gather core requirements quickly, then switch to free-form conversation for gap-filling and refinement
- Progressive disclosure: Ask simple questions first, offer advanced options only when relevant
- Persona-aware: Reference the pre-built persona library to reduce authoring effort
- Schema-aware: Skills know the exact scenario YAML schema and generate conforming output
- Idempotent suggestions: Skills suggest CLI commands the user can verify and run
# Install to current project (default)
eforge install-skills --project
# Creates .claude/commands/eforge-scenario.md
# Creates .claude/commands/eforge-generate.md
# Install globally for all projects
eforge install-skills --global
# Creates ~/.claude/commands/eforge-scenario.md
# Creates ~/.claude/commands/eforge-generate.md
Skills are plain Markdown files and can be version-controlled, customized, or extended by users.
- Small datasets (1 hour, 50 users, ~10K events): < 15 seconds generation time
- Medium datasets (8 hours, 100 users, ~100K events): < 30 seconds generation time (current benchmark: ~14 seconds)
- Large datasets (8 hours, 500 users, ~1M events): < 30 minutes generation time
- Memory usage: < 2GB regardless of output size (soft target, not enforced)
  - Streaming writes with 10K event buffer per emitter
  - State grows with active sessions/processes but typically < 100MB for large scenarios
  - No automatic state pruning (realistic incompleteness is acceptable)
- Parallel generation at emitter level: Different log formats write simultaneously
  - Shared StateManager with thread-safe read access
  - Each emitter runs in separate thread
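Emitter-level parallelism with buffered, incremental writes might look roughly like the sketch below. Each emitter runs in its own thread, reads the same immutable event sequence, and flushes its own buffer every 10K events. The `BufferedEmitter` class and `run_emitters` function are hypothetical names, and rendering is reduced to writing one line per event; real emitters would render format-specific records.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

BUFFER_SIZE = 10_000  # events held in memory before each flush


class BufferedEmitter:
    """Hypothetical emitter that writes one line per event with periodic flushes."""

    def __init__(self, path: Path, buffer_size: int = BUFFER_SIZE):
        self.path = path
        self.buffer_size = buffer_size
        self.flushes = 0

    def run(self, events) -> int:
        written = 0
        buffer: list[str] = []
        with self.path.open("w", encoding="utf-8") as handle:
            for event in events:
                buffer.append(f"{event}\n")
                if len(buffer) >= self.buffer_size:
                    handle.writelines(buffer)
                    written += len(buffer)
                    buffer.clear()
                    self.flushes += 1
            if buffer:  # write whatever remains after the last full buffer
                handle.writelines(buffer)
                written += len(buffer)
                self.flushes += 1
        return written


def run_emitters(emitters: dict[str, BufferedEmitter], events: tuple) -> dict[str, int]:
    """Run each emitter in its own thread over the same read-only event sequence."""
    with ThreadPoolExecutor(max_workers=max(1, len(emitters))) as pool:
        futures = {name: pool.submit(e.run, events) for name, e in emitters.items()}
        return {name: future.result() for name, future in futures.items()}
```

Passing the events as a tuple keeps the shared input read-only, so the threads need no locking. Because each emitter owns its buffer and output file, memory use is bounded by the buffer size times the number of emitters rather than by total output size.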
- Support up to 1000 users and 2000 systems in a single environment
- Handle time windows up to 30 days
- Generate up to 100M events in a single run
- Input validation: Schema + cross-reference validation before generation starts (fail fast)
- Atomic writes: Use temp files + rename for log files
- Resource exhaustion: Check disk space before starting (require 2x estimated output size), fail if insufficient
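The atomic-write and disk-space requirements above can be sketched with the standard library alone. The helper names `check_disk_space` and `write_atomically` are hypothetical, and the output size estimate is assumed to be supplied by the caller.

```python
import os
import shutil
import tempfile
from pathlib import Path


def check_disk_space(output_dir: Path, estimated_bytes: int, factor: int = 2) -> None:
    """Fail fast unless free space is at least `factor` times the estimated output size."""
    free = shutil.disk_usage(output_dir).free
    needed = estimated_bytes * factor
    if free < needed:
        raise OSError(f"insufficient disk space: need {needed} bytes, have {free}")


def write_atomically(path: Path, lines) -> None:
    """Write to a temp file in the target directory, then rename it into place."""
    # Creating the temp file next to the target keeps both on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.writelines(lines)
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a partial file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If generation is interrupted, the target path either does not exist or still holds the previous complete file, and the leftover temp file is removed.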
- Never log AWS credentials or other secrets
- Support AWS credential chain (no credentials in config files)
- Environment variable interpolation for sensitive values
- .env file support with search from current directory up to home
- Format definition validation: Constrained DSL only, no arbitrary code execution from untrusted format files
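One way to read the constrained-DSL requirement is that validation rules are pure data, interpreted by a small evaluator with a fixed operator set, so an untrusted format file can never trigger code execution. The sketch below implements only the handful of JSON Logic operators that appear in the validator examples given with the format definition schema (var, ==, !=, and, in). It is not the json-logic-qubit API, `evaluate_rule` is a hypothetical name, and it simplifies JSON Logic semantics (strict Python equality, boolean result for "and").

```python
def evaluate_rule(rule, data: dict):
    """Evaluate a tiny JSON Logic subset against a flat record of field values."""
    if not isinstance(rule, dict):
        return rule  # literals (strings, numbers, lists, None) evaluate to themselves
    if len(rule) != 1:
        raise ValueError("a rule must have exactly one operator")
    (op, args), = rule.items()
    if op == "var":
        return data.get(args)  # only the simple string form of "var" is supported
    if not isinstance(args, list):
        args = [args]
    values = [evaluate_rule(arg, data) for arg in args]
    if op == "==":
        return values[0] == values[1]
    if op == "!=":
        return values[0] != values[1]
    if op == "and":
        return all(values)
    if op == "in":
        return values[0] in values[1]
    # Anything outside the allow-list is rejected rather than interpreted.
    raise ValueError(f"operator not allowed: {op!r}")


# Rule taken from the validator examples: success status with a failure reason.
rule = {"and": [{"==": [{"var": "Status"}, "0x0"]}, {"!=": [{"var": "FailureReason"}, None]}]}
matched = evaluate_rule(rule, {"Status": "0x0", "FailureReason": "Bad password"})
```

In the examples the rules appear to describe the invalid condition, so a match would be reported with the rule's error message. That reading is an interpretation of the examples rather than something the schema states outright.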
- Progress reporting with Rich progress bars for long-running jobs
- Clear, actionable error messages with field paths and suggestions
- Examples and templates included
- Comprehensive documentation
- Skills provide guided authoring for users who prefer not to write YAML manually
- 95%+ test coverage across all components
- Type hints throughout codebase
- Pydantic models for all data structures
- Clear separation of concerns (skill-assisted authoring / validation / generation / evaluation)
- Format definitions as data, not code
Core:
- Python 3.11+ (for latest type hint features)
- uv for package management and script/tool support
- Pydantic v2 for data validation and schema management
CLI & Output:
- Typer for CLI framework
- Rich for progress bars and console formatting
- Jinja2 for log format templates (eCAR uses direct Python JSON construction)
- PyYAML for configuration parsing
- pytz for timezone handling
Skills:
- Claude Code skills (Markdown files in commands/eforge/)
- Installed via eforge install-skills command
- No runtime dependency on Claude Code for generation (skills are authoring-time only)
Testing:
- pytest for test framework
- pytest-cov for coverage reporting
- pytest-mock for mocking
- pytest-benchmark for performance tests
- Separate marker @pytest.mark.slow for large dataset tests (excluded from the default run; enabled with the --include-slow flag)
Format Support:
- Standard library json/csv for text formats
- Custom parsers/writers for Zeek, Syslog, ECAR, Bash History, etc.
- json-logic-qubit for format definition validation rules
evidenceforge/
+-- README.md
+-- AGENTS.md  # AI coding agent instructions
+-- LICENSE
+-- pyproject.toml  # uv project config (entry point: eforge)
+-- config.example.yaml  # Example configuration
+-- .env.example  # Example environment variables
|
+-- commands/  # Claude Code skills (source, installed via eforge install-skills)
|   +-- eforge/
|       +-- scenario.md  # /eforge scenario skill
|       +-- generate.md  # /eforge generate skill
|
+-- personas/  # Pre-built persona library
|   +-- developer.yaml
|   +-- accountant.yaml
|   +-- executive.yaml
|   +-- help_desk.yaml
|   +-- security_analyst.yaml
|   +-- ...  # 10-15 common personas
|
+-- src/
|   +-- evidenceforge/
|       +-- __init__.py
|       +-- __main__.py  # CLI entry point
|       +-- py.typed  # PEP 561 marker
|       |
|       +-- cli/
|       |   +-- __init__.py
|       |   +-- commands.py  # CLI command implementations (init, generate, validate, evaluate, install-skills, version)
|       |
|       +-- models/
|       |   +-- __init__.py
|       |   +-- config.py  # Pydantic models for config
|       |   +-- scenario.py  # Pydantic models for scenario
|       |   +-- exceptions.py  # Custom exception types
|       |   +-- state.py  # Runtime state models
|       |
|       +-- validation/
|       |   +-- __init__.py
|       |   +-- schema.py  # Schema + cross-reference validation
|       |
|       +-- generation/
|       |   +-- __init__.py
|       |   +-- engine.py  # Main generation orchestrator
|       |   +-- state_manager.py  # State tracking (sessions, processes, connections)
|       |   +-- activity.py  # Persona-based activity generation with temporal distribution
|       |   +-- ground_truth.py  # GROUND_TRUTH.md generation
|       |   +-- network_visibility.py  # TAP/SPAN sensor modeling
|       |   +-- emitters/
|       |       +-- __init__.py
|       |       +-- base.py  # Base emitter interface
|       |       +-- windows.py  # Windows Event Security emitter
|       |       +-- zeek.py  # Zeek conn.log emitter
|       |       +-- ecar.py  # ECAR event emitter
|       |       +-- syslog.py  # Syslog emitter
|       |       +-- bash_history.py  # Bash history emitter
|       |       +-- snort.py  # Snort alert emitter
|       |       +-- web.py  # Web access log emitter
|       |
|       +-- formats/
|       |   +-- __init__.py
|       |   +-- format_def.py  # Pydantic models for format definitions
|       |   +-- loader.py  # Format definition loader
|       |   +-- validator.py  # Format constraint validator (JSON Logic DSL)
|       |   +-- definitions/
|       |       +-- windows_event_security.yaml
|       |       +-- zeek_conn.yaml
|       |       +-- ecar.yaml
|       |       +-- syslog.yaml
|       |       +-- bash_history.yaml
|       |       +-- snort_alert.yaml
|       |       +-- web_access.yaml
|       |
|       +-- llm/  # Created when LLM integration is needed (future)
|       |   +-- __init__.py
|       |
|       +-- evaluation/
|       |   +-- __init__.py
|       |   +-- evaluator.py  # Main evaluation logic
|       |   +-- metrics.py  # Concrete metrics (format, consistency, stats)
|       |   +-- report.py  # Report generation
|       |
|       +-- utils/
|           +-- __init__.py
|           +-- config.py  # Config loading with env var interpolation
|           +-- files.py  # File I/O utilities
|           +-- ids.py  # ID generation utilities
|           +-- logging.py  # Logging setup
|           +-- time.py  # Time/duration parsing utilities
|
+-- tests/
|   +-- __init__.py
|   +-- conftest.py  # Shared fixtures
|   +-- unit/  # Fast unit tests (526+ tests)
|   |   +-- test_models.py
|   |   +-- test_validation.py
|   |   +-- test_state_manager.py
|   |   +-- test_engine.py
|   |   +-- test_emitters.py
|   |   +-- test_activity.py
|   |   +-- test_persona_activity.py
|   |   +-- test_network_visibility.py
|   |   +-- test_ground_truth.py
|   |   +-- test_format_def.py
|   |   +-- test_format_loader.py
|   |   +-- test_format_validator.py
|   |   +-- test_time_parsing.py
|   |   +-- test_timezone_handling.py
|   |   +-- test_cli.py
|   |   +-- test_utils.py
|   |   +-- ...
|   +-- integration/  # Multi-component tests
|   |   +-- test_format_definitions.py
|   |   +-- test_parallel_generation.py
|   |   +-- test_scenario_timezone.py
|   |   +-- test_medium_dataset.py
|   |   +-- ...
|   +-- live/  # Tests requiring external APIs
|   +-- fixtures/  # Test fixture data
|
+-- docs/
|   +-- PRD.md  # This document
|   +-- ...
|
+-- examples/
    +-- simple-baseline/  # Simple baseline activity scenario
    +-- ransomware-attack/  # Ransomware scenario
    +-- credential-stuffing/  # Credential attack scenario
    +-- insider-threat/  # Insider threat scenario
Configuration Hierarchy (later overrides earlier):
- Default values in code
- System-wide config (if exists): ~/.config/evidence-forge/config.yaml
- .env file (if exists): Search from current working directory upward to home directory, stop at first found (don't merge multiple)
- Project config: ./config.yaml
- Command-line arguments
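The override order can be illustrated with a small merge routine. This is a minimal sketch, not the project's loader: the function names, the config keys, and the choice of a recursive (deep) merge are all assumptions made for illustration, since the hierarchy only specifies that later sources override earlier ones.

```python
from __future__ import annotations

from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them wholesale
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge layers in order; each later layer overrides the earlier ones."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical keys, ordered from lowest to highest precedence
defaults = {"output": {"dir": "./output", "compress": False}, "workers": 4}
project = {"output": {"dir": "./datasets"}}
cli_args = {"workers": 8}

config = resolve_config(defaults, project, cli_args)
```

Here the project file changes only the output directory and the command line changes only the worker count, so every other default survives.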
Secrets Handling:
- AWS credentials: Use standard boto3 credential chain (never in config files)
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, etc.)
- AWS credentials file (~/.aws/credentials) using specified profile
- IAM role (if running on EC2/ECS)
- AWS SSO
- Other secrets: Support environment variable interpolation in config: ${VAR_NAME}
- .env file search: Walk from CWD upward, max search depth is home directory
- Security: Never log secrets or include in error messages, stack traces, or debug output
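The ${VAR_NAME} interpolation rule could be implemented along the following lines. This is a sketch under assumptions: the function name, the accepted variable-name pattern, and the decision to raise on an unset variable are illustrative and not taken from the actual config loader. The error message names the variable but never echoes a value, in keeping with the security rule above.

```python
from __future__ import annotations

import os
import re
from typing import Mapping, Optional

# Assumed syntax: ${NAME} where NAME is a conventional environment variable name
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace every ${VAR_NAME} in `value` with the variable's value."""
    source = os.environ if env is None else env

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in source:
            # Report only the variable name, never any secret value
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _VAR_PATTERN.sub(replace, value)


# Hypothetical variable name, passed explicitly so the example is reproducible
resolved = interpolate("token=${API_TOKEN}", {"API_TOKEN": "abc123"})
```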
Exit Code Table:
| Code | Category | Description |
|---|---|---|
| 0 | Success | Operation completed successfully |
| 1 | Input Error | Malformed YAML or file I/O error |
| 2 | Schema Validation | Pydantic validation or cross-reference failure |
| 20 | Resource Exhaustion | Insufficient disk space or memory |
| 21 | Generation Error | Invalid state or unrecoverable generation failure |
| 22 | Format Error | Format definition loading or validation error |
| 130 | SIGINT | User interrupted (Ctrl+C) |
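One way to keep these codes consistent across commands is a single enumeration. The values below come straight from the table; the class and member names are assumptions for illustration.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Process exit codes, mirroring the exit code table."""

    SUCCESS = 0
    INPUT_ERROR = 1           # Malformed YAML or file I/O error
    SCHEMA_VALIDATION = 2     # Pydantic validation or cross-reference failure
    RESOURCE_EXHAUSTION = 20  # Insufficient disk space or memory
    GENERATION_ERROR = 21     # Invalid state or unrecoverable generation failure
    FORMAT_ERROR = 22         # Format definition loading or validation error
    SIGINT = 130              # User interrupted (Ctrl+C)
```

Because IntEnum members are integers, a command can pass one directly to sys.exit, for example sys.exit(ExitCode.SCHEMA_VALIDATION).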
Malformed YAML:
- Detect during parsing
- Show line number and syntax error
- Suggest common fixes (indentation, quotes, etc.)
- Exit code: 1
Schema Violations:
- Validate against Pydantic models
- Show field path and expected type/constraint
- List all violations (don't stop at first)
- Exit code: 2
Cross-Reference Errors:
- Present issues with field paths, descriptions, and suggestions
- Distinguish errors (block generation) from warnings (proceed with caution)
- Exit code: 2 if errors present
Resource exhaustion:
- Memory: Stream writes, don't buffer all output
- Disk: Check available space before starting, fail fast if insufficient
- Exit code: 20
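The fail-fast disk check can be sketched with the standard library. The function name, the size estimate parameter, and the 20% safety margin are assumptions; how the generator estimates output size is not specified here.

```python
import shutil
from pathlib import Path


def has_enough_disk(output_dir: Path, required_bytes: int, safety_factor: float = 1.2) -> bool:
    """Return True if the filesystem holding `output_dir` has room for the dataset.

    `required_bytes` is the caller's size estimate; `safety_factor` adds headroom.
    """
    usage = shutil.disk_usage(output_dir)
    return usage.free >= int(required_bytes * safety_factor)
```

A caller would run this before generation starts and exit with code 20 when it returns False.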
Invalid state:
- Detect impossible states (e.g., PID reuse collision)
- Log detailed state information
- Attempt recovery (assign new PID)
- If unrecoverable: fail with detailed error
- Exit code: 21
Format definition errors:
- Validate format definitions on load
- Show which format and which rule failed
- Fail before generation starts
- Exit code: 22
Empty time window:
- If start == end: Error
- If duration <= 0: Error
No users or systems defined:
- Require at least 1 user and 1 system
- Error with suggestion to add users/systems
Conflicting specifications:
- Duplicate usernames or hostnames: Error (must be unique)
- Storyline references non-existent user/system: Error with suggestion
Activity before window start or after end:
- Clamp to window boundaries with warning
- Log original vs adjusted time
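A minimal sketch of the clamping rule, assuming timezone-aware datetime values; the function and logger names are hypothetical.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("evidenceforge.sketch")  # hypothetical logger name


def clamp_to_window(ts: datetime, start: datetime, end: datetime) -> datetime:
    """Clamp `ts` into [start, end], logging original vs adjusted time."""
    if ts < start:
        adjusted = start
    elif ts > end:
        adjusted = end
    else:
        return ts
    logger.warning(
        "Event time %s outside window; adjusted to %s",
        ts.isoformat(),
        adjusted.isoformat(),
    )
    return adjusted


window_start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 1, 1, 17, 0, tzinfo=timezone.utc)
early = datetime(2024, 1, 1, 8, 45, tzinfo=timezone.utc)
clamped = clamp_to_window(early, window_start, window_end)
```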
User activity on unassigned system:
- If user has no primary_system and no assigned workstation: Error
- If user activity needs to occur on another host, model it explicitly through storyline events or remote-session behavior rather than relying on implicit placement
Process tree inconsistencies:
- Parent PID doesn't exist: Use reasonable default (explorer.exe, init)
- Circular parent references: Error
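Circular parent references can be detected by walking each parent chain. The sketch below assumes a per-system mapping of PID to parent PID, which is a hypothetical representation rather than the state manager's real one. A parent that is missing from the mapping is treated as a root here, because that case is handled separately by the default-parent rule.

```python
from __future__ import annotations

from typing import Optional


def find_parent_cycle(parents: dict[int, Optional[int]]) -> Optional[list[int]]:
    """Return the PIDs forming a parent cycle, or None if there is none."""
    known_safe: set[int] = set()  # PIDs already proven to reach a root
    for start in parents:
        path: list[int] = []
        position: dict[int, int] = {}
        pid: Optional[int] = start
        while pid is not None and pid not in known_safe:
            if pid in position:
                # Revisited a PID on the current walk: everything from its
                # first occurrence onward forms the cycle
                return path[position[pid]:]
            position[pid] = len(path)
            path.append(pid)
            pid = parents.get(pid)  # Unknown parent ends the walk
        known_safe.update(path)
    return None


healthy = {1: None, 2: 1, 3: 2}
broken = {1: None, 10: 12, 11: 10, 12: 11}
cycle = find_parent_cycle(broken)
```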
Network impossibilities:
- Connection where src_ip == dst_ip: Skip with warning (network sensors cannot observe localhost traffic)
- Connection involving localhost addresses (127.0.0.0/8): Skip with warning (never traverses network)
- Connection involving link-local addresses (169.254.0.0/16): Skip with warning (auto-config, not routed)
- Connection involving multicast/reserved addresses (224.0.0.0/4): Skip with warning
- Connection to private IP from external actor: Warn (might be VPN/proxy, but allow)
- Response bytes > 0 for failed connection: Adjust to 0, warn
- Connection not visible to configured sensors: Skip based on network topology and sensor placement
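The address-based skip rules map directly onto Python's ipaddress module. This sketch covers only the four skip cases (identical endpoints, loopback, link-local, multicast); the function name and the reason strings are assumptions, and the sensor-visibility and external-actor checks are left out.

```python
from __future__ import annotations

import ipaddress
from typing import Optional


def skip_reason(src_ip: str, dst_ip: str) -> Optional[str]:
    """Return why a connection should be skipped, or None if it may be emitted."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if src == dst:
        return "src_ip == dst_ip"
    for addr in (src, dst):
        if addr.is_loopback:      # 127.0.0.0/8 for IPv4
            return "loopback address"
        if addr.is_link_local:    # 169.254.0.0/16 for IPv4
            return "link-local address"
        if addr.is_multicast:     # 224.0.0.0/4 for IPv4
            return "multicast address"
    return None
```

A generator would log the returned reason as a warning and drop the connection before it reaches any network emitter.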
Logon without logoff:
- Within time window: Acceptable and common (user still logged in, forgot to log off, system crash)
- At end of window: ~85% of sessions close properly, ~15% incomplete (realistic messiness)
- Storyline can specify incomplete sessions for attacker behavior
Time travel:
- Event A references Event B that happens later: Error
- Process termination before creation: Error
- All timestamps validated during generation
Duplicate identifiers:
- Two users with same username: Error (must be unique)
- Two systems with same hostname: Error (must be unique)
- PID reuse within same system: Track PIDs per-system, allocate incrementally, reuse only after explicit termination
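The per-system PID policy could look like the following sketch. The class name, the PID range, and the choice to start reusing PIDs only once the incremental range is exhausted are assumptions; the requirement itself only says that PIDs are tracked per system, allocated incrementally, and reused only after explicit termination.

```python
from __future__ import annotations

from collections import defaultdict, deque


class PidAllocator:
    """Allocate PIDs per host; a PID is reused only after it was terminated."""

    def __init__(self, first_pid: int = 1000, max_pid: int = 65535) -> None:
        self._max_pid = max_pid
        self._next: dict[str, int] = defaultdict(lambda: first_pid)
        self._active: dict[str, set[int]] = defaultdict(set)
        self._released: dict[str, deque[int]] = defaultdict(deque)

    def allocate(self, hostname: str) -> int:
        if self._next[hostname] <= self._max_pid:
            # Normal case: hand out the next never-used PID for this host
            pid = self._next[hostname]
            self._next[hostname] += 1
        elif self._released[hostname]:
            # Range exhausted: reuse the oldest explicitly terminated PID
            pid = self._released[hostname].popleft()
        else:
            raise RuntimeError(f"PID space exhausted on {hostname}")
        self._active[hostname].add(pid)
        return pid

    def terminate(self, hostname: str, pid: int) -> None:
        if pid not in self._active[hostname]:
            raise ValueError(f"PID {pid} is not active on {hostname}")
        self._active[hostname].remove(pid)
        self._released[hostname].append(pid)
```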
Timezone handling:
- All internal timestamps UTC
- Convert to system/format timezone during output
- If system timezone not configured: Use environment.timezone.default
- Support pattern matching for multi-location environments (WS-NYC-, WS-LON-)
- Invalid timezone name: Error with suggestion
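The lookup order (hostname pattern first, then the environment default) and the convert-on-output rule can be sketched as below. The function names are hypothetical, and shell-style glob patterns such as WS-NYC-* are an assumption about the pattern syntax. ZoneInfo raises ZoneInfoNotFoundError for an unknown name, which is where an "Error with suggestion" message could be attached.

```python
from __future__ import annotations

from datetime import datetime
from fnmatch import fnmatchcase
from zoneinfo import ZoneInfo


def resolve_timezone_name(hostname: str, patterns: dict[str, str], default: str) -> str:
    """Return the timezone name for a host: first matching pattern, else the default."""
    for pattern, tz_name in patterns.items():
        if fnmatchcase(hostname, pattern):
            return tz_name
    return default


def to_local(ts_utc: datetime, tz_name: str) -> datetime:
    """Convert an aware UTC timestamp to the output timezone."""
    if ts_utc.tzinfo is None:
        raise ValueError("internal timestamps must be timezone-aware UTC")
    return ts_utc.astimezone(ZoneInfo(tz_name))


# Hypothetical pattern table for a two-site environment
patterns = {"WS-NYC-*": "America/New_York", "WS-LON-*": "Europe/London"}
tz_name = resolve_timezone_name("WS-NYC-014", patterns, default="UTC")
```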
Connection to private IP from external actor:
- Allow but warn: "External actor accessing private IP -- consider modeling VPN/proxy/compromised perimeter"
- User should explicitly model network topology to represent VPN/proxy/perimeter devices
Unit Tests (target: 95% coverage, currently 526+ tests passing)
- All Pydantic models: validation, serialization
- State manager: session/process/connection tracking (including thread safety)
- Format validators: constraint DSL evaluation
- Emitters: log format generation (including thread safety)
- Activity generation: persona-based temporal distribution
- Network visibility: sensor placement and traffic filtering
- Ground truth generation
- Time utilities: parsing, duration calculation, timezone handling
- Config loading: env var interpolation, .env file discovery
- CLI commands: argument parsing, error handling
Integration Tests (target: 90% coverage)
- Scenario file loading, validation, and generation end-to-end
- Format definition loading, validation, and application
- Parallel generation across multiple emitters
- Timezone handling through full pipeline
- Medium dataset generation (100 users, 8 hours)
End-to-End Tests (run manually or in release pipeline)
- Complete workflow: init, generate, evaluate
- Multiple dataset sizes and configurations
- Verify output structure, format compliance, consistency
- Performance benchmarks (time to generate, memory usage)
Fixtures:
Required scenario files:
- minimal: 1 user, 1 system, 1 hour, baseline only
- small-realistic: 20 users, 10 systems, 8 hours, baseline only
- attack-single: 50 users, ransomware scenario
- attack-multi: 100 users, credential stuffing + lateral movement
- large-scale: 100 users, 24 hours, multiple log formats
Property Tests:
- All timestamps within specified window
- All LogonIDs referenced have corresponding 4624 events
- All PIDs referenced have corresponding process creation events
- No orphaned connections (all have start events)
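Two of these properties can be expressed as plain checker functions that a test can assert return nothing. The event representation (dicts with timestamp, event_id, and logon_id keys) is a simplifying assumption for this sketch and does not reflect the real event model.

```python
from __future__ import annotations

from datetime import datetime, timezone


def timestamps_outside_window(events: list[dict], start: datetime, end: datetime) -> list[dict]:
    """Return events whose timestamp falls outside [start, end]."""
    return [e for e in events if not (start <= e["timestamp"] <= end)]


def orphaned_logon_ids(events: list[dict]) -> set[str]:
    """Return LogonIDs that are referenced but have no 4624 logon event."""
    created = {e["logon_id"] for e in events if e["event_id"] == 4624}
    referenced = {
        e["logon_id"]
        for e in events
        if e["event_id"] != 4624 and e.get("logon_id")
    }
    return referenced - created


start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 17, 0, tzinfo=timezone.utc)
events = [
    {"timestamp": start, "event_id": 4624, "logon_id": "0x3e7a1"},
    {"timestamp": start.replace(minute=5), "event_id": 4688, "logon_id": "0x3e7a1"},
    # Deliberately broken: references a session that was never created
    {"timestamp": start.replace(minute=9), "event_id": 4688, "logon_id": "0xdead0"},
]
orphans = orphaned_logon_ids(events)
```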
Framework: pytest with plugins
- pytest-cov for coverage
- pytest-mock for mocking
- pytest-benchmark for performance tests
Coverage Requirements:
- Overall: 95%+
- Core generation engine: 95%+
- Format definitions & validators: 90%+
- CLI interface: 85%+
- Exclude: __main__.py, type stubs, test fixtures
- Three Claude Code skills
  - /eforge scenario: Guided scenario creation with hybrid interview flow, ENVIRONMENT.md generation, 10-tactic ATT&CK kill chain template
  - /eforge generate: Generation execution with pre-flight validation, error diagnosis, ENVIRONMENT.md copying
  - /eforge validate: Schema and cross-reference validation with auto-fix for simple issues
  - Developed using /skill-creator with 2 iterations, 30/30 eval assertions passing
- Pre-built persona library (15 personas)
  - Persona files use the exact same YAML schema as the Persona model in scenario files
  - Complete set: developer, executive, analyst, sysadmin, help_desk, security_analyst, accountant, sales, hr, marketing, data_analyst, receptionist, intern, project_manager, legal_counsel
  - Each with realistic activity patterns, work hours, and risk profiles
- eforge install-skills command
  - Installs skills, personas, and reference docs to .claude/commands/ (project) or ~/.claude/commands/ (global)
  - Bundled as package data via importlib.resources + hatch force-include
  - Handles updates: overwrites changed files, removes stale files
- Documentation
  - This PRD
  - Scenario authoring reference (docs/scenario-reference.md)
  - README with skill-based workflow
- Core generation engine (implemented in Phases 1-2)
  - 7 log formats with emitters
  - Persona-based temporal activity distribution
  - Network visibility with TAP/SPAN sensor modeling
  - Parallel emitter-level generation
  - Progress reporting
  - Ground truth generation
  - Schema + cross-reference validation
  - 542+ tests passing
- eforge evaluate command with 5 quality dimensions, 23 sub-scores, acceptance criteria
- /eforge evaluate skill for qualitative LLM review
- Full details in docs/data-quality-prd.md
Architectural refactor replacing manual per-emitter field coordination with a canonical SecurityEvent intermediate representation. All 12 generate_* methods build SecurityEvents with composable context dataclasses (HostContext, AuthContext, ProcessContext, NetworkContext, KerberosContext, ShellContext) and dispatch through EventDispatcher. Remaining single-format emissions (eCAR diversity, DNS lookups, engine system traffic) use dispatch_raw(RawLogEntry).
Results: A/B eval comparison showed +1.4 overall score improvement (82.3→83.7). Expert panel found 6 tells fixed by the migration, 0 regressions introduced. OS-aware filtering via can_handle() prevents cross-OS emission bugs by construction.
- Full details in docs/event-model-prd.md
Phase 4 evaluation revealed that while signal integrity is excellent (100/100), the background noise is too shallow and uniform for the data to pass casual inspection by an experienced analyst. Phase 5 addresses these generator-level limitations in 5 incremental sub-phases.
Problem statement: An experienced threat hunter would identify the data as synthetic within minutes due to: uniform Zeek conn_states, zero UDP/ICMP traffic, only 11 destination IPs, only 2 Windows Event IDs in baseline, metronomic timing, and statistically interchangeable users.
Target outcome: Overall eval score ≥ 85, all hard acceptance criteria pass, no "instant tells" on qualitative review.
- SID generation: Populate SubjectUserSid/TargetUserSid with realistic Windows SIDs (S-1-5-21-{domain}-{rid}). Per-domain base SID at engine init, per-user RID mapping, well-known SIDs for system accounts.
- Session lifecycle: Baseline activity generates logoff events (Windows 4634, eCAR USER_SESSION/LOGOUT). Sessions have realistic lifetimes with probabilistic termination.
- Zeek conn_state diversity: Replace hardcoded SF/ShADadfF with probabilistic selection (SF 85%, S0 5%, REJ 2%, RSTO 3%, etc.) with history strings and byte counts consistent with state.
- Process path expansion: Expand from 14 to 50+ unique process paths including OS backbone (svchost, lsass, explorer, csrss) and common applications (browsers, Office, Teams). Per-persona weighting.
- Additional Windows Event IDs: 4625 (failed logon), 4672 (special privileges), 4689 (process termination), 4648 (explicit credential logon), 5156 (firewall allow). Update format definition, templates, and validation.
- Failed logon generation: 5-15% of logon attempts fail with realistic reasons (bad password, locked account, expired password).
- EDR object type expansion (eCAR format): Generate FILE/CREATE, FILE/MODIFY, REGISTRY/MODIFY, FLOW/CONNECT, MODULE/LOAD events alongside existing USER_SESSION and PROCESS types.
- Process termination: Pair 4689 with 4688, track running processes, terminate after realistic durations.
- UDP traffic: DNS queries (UDP 53) preceding TCP connections, NTP sync (UDP 123), DHCP (UDP 67/68), mDNS/LLMNR, QUIC (UDP 443).
- ICMP traffic: Periodic pings between same-segment systems, ICMP unreachable for failed connections.
- Service registry: Internal consistency model that tracks which internal IPs run which services (ports). Declared systems + auto-generated infrastructure. Connection success/failure consistent with whether port is open on the target.
- External IP expansion: Grow from ~9 to 50+ fixed IPs per category plus random generation for CDN/cloud long-tail. Target hundreds of unique destinations per scenario.
- Zeek dns.log format: New format definition for DNS query/response logging.
- System model enhancement: Optional inline services field on System (e.g., services: ["dns-client", "ntp-client", "smb"]). Auto-populated from OS type if not specified. Hybrid approach: auto-generate defaults, allow scenario overrides. No separate records for host and services; all in one System definition.
- System traffic loop: New generation pass per hour for OS-appropriate system traffic (DNS lookups, NTP sync, Windows Update, SMB browsing). Target ~20-30% of total output.
- System process trees: Generate OS-appropriate boot processes at scenario start (Windows: System→smss→csrss→wininit→services→svchost; Linux: init/systemd→cron, sshd, rsyslogd).
- Scheduled tasks: Periodic system activities (Windows Defender scans, logrotate, package update checks) at regular intervals with slight jitter.
- Soft work-hour ramp: Replace binary on/off with sigmoid curve. Gradual morning ramp (10%→100% over ~1 hour), soft lunch dip (50% not 0%), evening tail (20% for 1-2 hours post-end), occasional late-night activity (1-3% probability).
- Activity clusters: Replace uniform event distribution with burst model. Each "activity" becomes a cluster of 3-15 correlated events over 5-30 seconds (e.g., logon→process spawns→connections). Inter-cluster gaps follow exponential distribution (2-15 minutes).
- Per-user work hour jitter: Randomize each user's start/end/lunch ±30min from persona defaults. Applied once at init, consistent throughout scenario.
- Per-persona behavioral differentiation: Distinct cluster templates per persona type. Developers: long sustained coding sessions. Executives: short frequent email/calendar bursts. Analysts: medium DB-heavy clusters.
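The activity-cluster model described in the list above can be sketched as a small generator. This is an illustration under assumptions, not the planned implementation: the function name and the 6-minute mean gap are invented, and "exponential distribution (2-15 minutes)" is interpreted here as an exponential draw clamped to that range.

```python
from __future__ import annotations

import random
from datetime import datetime, timedelta, timezone


def generate_clusters(
    start: datetime,
    end: datetime,
    rng: random.Random,
    mean_gap_minutes: float = 6.0,  # assumed mean, not specified in the plan
) -> list[list[datetime]]:
    """Return bursts of event timestamps separated by exponential gaps."""
    clusters: list[list[datetime]] = []
    cursor = start
    while True:
        # Exponential inter-cluster gap, clamped to the 2-15 minute range
        gap = min(max(rng.expovariate(1.0 / mean_gap_minutes), 2.0), 15.0)
        cursor += timedelta(minutes=gap)
        if cursor >= end:
            break
        size = rng.randint(3, 15)        # 3-15 correlated events per cluster
        span = rng.uniform(5.0, 30.0)    # cluster lasts 5-30 seconds
        offsets = sorted(rng.uniform(0.0, span) for _ in range(size))
        # Events in the final cluster may spill slightly past `end`;
        # boundary clamping is a separate step and is not shown here
        clusters.append([cursor + timedelta(seconds=o) for o in offsets])
    return clusters


rng = random.Random(42)  # seeded so repeated runs give the same clusters
day_start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
clusters = generate_clusters(day_start, day_start + timedelta(hours=8), rng)
```

Passing a seeded random.Random instance, rather than using the module-level functions, keeps a sketch like this reproducible for a given seed.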
| Component | Status |
|---|---|
| CLI (eforge init, eforge generate, eforge validate, eforge version, eforge install-skills) | Complete |
| Scenario Pydantic models | Complete |
| 7 format definitions (YAML) | Complete |
| 7 emitters (Windows, Zeek, ECAR, Syslog, Bash History, Snort, Web) | Complete |
| State manager (sessions, processes, connections) | Complete |
| Persona-based activity generation | Complete |
| Network visibility / sensor modeling | Complete |
| Ground truth generation | Complete |
| Schema + cross-reference validation | Complete |
| Parallel emitter-level generation | Complete |
| Progress reporting (Rich) | Complete |
| Timezone handling | Complete |
| OS-aware activity generation (Windows + Linux) | Complete |
| eforge validate command | Complete |
| eforge install-skills command | Complete |
| Skills (/eforge scenario, /eforge generate, /eforge validate) | Complete |
| Persona library files (15 personas) | Complete |
| eforge evaluate command | Complete (Phase 4) |
| Evaluation framework (5 dimensions, 23 sub-scores) | Complete (Phase 4) |
| /eforge evaluate skill | Complete (Phase 4) |
| Data realism improvements (SIDs, event diversity, protocol mix, timing) | Phase 5 (planned) |
Short-term (post-MVP):
- Checkpointing and resume for long-running generation jobs
- Large dataset optimization (100M+ events, memory-mapped writes)
- Config file inheritance/templating
- Additional log formats (cloud providers, databases)
- PyPI package distribution
Medium-term:
- Poisson/Hawkes process timing model (upgrade from Phase 5.5 activity clusters to self-exciting point process for statistically rigorous inter-arrival distributions)
- Web UI for scenario creation
- Streaming output to SIEM/data lakes
- Log format auto-detection from samples
Long-term:
- OT/ICS environment simulation
- Real-time log streaming mode (not batch generation)
- Collaborative scenario editing
- Scenario marketplace (share/download scenarios)
- Integration with attack frameworks (CALDERA, Atomic Red Team)
- Cloud provider logs (CloudTrail, Azure Activity, GCP Audit)
Must NOT block:
- LLM client integration: llm/ package created when LLM integration is needed (future); Bedrock/OpenAI client plugs in here
- Real-time streaming: State manager and emitters designed to work event-by-event, not requiring full dataset in memory
- New log formats: Format engine is data-driven, adding formats requires only a new YAML definition and emitter class
- Web UI: Business logic separated from CLI, can wrap with API layer
- Distributed generation: State can be partitioned (per-user, per-system)
Abstractions to maintain:

    from abc import ABC, abstractmethod
    from dataclasses import dataclass

    # Log Emitter base class (uniform interface for all formats)
    class LogEmitter(ABC):
        @abstractmethod
        def emit_event(self, event: Event, state: StateManager) -> None: ...

        @abstractmethod
        def flush(self) -> None: ...

    # State Manager (encapsulates all runtime state)
    class StateManager:
        # Thread-safe state access; only StateManager can mutate state
        def create_session(self, ...) -> str: ...  # Returns LogonID
        def get_active_sessions(self) -> dict[str, ActiveSession]: ...
        def create_process(self, ...) -> int: ...  # Returns PID
        # etc.

    # Format Definition (declarative, loaded from YAML)
    @dataclass
    class FormatDefinition:
        name: str
        common_fields: list[FieldDefinition]
        variants: list[VariantDefinition]
        output_template: str

- Scenario schema versioning (enable backward compatibility)
MVP will NOT:
- Generate bit-perfect binary EVTX files (XML output by default)
- Support binary log formats (Snort uses fast alert format, not pcap)
- Perform network traffic capture simulation (packet-level)
- Simulate actual malware execution (this is synthetic, not sandboxing)
- Generate logs for systems without format definitions
- Guarantee detection rule triggering (depends on SIEM/tool configuration)
- Provide bit-perfect reproducibility across runs (save and reuse scenario files to regenerate equivalent datasets)
- Checkpoint and resume interrupted generation jobs
Performance bounds (MVP):
- Max 1000 users (technical limit, not enforced)
- Max 30-day time windows (technical limit, not enforced)
- Single machine execution (no distributed generation)
- Emitter-level parallelization only (not user-level or time-slice)
MVP is successful if:
- Can generate realistic 8-hour dataset for 100 users in < 30 seconds
- Generated logs pass format validation for all 7 formats
- Cross-log consistency checks pass (no orphaned references)
- /eforge scenario skill can produce valid scenarios for common use cases
- 95%+ test coverage achieved
- 3+ external users successfully generate custom scenarios
- Generated logs successfully imported into Splunk/ELK without errors
Quality bar:
- Security researcher can use generated data for detection rule development
- Threat hunter cannot immediately distinguish synthetic from real logs (structural examination)
- Educator can create reproducible lab exercises with specific ground truth
- Generated datasets exhibit realistic temporal patterns and user behaviors