# Data & Synthetic Generator
Understanding the dataset is critical for interpreting the model's behavior. Everything in this project is anchored to the synthetic CSV produced by `generator/generate_cyber_incidents.py`.
## Dataset overview

- Path: `data/cyber_incidents_simulated.csv`
- Default volume: 100,000 incidents spanning the 2024 calendar year
- Classes: `phishing`, `malware`, `access_abuse`, `data_exfiltration`, `policy_violation`, `web_attack`, `benign_activity`
- Artifacts: All notebooks, the CLI, and the baseline model load from this CSV, so keep it in sync with any custom changes.
## Schema cheat sheet

| Column | Description |
|---|---|
| `event_id` | Sequential identifier generated at write time |
| `timestamp` | Randomized datetime between 2024-01-01 and 2024-12-31 |
| `log_source` | Source system (email gateway, EDR, proxy, firewall, DLP, etc.) |
| `event_type` | One of the seven incident classes, optionally flipped for noise |
| `severity` | info–critical, biased per event type (e.g., ransomware skews higher) |
| `mitre_technique` | Representative ATT&CK technique(s) for that event |
| `user`, `device` | Named accounts and hosts targeted in the scenario |
| `src_ip`, `dest_ip`, `src_country`, `dest_country` | Network attribution fields |
| `src_port`, `dest_port`, `protocol` | Transport context tuned to the event type |
| `detection_rule` | Label for the analytic or alert that fired |
| `is_true_positive` | Simple flag indicating whether the scenario is a true incident |
| `description` | Full narrative with noise, typos, and abbreviated tokens |
| `description_short` | SOC-friendly summary |
| `description_user_report` | "How a user might describe it" phrasing |
| `short_log` | WAF/SIEM-style single-line log for quick scanning |
Use the rich text fields (`description`, `description_short`, `description_user_report`, `short_log`) interchangeably across notebooks to test robustness against slightly different perspectives.
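After loading the CSV, a quick pandas check helps confirm the schema above and the class balance. A minimal sketch, assuming the default path and an environment with pandas installed:

```python
import pandas as pd

df = pd.read_csv("data/cyber_incidents_simulated.csv", parse_dates=["timestamp"])

# Verify every column from the cheat sheet is present
expected = {
    "event_id", "timestamp", "log_source", "event_type", "severity",
    "mitre_technique", "user", "device", "src_ip", "dest_ip",
    "src_country", "dest_country", "src_port", "dest_port", "protocol",
    "detection_rule", "is_true_positive", "description",
    "description_short", "description_user_report", "short_log",
}
missing = expected - set(df.columns)
print("missing columns:", missing or "none")

# Distribution across the seven incident classes
print(df["event_type"].value_counts(normalize=True).round(3))
```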
## Realism features baked into the generator

The `generate_cyber_incidents.py` script introduces several mechanisms to avoid a too-perfect dataset:
- Confusable classes: Sets such as `web_attack` vs `access_abuse` vs `benign_activity` share vocabulary on purpose.
- Label noise: `LABEL_NOISE_RATE` and `NEIGHBOR_LABELS` flip about 8% of labels to neighboring classes (see the sketch after this list).
- Severity and MITRE biasing: Helper functions select severities, MITRE techniques, and log sources that match the event type.
- Narrative noise: Spelling swaps, abbreviations, and templated verbs keep `description` slightly messy while `description_short` stays concise.
- True/false positive signal: `is_true_positive` is biased toward `malware`, `data_exfiltration`, etc., so analysts can explore downstream filtering rules.
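For intuition, here is a hypothetical sketch of how that label-flipping step could be implemented; it is not the generator's actual code, and only the `LABEL_NOISE_RATE` and `NEIGHBOR_LABELS` names come from the script:

```python
import random

LABEL_NOISE_RATE = 0.08  # roughly 8% of labels get flipped

# Hypothetical neighbor map: each class lists the classes it is
# deliberately confusable with (illustrative entries only).
NEIGHBOR_LABELS = {
    "web_attack": ["access_abuse", "benign_activity"],
    "access_abuse": ["web_attack", "policy_violation"],
    "phishing": ["benign_activity", "malware"],
}

def maybe_flip(label: str) -> str:
    """Return the original label, or a neighboring class ~8% of the time."""
    neighbors = NEIGHBOR_LABELS.get(label)
    if neighbors and random.random() < LABEL_NOISE_RATE:
        return random.choice(neighbors)
    return label
```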
These touches make downstream evaluation (confusion matrices, scenario notebooks, CLI uncertainty logic) feel closer to a real SOC dataset.
## Regenerating or customizing the dataset

- Regenerate with defaults (100k rows to the same CSV):
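One way to do this from the repo root, assuming `generate_events` falls back to the defaults described above (100,000 rows to `data/cyber_incidents_simulated.csv`) when called without arguments:

```bash
python - <<'PY'
from generator.generate_cyber_incidents import generate_events

# Assumed: a no-arg call uses the documented defaults
# (100,000 rows -> data/cyber_incidents_simulated.csv).
generate_events()
PY
```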
- Customize volume or output path by calling `generate_events` directly:

```bash
python - <<'PY'
from generator.generate_cyber_incidents import generate_events
generate_events(n_events=25000, outfile="data/cyber_incidents_small.csv")
PY
```
- Tweak behavior inside `generator/generate_cyber_incidents.py`:
    - Adjust `EVENT_TYPES` to add/remove classes.
    - Change `LABEL_NOISE_RATE` / `NEIGHBOR_LABELS` for more or less confusion.
    - Extend vocab lists (`DETECTION_RULES`, `MALWARE_SUBTYPES`, etc.) to add new textures.
Re-run any affected notebooks or retrain models after regenerating so artifacts remain consistent with the CSV on disk. The `tests/test_model_artifacts.py` suite will fail fast if the expected files or class labels go missing.
## Production Generation Scripts
For large-scale dataset generation (100K+ events) with LLM enhancement, checkpointing, and monitoring, the project includes professional bash orchestration scripts:
### Quick Start

```bash
# Launch 100K event generation (default)
cd generator
./launch_generator.sh

# Monitor in real-time
./monitor_generation.sh --watch
```
### Scripts Overview

| Script | Purpose |
|---|---|
| `launch_generator.sh` | Production launcher with LLM integration, checkpointing, and background execution |
| `monitor_generation.sh` | Real-time monitoring dashboard with GPU metrics, throughput analysis, and ETA calculation |
### Key Features

`launch_generator.sh`:

- LLM Enhancement: Optional Llama-2-13B-Chat integration for realistic narrative rewrites (1% of events by default)
- Checkpoint/Resume: Automatic progress saving every 100 events; resume interrupted generations seamlessly
- Background Processing: Uses `nohup` for SSH-safe, unattended operation
- Interactive Resume: Prompts when existing files are detected (resume vs. fresh start)
- Environment Configuration: Automatically sets LLM model paths, rewrite probability, and temperature
`monitor_generation.sh`:
- Process Metrics: Real-time CPU/memory usage, runtime, efficiency (events/CPU%, events/GB)
- Progress Tracking: Visual progress bar, percentage complete, ETA with full timestamp
- GPU Acceleration (Apple Silicon): Metal GPU detection, LLM model info, inference speed (~18.5 tokens/sec)
- Throughput Analysis: Events/second average, trend detection (accelerating/declining/steady)
- Performance Dashboard: Chunk timing, last 5 log entries, file sizes, quick action commands
- Watch Mode: Auto-refresh every N seconds (default 30s)
- Simple Mode: ASCII symbols for problematic terminals (`--simple` or `--simple-color`)
### Usage Examples

```bash
# Generate 50K events with custom name
./launch_generator.sh 50000 training_data

# Fresh start (delete existing files)
./launch_generator.sh 100000 cyber_incidents_simulated --fresh

# Monitor with auto-refresh every 10 seconds
./monitor_generation.sh --watch 10

# Monitor custom dataset in simple mode
./monitor_generation.sh my_dataset --simple-color --watch
```
### LLM Configuration (in `launch_generator.sh`)

```bash
export NLP_TRIAGE_LLM_GENERATOR=1       # Enable LLM generation
export NLP_TRIAGE_LLM_REWRITE_PROB=0.01 # 1% rewrite rate (balanced quality/speed)
export NLP_TRIAGE_LLM_TEMPERATURE=0.2   # Focused, deterministic output
export NLP_TRIAGE_LLM_MAX_RETRIES=3     # Retry failed LLM calls
export NLP_TRIAGE_LLM_BACKEND="models/llama-2-13b-chat.Q5_K_S.gguf"
```
Performance:
- With LLM (1% rewrite): ~3-5 events/sec (100K events in 6-9 hours)
- Without LLM: ~50-100 events/sec (100K events in 20-30 minutes)
### Output Files

All outputs are written to the `data/` directory:

| File | Description |
|---|---|
| `{dataset_name}.csv` | Main dataset (100K rows, ~99MB uncompressed) |
| `{dataset_name}.log` | Detailed generation log with timestamps |
| `{dataset_name}_checkpoint.json` | Progress state for resume capability |
| `{dataset_name}_llm_report.json` | LLM usage statistics (rewrites attempted/applied, success rate, timing) |
| `nohup_output.log` | Raw stdout/stderr from the background process |
### Checkpointing System
Progress saved every 100 events:
```json
{
  "last_completed_event": 25000,
  "total_events": 100000,
  "chunks_written": 250,
  "timestamp": "2024-11-22T10:30:15",
  "status": "running"
}
```
Resume behavior: the generator reads the checkpoint, continues from `last_completed_event + 1`, and appends to the CSV without re-writing the header.
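A minimal sketch of that resume logic, assuming the checkpoint layout shown above (illustrative only; the real implementation lives in `generate_cyber_incidents.py` and may differ):

```python
import csv
import json
import os

CHECKPOINT = "data/cyber_incidents_simulated_checkpoint.json"
OUTFILE = "data/cyber_incidents_simulated.csv"

def resume_point() -> int:
    """First event index still to generate (0 when starting fresh)."""
    if not os.path.exists(CHECKPOINT):
        return 0
    with open(CHECKPOINT) as fh:
        state = json.load(fh)
    return state["last_completed_event"] + 1  # continue after the last saved event

def append_rows(rows: list[dict], fieldnames: list[str]) -> None:
    """Append rows to the CSV, writing the header only for a brand-new file."""
    write_header = not os.path.exists(OUTFILE)
    with open(OUTFILE, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```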
### Monitoring Dashboard Sample

```text
🛠️ CYBERSECURITY DATASET GENERATION MONITOR
Dataset: cyber_incidents_simulated
Started: Fri Nov 22 08:15:30 CST 2024
ETA: Fri Nov 22 14:32:15 CST 2024

📈 PROCESS STATUS
✅ Generation process RUNNING (PID: 12345)
CPU Usage: 125.3% (7.8% per core, 16 cores)
Memory Usage: 8.2% (2.1GB / 32.0GB)

🎮 GPU: Metal (Apple M2 Max - 38 cores)
LLM Model: llama-2-13b-chat.Q5_K_S.gguf
Enhancement: 0.8% of events (99.4% success)
GPU Throughput: 375.2/hr (9.6s avg per rewrite)
Efficiency: 1,234 events/CPU%, 15,600 events/GB

📈 PROGRESS STATUS
🚀 25000/100000 (25.0%)
[████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 25.0%
📈 Throughput: Steady

⚡ PERFORMANCE
Generation runtime: 2h 15m 42s
Time per event: 0.3s
Events/second: 3.088
Estimated time remaining: 6h 15m
```
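The ETA and events/second figures are plain rate arithmetic. A sketch of the calculation the monitor presumably performs (variable names are illustrative, not taken from `monitor_generation.sh`):

```python
def eta_seconds(done: int, total: int, elapsed_s: float) -> float:
    """Estimate remaining seconds from the average throughput so far."""
    rate = done / elapsed_s        # events per second observed so far
    return (total - done) / rate   # remaining events at that rate

# Example matching the dashboard: 25,000 of 100,000 events after 2h 15m 42s
remaining = eta_seconds(25_000, 100_000, 2 * 3600 + 15 * 60 + 42)
print(f"~{remaining / 3600:.1f} hours remaining")  # ≈ 6.8 hours
```

The full-run average gives ~6.8 hours where the sample shows 6h 15m, so the script likely weights recent throughput rather than the whole run.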
### Process Management

```bash
# Check if generation is running
pgrep -f "generate_cyber_incidents.py"

# Kill running generation
pkill -f generate_cyber_incidents

# View real-time logs
tail -f data/cyber_incidents_simulated.log

# Inspect checkpoint
cat data/cyber_incidents_simulated_checkpoint.json | jq .
```
### Customization
Environment variables (set before launching):
```bash
# Higher LLM rewrite rate (10% of events)
export NLP_TRIAGE_LLM_REWRITE_PROB=0.10

# More creative LLM output
export NLP_TRIAGE_LLM_TEMPERATURE=0.7

# Verbose LLM debugging
export NLP_TRIAGE_LLM_DEBUG=1

# Custom model path
export NLP_TRIAGE_LLM_BACKEND="/path/to/custom-model.gguf"
```
Direct Python call (manual control):
```bash
cd generator
python generate_cyber_incidents.py \
  --n-events 50000 \
  --outfile ../data/my_dataset.csv \
  --start-date 2024-01-01 \
  --end-date 2024-12-31 \
  --chunk-size 100 \
  --use-llm \
  --rewrite-report ../data/my_dataset_llm_report.json
```
### Troubleshooting

Slow generation: Lower `NLP_TRIAGE_LLM_REWRITE_PROB` (default 0.01 = 1%) or disable the LLM entirely by commenting out `export NLP_TRIAGE_LLM_GENERATOR=1` in `launch_generator.sh`.

Resume not working: Check that the checkpoint file exists (`ls -lh data/{dataset}_checkpoint.json`) and wasn't deleted. Ensure the dataset name matches.

Monitor shows "No active generation": Specify the dataset name explicitly: `./monitor_generation.sh my_custom_dataset --watch`

LLM model not found: Download the Llama-2-13B-Chat GGUF (Q5_K_S quantization, ~7.5GB) to the `models/` directory, or disable the LLM in the launcher script.
### Complete Documentation
For comprehensive usage, GPU metrics, performance optimization, example workflows, and advanced configurations, see Production Generation Guide.