Production Dataset Generation¶
This guide covers production-scale dataset generation using LLM-enhanced tooling for creating realistic, large-scale cybersecurity incident datasets.
Overview¶
The production generation system consists of two key bash scripts that orchestrate the Python generator with advanced features:
- `launch_generator.sh` - Launches dataset generation with checkpointing, LLM integration, and background processing
- `monitor_generation.sh` - Real-time monitoring dashboard with GPU metrics, throughput analysis, and performance tracking
These scripts are designed for long-running, unattended dataset generation (hours to days) with automatic resume capability.
Quick Start¶
Basic Generation (100K events, default settings)¶
./launch_generator.sh
Custom Dataset Size and Name¶
# Generate 50,000 events with custom name
./launch_generator.sh 50000 my_custom_dataset
# Fresh start (delete existing files)
./launch_generator.sh 100000 cyber_incidents_simulated --fresh
Monitor Progress¶
# Single snapshot
./monitor_generation.sh
# Auto-refresh every 30 seconds
./monitor_generation.sh --watch
# Auto-refresh every 10 seconds
./monitor_generation.sh --watch 10
# Monitor custom dataset
./monitor_generation.sh my_custom_dataset --watch
launch_generator.sh¶
Purpose¶
Launches the Python dataset generator as a background process with:
- LLM-enhanced generation: Optional Llama-2-13B-Chat for realistic narrative rewrites
- Checkpoint/resume: Automatic progress saving and resume capability
- Background processing: Runs via `nohup` for SSH-safe operation
- Error recovery: Handles interruptions gracefully
Usage¶
./launch_generator.sh [events] [dataset_name] [--fresh]
Parameters¶
| Parameter | Description | Default |
|---|---|---|
| `events` | Number of incidents to generate | `100000` |
| `dataset_name` | Output dataset name (no `.csv` extension) | `cyber_incidents_simulated` |
| `--fresh` | Delete existing files and start fresh | Resume from checkpoint |
Examples¶
# Default: 100K events, resume from checkpoint
./launch_generator.sh
# 50K events with custom name
./launch_generator.sh 50000 training_data
# 200K events, fresh start (delete old files)
./launch_generator.sh 200000 large_dataset --fresh
# Resume existing generation
./launch_generator.sh 100000 cyber_incidents_simulated
LLM Configuration¶
The script automatically configures the LLM with production-optimized settings:
export NLP_TRIAGE_LLM_GENERATOR=1 # Enable LLM generation
export NLP_TRIAGE_LLM_REWRITE_PROB=0.01 # 1% of events get LLM enhancement
export NLP_TRIAGE_LLM_TEMPERATURE=0.2 # Focused, deterministic output
export NLP_TRIAGE_LLM_MAX_RETRIES=3 # Retry failed LLM calls
export NLP_TRIAGE_LLM_DEBUG=0 # Disable verbose logging
export NLP_TRIAGE_LLM_BACKEND="models/llama-2-13b-chat.Q5_K_S.gguf"
Model: Llama-2-13B-Chat (7.5GB GGUF quantized)
- Requires: ~10GB RAM (M1/M2 Mac with Metal acceleration)
- Performance: ~10-20 tokens/sec on Apple Silicon
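As a sketch of how these variables might be consumed on the Python side, a loader with the documented defaults could look like the following (illustrative only; the real generator's parsing may differ):

```python
import os

def load_llm_config() -> dict:
    """Read the NLP_TRIAGE_LLM_* environment variables, falling back to
    the documented production defaults. Hypothetical helper, not the
    generator's actual code."""
    return {
        "enabled": os.environ.get("NLP_TRIAGE_LLM_GENERATOR", "0") == "1",
        "rewrite_prob": float(os.environ.get("NLP_TRIAGE_LLM_REWRITE_PROB", "0.01")),
        "temperature": float(os.environ.get("NLP_TRIAGE_LLM_TEMPERATURE", "0.2")),
        "max_retries": int(os.environ.get("NLP_TRIAGE_LLM_MAX_RETRIES", "3")),
        "debug": os.environ.get("NLP_TRIAGE_LLM_DEBUG", "0") == "1",
        "model_path": os.environ.get(
            "NLP_TRIAGE_LLM_BACKEND", "models/llama-2-13b-chat.Q5_K_S.gguf"
        ),
    }
```

Centralizing defaults like this keeps the shell script and the Python process agreeing on what "unset" means.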
Output Files¶
| File | Description |
|---|---|
| `data/{dataset_name}.csv` | Main dataset (100K rows, ~99MB) |
| `data/{dataset_name}.log` | Detailed generation log with timestamps |
| `data/{dataset_name}_checkpoint.json` | Progress state for resume capability |
| `data/{dataset_name}_llm_report.json` | LLM usage statistics and metrics |
| `data/nohup_output.log` | Raw stdout/stderr from background process |
Interactive Resume Handling¶
If existing files are detected without --fresh, you'll be prompted:
⚠️ WARNING: Existing files detected:
- data/cyber_incidents_simulated.csv
- data/cyber_incidents_simulated_checkpoint.json
Choose an option:
r) Resume from checkpoint (default)
f) Fresh start (delete existing files)
q) Quit
Choice (r/f/q):
Process Management¶
# Check if generation is running
pgrep -f "generate_cyber_incidents.py"
# Kill running generation
pkill -f generate_cyber_incidents
# View background output
tail -f data/nohup_output.log
# View detailed logs
tail -f data/cyber_incidents_simulated.log
monitor_generation.sh¶
Purpose¶
Real-time monitoring dashboard providing:
- Process status: CPU, memory, runtime metrics
- Progress tracking: Events completed, ETA, throughput
- GPU acceleration: Metal/Apple Silicon metrics when using LLM
- Performance analysis: Throughput trends, efficiency metrics
- File status: Dataset size, log sizes
Usage¶
./monitor_generation.sh [dataset_name] [--watch [interval]] [--simple | --simple-color]
Options¶
| Option | Description |
|---|---|
| `dataset_name` | Name of dataset to monitor (default: `cyber_incidents_simulated`) |
| `--watch, -w [interval]` | Auto-refresh mode (default 30s interval) |
| `--simple, -s` | ASCII symbols, no colors (for problematic terminals) |
| `--simple-color` | ASCII symbols with colors (best for SSH/tmux) |
| `--help, -h` | Show help message |
Examples¶
# Single snapshot
./monitor_generation.sh
# Auto-refresh every 30 seconds
./monitor_generation.sh --watch
# Auto-refresh every 10 seconds
./monitor_generation.sh --watch 10
# Monitor custom dataset
./monitor_generation.sh my_dataset --watch
# Simple mode for SSH/tmux
./monitor_generation.sh --simple-color --watch
Dashboard Sections¶
1. Process Status¶
- PID: Background process identifier
- CPU Usage: Per-core and total CPU utilization
- Memory Usage: Process memory consumption vs total system RAM
- Runtime: Process uptime
- GPU Acceleration (Apple Silicon only):
- Metal GPU detection and core count
- LLM model information (Llama-2-13B-Chat)
- GPU throughput metrics (rewrites/hr, tokens/sec)
- Enhancement rate (% of events LLM-enhanced)
2. Progress Status¶
- Real-time progress: Current/Total events with percentage
- Progress bar: Visual 50-character bar with completion indicator
- Generation status: Generating/Initializing/LLM loading
- Throughput trend: Accelerating/Declining/Steady analysis
- ETA: Estimated completion time with full timestamp
3. File Status¶
- Dataset file: Size in human-readable format (MB/GB) with event count
- Log file: Generation log size
- Nohup output: Background process output size
4. Recent Activity¶
- Last 5 log entries: Tail of detailed log
- Chunk analysis:
- Average chunk interval (time between chunks)
- Last chunk interval (most recent timing)
- Total chunks completed
5. Performance Metrics¶
- Generation runtime: Total time since start
- Time per event: Average seconds per incident
- Events/second: Throughput rate
- Progress velocity: Percentage completion per hour
- Estimated time remaining: Hours/minutes until completion
- Resource efficiency: Events per CPU%, Events per GB RAM
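The metrics above reduce to simple arithmetic over the start time and event counts. A minimal Python sketch of that arithmetic (illustrative; the monitor script itself computes these in bash):

```python
from datetime import datetime, timedelta

def performance_metrics(start: datetime, now: datetime,
                        done: int, total: int) -> dict:
    """Derive dashboard performance numbers from elapsed time and
    event counts."""
    elapsed = (now - start).total_seconds()
    rate = done / elapsed                      # events/second
    remaining = (total - done) / rate          # seconds left at current rate
    return {
        "events_per_sec": rate,
        "time_per_event": elapsed / done,
        "velocity_pct_per_hour": done / total * 100 / (elapsed / 3600),
        "eta": now + timedelta(seconds=remaining),
    }
```

For example, 25,000 of 100,000 events after 2h 15m 42s gives roughly 3.07 events/sec and ~11%/hour velocity, matching the sample dashboard output.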
Sample Output¶
🛠️ CYBERSECURITY DATASET GENERATION MONITOR
==============================================
Dataset: cyber_incidents_simulated
Started: Fri Nov 22 08:15:30 CST 2024
ETA: Fri Nov 22 14:32:15 CST 2024
📈 PROCESS STATUS
-----------------
✅ Generation process RUNNING (PID: 12345)
CPU Usage: 125.3% (7.8% per core, 16 cores total)
Memory Usage: 8.2% (2.1GB of 32.0GB total)
Runtime: 2:15:42 (process uptime)
🎮 GPU Acceleration: Metal (Apple M2 Max - 38 cores)
LLM Model: llama-2-13b-chat.Q5_K_S.gguf
LLM Activity: 845/850 rewrites processed
Enhancement: 0.8% of events (99.4% success rate)
GPU Throughput: 375.2/hr (9.6s avg per rewrite)
Inference Speed: ~18.5 tokens/sec
Efficiency: 1,234 events/CPU%, 15,600 events/GB
📈 PROGRESS STATUS
------------------
🚀 Generation active: 25000/100000 (25.0%)
[████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 25.0% Complete
Status: Generating events...
📈 Throughput: Steady
📁 FILE STATUS
--------------
✅ Dataset: 24.5M (25000 events)
✅ Log file: 1.2M
✅ Nohup output: 45K
📝 RECENT ACTIVITY
------------------
Last 5 log entries:
2024-11-22 10:30:15 - Writing chunk 250
2024-11-22 10:30:12 - Generating events: 25000/100000
2024-11-22 10:29:58 - Writing chunk 249
...
📦 Chunk Analysis:
Average chunk interval: 5m 23s
Last chunk interval: 5m 15s
Total chunks completed: 250
⚡ PERFORMANCE
-------------
Generation runtime: 2h 15m 42s
Started: 2024-11-22 08:15:30
Progress velocity: 11.1%/hour
Time per event: 0.3s
Estimated time remaining: 6h 15m
Events/second: 3.088 (avg)
🛠️ QUICK ACTIONS
----------------
Monitor real-time: tail -f ../data/cyber_incidents_simulated.log
Check progress: cat ../data/cyber_incidents_simulated_checkpoint.json | jq
Kill process: pkill -f generate_cyber_incidents
View nohup output: tail -f ../data/nohup_output.log
🚀 STATUS: Generation active (25.0% complete)
GPU Metrics (Apple Silicon)¶
When LLM enhancement is enabled on Apple Silicon Macs, the monitor shows:
- GPU Model: e.g., "Apple M2 Max - 38 cores"
- LLM Model: Loaded model file (llama-2-13b-chat.Q5_K_S.gguf)
- Enhancement Rate: Percentage of events LLM-enhanced
- GPU Throughput: Rewrites per hour, average time per rewrite
- Inference Speed: Estimated tokens/second
- Success Rate: % of successful LLM rewrites
Performance Trending¶
The monitor calculates throughput trends by sampling recent progress:
- Accelerating: Throughput increasing (shows +X%)
- Steady: Consistent throughput
- Declining: Throughput decreasing (shows -X%)
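A hypothetical reimplementation of this classification, assuming a simple relative-change threshold between two throughput samples (the actual sampling logic in `monitor_generation.sh` may differ):

```python
def throughput_trend(prev_rate: float, curr_rate: float,
                     tolerance: float = 0.05) -> str:
    """Classify the trend between two throughput samples (events/sec).
    Changes within +/- tolerance are treated as noise."""
    if prev_rate <= 0:
        return "Steady"
    change = (curr_rate - prev_rate) / prev_rate
    if change > tolerance:
        return f"Accelerating (+{change:.0%})"
    if change < -tolerance:
        return f"Declining ({change:.0%})"
    return "Steady"
```

The tolerance matters: chunk writes arrive in bursts, so a small dead band prevents the trend flapping between labels on every refresh.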
Checkpointing System¶
How It Works¶
The generator saves progress every 100 events (configurable via --chunk-size):
{
"last_completed_event": 25000,
"total_events": 100000,
"chunks_written": 250,
"timestamp": "2024-11-22T10:30:15",
"status": "running",
"generation_start_time": "2024-11-22T08:15:30"
}
Resume Behavior¶
When you restart the generator:
- Checks for existing checkpoint file
- Reads `last_completed_event`
- Continues from event 25001
- Appends to the existing CSV file

No data is duplicated or lost - the CSV header is NOT re-written on resume.
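The resume steps above can be sketched in Python, assuming the documented checkpoint schema (the real generator may differ in detail):

```python
import json
import os

def resume_point(checkpoint_path):
    """Return (first_event_to_generate, write_header).
    Sketch of the resume logic, assuming the checkpoint schema shown
    above; not the generator's actual implementation."""
    if not os.path.exists(checkpoint_path):
        return 1, True                      # fresh run: write the CSV header
    with open(checkpoint_path) as fh:
        state = json.load(fh)
    # Continue after the last completed event; append without a header.
    return state["last_completed_event"] + 1, False
```

The `write_header` flag is what guarantees the "header is NOT re-written" behavior: the CSV is opened in append mode on resume, and the header row is only emitted on a fresh start.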
Manual Checkpoint Inspection¶
# View checkpoint status
cat data/cyber_incidents_simulated_checkpoint.json | jq
# Check progress percentage
jq -r '"\(.last_completed_event)/\(.total_events) = \((.last_completed_event / .total_events * 100) | tostring + "%")"' \
data/cyber_incidents_simulated_checkpoint.json
LLM-Enhanced Generation¶
What Gets Enhanced?¶
With NLP_TRIAGE_LLM_REWRITE_PROB=0.01 (default), approximately 1% of events receive LLM enhancement:
- Narrative rewriting: More realistic phrasing and SOC language
- Technical detail addition: Specific IPs, file hashes, timestamps
- Contextual enrichment: Background information, user quotes
- Complexity variation: Mix of terse/verbose descriptions
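The probabilistic gating implied by `NLP_TRIAGE_LLM_REWRITE_PROB` can be sketched as follows; `llm_rewrite` is a hypothetical stand-in for the real Llama-2 call, not the generator's actual function name:

```python
import random

def maybe_enhance(event, rewrite_prob=0.01, rng=None):
    """Apply LLM rewriting to roughly `rewrite_prob` of events,
    leaving the rest (and the input dict) untouched."""
    rng = rng or random.Random()
    if rng.random() < rewrite_prob:
        event = dict(event)  # copy so the original stays intact
        event["description"] = llm_rewrite(event["description"])
        event["llm_enhanced"] = True
    return event

def llm_rewrite(text):
    # Placeholder: the production path calls Llama-2-13B-Chat here.
    return f"[rewritten] {text}"
```

Because each event is gated independently, the realized enhancement rate fluctuates around the configured probability (the sample dashboard shows 0.8% against a 1% setting).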
LLM Report¶
After generation, check data/{dataset_name}_llm_report.json:
{
"rewrites_attempted": 1050,
"rewrites_applied": 1043,
"rewrite_success_rate": 0.993,
"average_rewrite_time_seconds": 9.6,
"total_llm_time_seconds": 10012.8,
"model_path": "models/llama-2-13b-chat.Q5_K_S.gguf",
"temperature": 0.2,
"max_retries": 3
}
Metrics explanation:
- rewrites_attempted: Total LLM calls triggered (1% of 100K = ~1000)
- rewrites_applied: Successfully enhanced events
- rewrite_success_rate: Fraction of successful LLM rewrites (typically 0.99+)
- average_rewrite_time_seconds: Seconds per LLM call (~10s for the 13B model)
- total_llm_time_seconds: Cumulative GPU inference time
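These numbers also make it easy to budget LLM overhead before launching: expected inference time is simply events × rewrite probability × average rewrite time. A one-liner to illustrate:

```python
def llm_overhead_hours(n_events, rewrite_prob, avg_rewrite_seconds):
    """Expected cumulative LLM inference time, in hours."""
    return n_events * rewrite_prob * avg_rewrite_seconds / 3600

# 100K events at the default 1% rate, ~9.6s per rewrite:
# 100000 * 0.01 * 9.6 = 9600s, i.e. about 2.7 hours of inference time.
```

This is inference time only; total wall-clock time is higher because base event generation proceeds alongside it.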
Disabling LLM¶
For faster generation without LLM enhancement:
# Edit launch_generator.sh and comment out:
# export NLP_TRIAGE_LLM_GENERATOR=1
# Or manually run without LLM:
python generate_cyber_incidents.py --n-events 100000 --outfile ../data/dataset.csv
Speed comparison:
- With LLM (1% rewrite): ~3-5 events/sec (~6-9 hours for 100K)
- Without LLM: ~50-100 events/sec (~20-30 minutes for 100K)
Advanced Usage¶
Custom Python Call (Manual Control)¶
cd generator
# Direct Python call with all parameters
python generate_cyber_incidents.py \
--n-events 50000 \
--outfile ../data/my_dataset.csv \
--start-date 2024-01-01 \
--end-date 2024-12-31 \
--chunk-size 100 \
--log-file ../data/my_dataset.log \
--checkpoint-file ../data/my_dataset_checkpoint.json \
--use-llm \
--rewrite-report ../data/my_dataset_llm_report.json
Environment Variable Customization¶
# Higher LLM rewrite rate (10% of events)
export NLP_TRIAGE_LLM_REWRITE_PROB=0.10
# More creative LLM output
export NLP_TRIAGE_LLM_TEMPERATURE=0.7
# Verbose LLM debugging
export NLP_TRIAGE_LLM_DEBUG=1
# Custom model path
export NLP_TRIAGE_LLM_BACKEND="/path/to/custom-model.gguf"
# Then launch
./launch_generator.sh
Multiple Concurrent Generations¶
You can run multiple generators simultaneously with different dataset names:
# Terminal 1: Generate training dataset
./launch_generator.sh 100000 training_data --fresh
# Terminal 2: Generate validation dataset
./launch_generator.sh 20000 validation_data --fresh
# Terminal 3: Monitor both
./monitor_generation.sh training_data --watch &
./monitor_generation.sh validation_data --watch
Note: Each generator competes for CPU/GPU resources. On Apple Silicon, running two LLM-enhanced generators slows both down by ~50%.
Troubleshooting¶
Generator Won't Start¶
Error: Generation already running (PID: 12345)
Solution:
# Kill existing process
pkill -f generate_cyber_incidents
# Or use --fresh to auto-kill
./launch_generator.sh 100000 my_dataset --fresh
LLM Model Not Found¶
Error: LLM model not found: models/llama-2-13b-chat.Q5_K_S.gguf
Solution:
# Download the model (7.5GB)
cd models
# Use your preferred download method for Llama-2-13B-Chat GGUF Q5_K_S quantization
# Or disable LLM in launch_generator.sh
# Comment out: export NLP_TRIAGE_LLM_GENERATOR=1
Slow Generation Speed¶
Issue: Only 0.5 events/sec with LLM enabled
Causes:
- High rewrite probability: Check `NLP_TRIAGE_LLM_REWRITE_PROB` (should be 0.01-0.05)
- Large model: 13B model is slower than 7B (but higher quality)
- Memory swapping: System RAM exhausted, using swap
- Other processes: GPU/CPU contention
Solutions:
# Lower rewrite probability
export NLP_TRIAGE_LLM_REWRITE_PROB=0.01 # 1% instead of 10%
# Use smaller 7B model (faster, lower quality)
export NLP_TRIAGE_LLM_BACKEND="models/llama-2-7b-chat.Q5_K_S.gguf"
# Disable LLM entirely
# Comment out: export NLP_TRIAGE_LLM_GENERATOR=1
# Close other applications to free RAM/GPU
Monitor Shows "No active generation"¶
Issue: Process is running but monitor doesn't detect it
Cause: Process name mismatch (using custom dataset name)
Solution:
# Pass the dataset name explicitly so the monitor looks at the right files
./monitor_generation.sh my_custom_dataset
Resume Not Working¶
Issue: Generator starts from 0 instead of resuming
Causes:
- Checkpoint file corrupted
- Used the `--fresh` flag
- Dataset name mismatch
Solution:
# Check checkpoint exists
ls -lh data/cyber_incidents_simulated_checkpoint.json
# View checkpoint contents
cat data/cyber_incidents_simulated_checkpoint.json | jq
# If corrupted, delete and start fresh
rm data/cyber_incidents_simulated_checkpoint.json
./launch_generator.sh 100000 cyber_incidents_simulated --fresh
Performance Optimization¶
Maximizing Speed (No LLM)¶
# Disable LLM in launch_generator.sh
# Comment out: export NLP_TRIAGE_LLM_GENERATOR=1
# Larger chunk size (fewer disk writes)
# Edit generate_cyber_incidents.py: --chunk-size 1000
# Expected: ~50-100 events/sec (100K events in 20-30 minutes)
Balancing Quality vs Speed¶
| Configuration | Speed | Quality | Time for 100K |
|---|---|---|---|
| No LLM | 50-100 evt/s | Good | 20-30 min |
| LLM 1% | 3-5 evt/s | Excellent | 6-9 hours |
| LLM 5% | 2-3 evt/s | Best | 10-15 hours |
| LLM 10% | 1-2 evt/s | Overkill | 15-24 hours |
Recommendation: Use 1-2% LLM enhancement for production datasets - provides excellent quality without excessive generation time.
Resource Requirements¶
Minimum:
- 8GB RAM
- 2-core CPU
- 5GB disk space (for 100K events)
Recommended (with LLM):
- 16GB RAM (10GB for model + 6GB for system)
- 4+ core CPU (Apple Silicon M1/M2 ideal)
- 10GB disk space (dataset + logs + checkpoints)
- macOS with Metal support (for GPU acceleration)
Optimal (large-scale generation):
- 32GB+ RAM
- 8+ core CPU
- 50GB+ disk space (multiple datasets)
- Apple Silicon M1 Max/M2 Max/M3 (faster GPU)
Example Workflows¶
Quick Test Dataset (1K events, no LLM)¶
# Edit launch_generator.sh: comment out LLM_GENERATOR
./launch_generator.sh 1000 test_data --fresh
# Wait ~20-30 seconds
./monitor_generation.sh test_data
Production Dataset (100K events, 1% LLM)¶
# Default configuration (already set in launch_generator.sh)
./launch_generator.sh 100000 cyber_incidents_simulated --fresh
# Monitor in another terminal
./monitor_generation.sh --watch 60 # Refresh every minute
# Expected completion: 6-9 hours
# Check completion: grep "completed successfully" data/cyber_incidents_simulated.log
Large-Scale Dataset (500K events, overnight)¶
# Launch before leaving for the day
./launch_generator.sh 500000 large_dataset --fresh
# Monitor remotely via SSH (use simple mode)
ssh your-machine
cd /path/to/generator
./monitor_generation.sh large_dataset --simple-color --watch
# Expected completion: 30-45 hours with 1% LLM
Best Practices¶
- Always use `--watch` mode when monitoring long-running generations
- Check disk space before large generations (`df -h`)
- Use checkpoints - never start fresh unless necessary
- Review LLM report after completion to validate enhancement success rate
- Test with small datasets before committing to 100K+ event generations
- Monitor GPU temperature on long runs (use Activity Monitor on macOS)
- Use `nohup`/background execution for SSH sessions
- Keep rewrite probability low (1-2%) for good quality without excessive time
- Archive completed datasets with LLM reports for reproducibility
- Document your generation settings in dataset metadata files
Next Steps¶
After generating your dataset:
- Validate dataset quality: Check `notebooks/01_explore_dataset.ipynb`
- Train models: Follow notebooks 02-03 for feature engineering and baseline models
- Customize generator: See `data-and-generator.md` for vocabulary/template modifications
- Share your dataset: Consider contributing anonymized versions back to the project
For dataset structure and customization details, see Data & Generator Guide.