Glossary

Terms

Checkpoint

Persistent state file enabling resumable dataset generation. Contains progress, timestamps, and metadata.
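
To make the idea concrete, here is a minimal sketch of a resumable checkpoint stored as JSON. The field names (`completed`, `total`, `updated_at`, `metadata`) and both function names are assumptions chosen for illustration; the project's actual checkpoint schema is not documented in this glossary.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def save_checkpoint(path, completed, total, metadata=None):
    """Atomically write progress so an interrupted run can resume."""
    state = {
        "completed": completed,  # hypothetical field: records generated so far
        "total": total,          # hypothetical field: records requested
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata or {},
    }
    # Write to a temporary file in the same directory, then rename it over
    # the target, so a crash never leaves a half-written checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)
    return state


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A generator could call `load_checkpoint` at startup, skip the first `completed` records, and call `save_checkpoint` after each batch.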

Confidence Threshold

Minimum probability score required to assign a label. A score below the threshold triggers the "uncertain" fallback.
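
The fallback can be sketched as follows. The function name, the label names, and the 0.5 default are assumptions for illustration, not values taken from the project.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Return the top-scoring label, or "uncertain" when its score is too low.

    `probabilities` maps each candidate label to its probability score.
    The 0.5 default is an assumed value, not the project's setting.
    """
    label, score = max(probabilities.items(), key=lambda item: item[1])
    if score < threshold:
        return "uncertain", score
    return label, score
```

For example, `apply_threshold({"phishing": 0.41, "malware": 0.35, "benign": 0.24})` returns `("uncertain", 0.41)` because no label reaches 0.5.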

Difficulty Mode

Setting controlling strictness of uncertainty handling:

  • default: Standard thresholds
  • soc-medium: Moderate strictness
  • soc-hard: Maximum strictness for edge cases
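
One way such modes might be wired up is a lookup from mode name to confidence threshold. This is only a sketch: the numeric values below are invented placeholders, and it assumes that "stricter" means a higher threshold (more cases routed to "uncertain"), which the glossary does not state.

```python
# Hypothetical values: the real thresholds are not given in this glossary.
MODE_THRESHOLDS = {
    "default": 0.50,
    "soc-medium": 0.65,
    "soc-hard": 0.80,
}


def threshold_for(mode):
    """Look up the confidence threshold for a difficulty mode."""
    try:
        return MODE_THRESHOLDS[mode]
    except KeyError:
        valid = ", ".join(sorted(MODE_THRESHOLDS))
        raise ValueError(
            f"unknown difficulty mode {mode!r}; expected one of: {valid}"
        ) from None
```

Rejecting unknown mode names early keeps a typo such as `soc_hard` from silently falling back to the default behaviour.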

GGUF

GPT-Generated Unified Format. File format for quantized LLMs, used by llama.cpp.

Guardrails

Safety mechanisms that limit the impact of LLM hallucinations and malformed output:

  • JSON parsing validation
  • SOC keyword intelligence
  • Label normalization
  • Timeout protection
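
Two of the guardrails listed above, JSON parsing validation and label normalization, can be sketched with the standard library alone. The allowed label set, the `label` key, and both function names are assumptions for illustration; keyword intelligence and timeout protection are not shown.

```python
import json

# Hypothetical label set; the project's real labels are not listed here.
ALLOWED_LABELS = {"phishing", "malware", "benign"}


def normalize_label(raw):
    """Lowercase, trim, and unify separators so near-matches compare equal."""
    return raw.strip().lower().replace("-", "_").replace(" ", "_")


def parse_llm_reply(text):
    """Validate a model reply; fall back to "uncertain" on any problem."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return "uncertain"  # reply was not valid JSON
    if not isinstance(payload, dict) or not isinstance(payload.get("label"), str):
        return "uncertain"  # valid JSON, but not the expected shape
    label = normalize_label(payload["label"])
    # Reject labels the model invented instead of passing them downstream.
    return label if label in ALLOWED_LABELS else "uncertain"
```

With this sketch, `parse_llm_reply('{"label": " Phishing "}')` yields `"phishing"`, while free text or an unknown label yields `"uncertain"`.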

Incident

Security event requiring investigation. Represented as a natural-language narrative in this system.

LLM

Large Language Model. Used for dataset enhancement and second-opinion triage.

MITRE ATT&CK®

Knowledge base of adversary tactics and techniques. Used for incident enrichment and mapping.

Quantization

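Model compression technique that stores weights at lower precision, reducing file size and memory use at some cost in accuracy. Q4, Q5, and Q8 denote roughly 4-, 5-, and 8-bit weights; fewer bits means a smaller model and a larger accuracy cost.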
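
The size/accuracy trade-off can be illustrated with a deliberately simplified sketch: symmetric rounding of a list of floats to signed integers with a single shared scale. This is not the scheme GGUF files use (llama.cpp quantizes weights in blocks, each with its own scale); the function names are illustrative.

```python
def quantize(values, bits=8):
    """Map floats to signed integers that share one scale factor."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    peak = max(abs(v) for v in values)
    scale = peak / levels if peak else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.9]
for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit codes={codes} worst error={worst:.4f}")
```

Running it shows the 4-bit codes reproducing the weights less faithfully than the 8-bit codes, which is the accuracy cost the definition refers to.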

Rewrite Engine

LLM-powered component enhancing synthetic narratives during generation.

Second Opinion

LLM-assisted classification for uncertain cases. Provides alternative perspective with rationale.

SOC

Security Operations Center. Team responsible for monitoring and responding to security incidents.

Synthetic Data

Artificially generated training data. This project uses 100% synthetic incidents.

TF-IDF

Term Frequency-Inverse Document Frequency. Statistical measure for text feature extraction.
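
As a worked sketch, the textbook form weights a term by its frequency within a document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries such as scikit-learn apply smoothed and normalized variants by default, so their numbers will differ from this sketch; the whitespace tokenizer and function name are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    total_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(total_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document gets weight 0, while a term unique to one document gets the highest weight, which is why TF-IDF highlights the distinctive words of an incident narrative.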

Triage

Process of categorizing and prioritizing incidents based on severity and type.

Uncertain

Label assigned when classifier confidence is below threshold. Indicates manual review needed.

Vectorization

Conversion of text to numerical representations for ML processing.

Acronyms

Acronym  Full Term
API      Application Programming Interface
ATT&CK   Adversarial Tactics, Techniques, and Common Knowledge
CI/CD    Continuous Integration/Continuous Deployment
CLI      Command-Line Interface
CPU      Central Processing Unit
CSV      Comma-Separated Values
EDR      Endpoint Detection and Response
ETA      Estimated Time of Arrival
GGUF     GPT-Generated Unified Format
GPU      Graphics Processing Unit
IR       Incident Response
JSON     JavaScript Object Notation
JSONL    JSON Lines (one JSON object per line)
LLM      Large Language Model
MITRE    Not an acronym; the name of The MITRE Corporation
ML       Machine Learning
NLP      Natural Language Processing
RAM      Random Access Memory
SIEM     Security Information and Event Management
SOAR     Security Orchestration, Automation, and Response
SOC      Security Operations Center
TF-IDF   Term Frequency-Inverse Document Frequency
UI       User Interface
URL      Uniform Resource Locator

File Extensions

Extension  Description
.csv       Comma-separated values dataset file
.gguf      Quantized LLM model file
.joblib    Serialized scikit-learn model or vectorizer
.json      JSON configuration or results file
.jsonl     JSON Lines (bulk results, one record per line)
.log       Text log file
.md        Markdown documentation file
.py        Python source code file
.sh        Shell script file
.yml       YAML configuration file
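
Of the formats listed above, JSON Lines is the least familiar, so here is a minimal sketch of writing and reading it with the standard library. The function names and the record fields in the usage note are illustrative, not the project's schema.

```python
import io
import json


def write_jsonl(records, handle):
    """Write each record as one JSON object on its own line."""
    for record in records:
        handle.write(json.dumps(record) + "\n")


def read_jsonl(handle):
    """Parse one record per non-blank line."""
    return [json.loads(line) for line in handle if line.strip()]


buffer = io.StringIO()
write_jsonl([{"id": 1, "label": "phishing"}, {"id": 2, "label": "uncertain"}], buffer)
buffer.seek(0)
records = read_jsonl(buffer)
```

Because every line is a complete JSON object, results can be appended as they are produced and a partially written file stays readable up to its last full line.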

For technical terms, see Architecture and Model Information.