LLM Integration

Overview

The project uses llama.cpp (via the llama-cpp-python bindings) for privacy-first, fully local LLM inference in two key areas:

  1. Dataset Generation - Enhancing synthetic narratives
  2. Second-Opinion Triage - Assisting with uncertain classifications

Setup

Install llama-cpp-python

# For Apple Silicon (Metal acceleration)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

# For CUDA GPUs
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

# CPU only
pip install llama-cpp-python
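
To confirm the bindings built and import correctly, a quick check from Python:

import llama_cpp

# If this prints a version string, the install succeeded
print(llama_cpp.__version__)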

Download Model

Choose a GGUF model and download via Hugging Face CLI (authenticate if required):

# Create models directory
mkdir -p models

# Option A: Llama 3.1 8B Instruct (higher quality)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --local-dir models

# Option B: Mistral 7B Instruct v0.2 (mid-size)
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q6_K.gguf --local-dir models

# Option C: TinyLlama 1.1B Chat (small, CPU-friendly)
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  tinyllama-1.1b-chat-v1.0.Q6_K.gguf --local-dir models
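
If you prefer to script the download, the same files can be fetched with the huggingface_hub library (shown here for Option B; swap in whichever repo and filename you chose):

from huggingface_hub import hf_hub_download

# Downloads the GGUF file into ./models and returns its local path
path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q6_K.gguf",
    local_dir="models",
)
print(path)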

Configure Environment

# Preferred (CLI):
export TRIAGE_LLM_MODEL="$(pwd)/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf"
export TRIAGE_LLM_DEBUG=1

# Alternative (for compatibility with the library client):
export NLP_TRIAGE_LLM_BACKEND="$TRIAGE_LLM_MODEL"
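
A minimal sketch of how the model path is typically consumed, assuming the standard llama-cpp-python API (the project's own loader may wrap this differently):

import os

from llama_cpp import Llama

# Load the GGUF file pointed to by TRIAGE_LLM_MODEL; verbose output
# mirrors the TRIAGE_LLM_DEBUG flag set above
llm = Llama(
    model_path=os.environ["TRIAGE_LLM_MODEL"],
    verbose=os.getenv("TRIAGE_LLM_DEBUG") == "1",
)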

Dataset Generation

Enable LLM Rewriting

python generator/generate_cyber_incidents.py \
  --n-events 5000 \
  --use-llm \
  --rewrite-report audit.json

Rewrite Parameters

export NLP_TRIAGE_LLM_REWRITE_PROB=0.30   # Fraction of incidents rewritten (30%)
export NLP_TRIAGE_LLM_TEMPERATURE=0.2     # Low temperature for focused rewrites
export NLP_TRIAGE_LLM_MAX_RETRIES=3       # Retries on generation errors

How It Works

  1. Generator creates baseline narrative
  2. LLM probabilistically rewrites (30% by default)
  3. Sanitization removes artifacts
  4. Validation ensures quality
  5. Audit log tracks statistics
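
A minimal sketch of steps 2-4 above, assuming llama-cpp-python's chat completion API; maybe_rewrite and the prompt text are illustrative, not the generator's actual code:

import os
import random

def maybe_rewrite(narrative, llm):
    # Step 2: rewrite only a configurable fraction of incidents (hypothetical helper)
    prob = float(os.getenv("NLP_TRIAGE_LLM_REWRITE_PROB", "0.30"))
    if random.random() >= prob:
        return narrative
    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": f"Rewrite this incident narrative:\n{narrative}"}],
        temperature=float(os.getenv("NLP_TRIAGE_LLM_TEMPERATURE", "0.2")),
        max_tokens=256,
    )
    # Steps 3-4: strip whitespace artifacts and fall back if the rewrite is empty
    text = out["choices"][0]["message"]["content"].strip()
    return text if text else narrative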

Second-Opinion Triage

CLI Usage

# Single incident
nlp-triage --llm-second-opinion "Suspicious activity detected"

# Bulk processing
nlp-triage --llm-second-opinion \
  --input-file incidents.txt \
  --output-file results.jsonl

Streamlit UI

Enable the "LLM Second Opinion" toggle in the sidebar.

Guardrails

The second-opinion engine includes multiple safety layers:

  1. JSON Parsing - Structured output validation
  2. SOC Keyword Intelligence - Domain-specific validation
  3. Label Normalization - Maps variations to canonical labels
  4. Confidence Filtering - Only engages on uncertain cases
  5. Timeout Protection - Prevents hanging on bad inputs
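
A minimal sketch of layers 1, 3, and 4, with an illustrative (not the engine's actual) label map:

import json

# Hypothetical canonical label map; the real engine's set is larger
CANONICAL = {"phish": "phishing", "phishing": "phishing", "malware": "malware"}

def second_opinion(raw_output, baseline_confidence, threshold=0.85):
    # Layer 4: only engage when the classifier is uncertain
    if baseline_confidence >= threshold:
        return None
    # Layer 1: require well-formed JSON from the model
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    # Layer 3: map label variants onto canonical names (None if unmappable)
    label = str(payload.get("label", "")).strip().lower()
    return CANONICAL.get(label)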

Advanced Configuration

Model Parameters

# Context window (tokens)
export TRIAGE_LLM_CTX=4096

# Max generation tokens
export TRIAGE_LLM_MAX_TOKENS=512

# Sampling temperature (lower = more deterministic)
export TRIAGE_LLM_TEMP=0.2

# Top-p sampling
export TRIAGE_LLM_TOP_P=0.9
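
Assuming these variables map onto the standard llama-cpp-python parameters, the wiring looks roughly like this (a sketch, not the project's actual loader):

import os

from llama_cpp import Llama

llm = Llama(
    model_path=os.environ["TRIAGE_LLM_MODEL"],
    n_ctx=int(os.getenv("TRIAGE_LLM_CTX", "4096")),  # context window
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: repeated failed logins"}],
    max_tokens=int(os.getenv("TRIAGE_LLM_MAX_TOKENS", "512")),
    temperature=float(os.getenv("TRIAGE_LLM_TEMP", "0.2")),
    top_p=float(os.getenv("TRIAGE_LLM_TOP_P", "0.9")),
)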

Backend Selection

# Absolute path to model
export NLP_TRIAGE_LLM_BACKEND=/full/path/to/model.gguf

Performance Tuning

CPU Optimization

# Increase thread count
export OMP_NUM_THREADS=8

# Use BLAS
CMAKE_ARGS="-DLLAMA_BLAS=ON" pip install llama-cpp-python

GPU Acceleration

# CUDA
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

# Metal (Apple Silicon)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

Memory Management

# Reduce context for lower memory
export TRIAGE_LLM_CTX=2048

# Use quantized models (Q4, Q5): smaller quants need less RAM
# and run faster, at some cost in accuracy

Troubleshooting

Import Errors

# Reinstall with correct flags
pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

Slow Inference

  • Use quantized models (Q5_K_S recommended)
  • Enable GPU acceleration
  • Reduce context window
  • Lower max tokens

Out of Memory

  • Use a smaller model (e.g., 7B instead of 13B)
  • Reduce context window
  • Close other applications

Debug Mode

export TRIAGE_LLM_DEBUG=1
nlp-triage --llm-second-opinion "test incident"

Best Practices

✅ Use quantized models (Q5_K_S or Q4_K_M)
✅ Enable GPU acceleration when available
✅ Set appropriate context window for your RAM
✅ Monitor resource usage during generation
✅ Use lower temperature for focused outputs
✅ Enable debug mode for troubleshooting

❌ Don't use unquantized models (too large)
❌ Don't set context > 8192 without sufficient RAM
❌ Don't ignore timeout warnings
❌ Don't disable guardrails in production


See Production Generation for monitoring LLM-enhanced dataset creation.