LLM Integration¶
Overview¶
The project uses llama.cpp (via the llama-cpp-python bindings) for privacy-first, fully local LLM inference in two key areas:
- Dataset Generation - Enhancing synthetic narratives
- Second-Opinion Triage - Assisting with uncertain classifications
Setup¶
Install llama-cpp-python¶
# For Apple Silicon (Metal acceleration)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# For CUDA GPUs
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
# CPU only
pip install llama-cpp-python
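A quick import confirms the binding was built and installed correctly; a native-library error here usually means the wheel was compiled without the intended CMAKE_ARGS (see Troubleshooting below).
# Verify that llama-cpp-python imports, and report its version
import llama_cpp
print(llama_cpp.__version__)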
Download Model¶
Choose a GGUF model and download it with the Hugging Face CLI (run huggingface-cli login first if the repository is gated):
# Create models directory
mkdir -p models
# Option A: Llama 3.1 8B Instruct (higher quality)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --local-dir models
# Option B: Mistral 7B Instruct v0.2 (mid-size)
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
mistral-7b-instruct-v0.2.Q6_K.gguf --local-dir models
# Option C: TinyLlama 1.1B Chat (small, CPU-friendly)
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
tinyllama-1.1b-chat-v1.0.Q6_K.gguf --local-dir models
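Once a model is downloaded, a short llama-cpp-python smoke test confirms it loads and generates (the path shown is for Option B; adjust to whichever file you fetched):
# Minimal load-and-generate check for a downloaded GGUF model
from llama_cpp import Llama
llm = Llama(model_path="models/mistral-7b-instruct-v0.2.Q6_K.gguf", n_ctx=2048, verbose=False)
out = llm("Say OK.", max_tokens=8)
print(out["choices"][0]["text"])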
Configure Environment¶
# Preferred (CLI):
export TRIAGE_LLM_MODEL="$(pwd)/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf"
export TRIAGE_LLM_DEBUG=1
# Alternative (library client compatibility):
export NLP_TRIAGE_LLM_BACKEND="$TRIAGE_LLM_MODEL"
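A small sanity check that the configured path actually points at a file saves a confusing failure later; the fallback order shown here (CLI variable first) is an assumption for illustration.
# Confirm the configured model path exists before running the tools
import os
from pathlib import Path
model = os.environ.get("TRIAGE_LLM_MODEL") or os.environ.get("NLP_TRIAGE_LLM_BACKEND")
assert model and Path(model).is_file(), f"Model file not found: {model!r}"
print(f"Using GGUF model: {model}")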
Dataset Generation¶
Enable LLM Rewriting¶
python generator/generate_cyber_incidents.py \
--n-events 5000 \
--use-llm \
--rewrite-report audit.json
Rewrite Parameters¶
export NLP_TRIAGE_LLM_REWRITE_PROB=0.30 # 30% of incidents rewritten
export NLP_TRIAGE_LLM_TEMPERATURE=0.2 # Focused generation
export NLP_TRIAGE_LLM_MAX_RETRIES=3 # Error recovery
How It Works¶
LLM-enhanced generation proceeds in five steps (a minimal sketch follows the list):
- Generator creates a baseline narrative
- LLM probabilistically rewrites it (30% of incidents by default)
- Sanitization removes prompt artifacts
- Validation ensures quality
- Audit log tracks rewrite statistics
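The sketch below illustrates this flow; rewrite_with_llm, sanitize, and is_valid are placeholder stand-ins for the generator's internal helpers, not its actual API.
import random
REWRITE_PROB = 0.30  # mirrors NLP_TRIAGE_LLM_REWRITE_PROB
def rewrite_with_llm(text: str) -> str:
    return text  # placeholder: the real generator calls the local LLM here
def sanitize(text: str) -> str:
    return text.strip()  # placeholder: the real generator strips prompt artifacts
def is_valid(text: str) -> bool:
    return len(text) > 40  # placeholder quality gate
def enhance(narrative: str, audit: list) -> str:
    if random.random() >= REWRITE_PROB:
        audit.append({"rewritten": False})
        return narrative                               # keep the baseline narrative
    draft = sanitize(rewrite_with_llm(narrative))      # rewrite, then clean artifacts
    if not is_valid(draft):
        audit.append({"rewritten": False, "reason": "validation_failed"})
        return narrative                               # fall back to the baseline
    audit.append({"rewritten": True})
    return draft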
Second-Opinion Triage¶
CLI Usage¶
# Single incident
nlp-triage --llm-second-opinion "Suspicious activity detected"
# Bulk processing
nlp-triage --llm-second-opinion \
--input-file incidents.txt \
--output-file results.jsonl
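The bulk output is JSON Lines, one record per incident; a minimal reader (field names are whatever the CLI emits, so records are simply pretty-printed):
# Inspect the bulk triage output
import json
with open("results.jsonl") as fh:
    for line in fh:
        print(json.dumps(json.loads(line), indent=2))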
Streamlit UI¶
Enable "LLM Second Opinion" toggle in the sidebar.
Guardrails¶
The second-opinion engine includes multiple safety layers (sketched in part after this list):
- JSON Parsing - Structured output validation
- SOC Keyword Intelligence - Domain-specific validation
- Label Normalization - Maps variations to canonical labels
- Confidence Filtering - Only engages on uncertain cases
- Timeout Protection - Prevents hanging on bad inputs
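The sketch below illustrates the JSON-parsing and label-normalization layers; the label set and alias map are assumptions for illustration, not necessarily the project's taxonomy.
import json
CANONICAL = {"benign", "suspicious", "malicious"}     # assumed label set
ALIASES = {"malware": "malicious", "ok": "benign"}    # assumed variations
def parse_second_opinion(raw: str):
    """Hypothetical guardrail: validate structured output and normalize the label."""
    try:
        payload = json.loads(raw)          # structured-output validation
    except json.JSONDecodeError:
        return None                        # reject completions that are not JSON
    if not isinstance(payload, dict):
        return None
    label = str(payload.get("label", "")).strip().lower()
    label = ALIASES.get(label, label)      # map variations to canonical labels
    return label if label in CANONICAL else None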
Advanced Configuration¶
Model Parameters¶
# Context window (tokens)
export TRIAGE_LLM_CTX=4096
# Max generation tokens
export TRIAGE_LLM_MAX_TOKENS=512
# Temperature (creativity)
export TRIAGE_LLM_TEMP=0.2
# Top-p sampling
export TRIAGE_LLM_TOP_P=0.9
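A sketch of how these variables could map onto llama-cpp-python parameters; the exact wiring inside the triage engine may differ.
import os
from llama_cpp import Llama
llm = Llama(
    model_path=os.environ["TRIAGE_LLM_MODEL"],
    n_ctx=int(os.environ.get("TRIAGE_LLM_CTX", 4096)),      # context window
)
out = llm(
    "Classify this incident: ...",
    max_tokens=int(os.environ.get("TRIAGE_LLM_MAX_TOKENS", 512)),
    temperature=float(os.environ.get("TRIAGE_LLM_TEMP", 0.2)),
    top_p=float(os.environ.get("TRIAGE_LLM_TOP_P", 0.9)),
)
print(out["choices"][0]["text"])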
Backend Selection¶
The compute backend (Metal, CUDA, or CPU) is fixed at install time by the CMAKE_ARGS passed to llama-cpp-python (see Setup and Performance Tuning), while the model file itself is selected through the environment variables in Configure Environment (TRIAGE_LLM_MODEL for the CLI, NLP_TRIAGE_LLM_BACKEND for library clients).
Performance Tuning¶
CPU Optimization¶
# Increase thread count
export OMP_NUM_THREADS=8
# Use BLAS
CMAKE_ARGS="-DLLAMA_BLAS=ON" pip install llama-cpp-python
GPU Acceleration¶
# CUDA
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
# Metal (Apple Silicon)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
Memory Management¶
# Reduce context for lower memory
export TRIAGE_LLM_CTX=2048
# Prefer smaller quantizations (Q4_K_M, Q5_K_S): less accurate, but faster and lower memory
Troubleshooting¶
Import Errors¶
# Reinstall with the flags for your platform (Metal shown; use the CUDA or CPU variant from Setup)
pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
Slow Inference¶
- Use quantized models (Q5_K_S recommended)
- Enable GPU acceleration
- Reduce context window
- Lower max tokens
Out of Memory¶
- Use smaller model (7B instead of 13B)
- Reduce context window
- Close other applications
Debug Mode¶
Set TRIAGE_LLM_DEBUG=1 (as in Configure Environment above) to enable verbose logging from the LLM backend while reproducing a problem.
Best Practices¶
✅ Use quantized models (Q5_K_S or Q4_K_M)
✅ Enable GPU acceleration when available
✅ Set appropriate context window for your RAM
✅ Monitor resource usage during generation
✅ Use lower temperature for focused outputs
✅ Enable debug mode for troubleshooting
❌ Don't use unquantized models (too large)
❌ Don't set context > 8192 without sufficient RAM
❌ Don't ignore timeout warnings
❌ Don't disable guardrails in production
See Production Generation for monitoring LLM-enhanced dataset creation.