# Frequently Asked Questions

## General

### What is this project?
An educational cybersecurity triage platform demonstrating NLP-based incident classification with LLM enhancement and uncertainty-aware predictions.
### Is this production-ready?
No. This is designed for education, research, and portfolio demonstration. It requires evaluation on real data before operational use.
### Can I use this in my SOC?
Not without extensive testing and validation. Treat it as decision-support only, requiring human analyst oversight.
## Technical

### What data is this trained on?
Entirely synthetic data generated by the included scripts. Performance on real incidents is unknown.
### Why does it sometimes say "uncertain"?

The model applies a confidence threshold to avoid low-quality predictions: when no class reaches the threshold, the result is flagged as "uncertain" and manual review is recommended.
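
A minimal sketch of the idea, with illustrative function and threshold names (not the project's actual API):

```python
def triage_label(probabilities: dict[str, float], threshold: float = 0.6) -> str:
    """Return the top class, or "uncertain" when confidence is below the threshold.

    `probabilities` maps incident categories to predicted probabilities;
    the 0.6 default is illustrative, not the project's actual setting.
    """
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    return label if confidence >= threshold else "uncertain"

# The top class only reaches 0.48, so the prediction is flagged for review.
print(triage_label({"phishing": 0.48, "malware": 0.32, "benign": 0.20}))  # -> uncertain
```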
### How do I improve accuracy?

- Adjust the confidence threshold
- Enable the LLM second opinion (see the sketch below)
- Use difficulty mode for edge cases
- Train on real data (requires significant work)
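
As a sketch of how an LLM second opinion can be combined with the classifier, here is one possible reconciliation policy (the function and the escalation rule are illustrative, not the project's actual implementation):

```python
def reconcile(clf_label: str, clf_confidence: float,
              llm_label: str, threshold: float = 0.6) -> str:
    """Combine the classifier's prediction with the LLM's second opinion.

    Illustrative policy: agreement accepts the label; disagreement on a
    low-confidence prediction escalates to manual review.
    """
    if clf_label == llm_label:
        return clf_label
    if clf_confidence >= threshold:
        return clf_label       # confident classifier wins on disagreement
    return "uncertain"         # disagreement + low confidence -> analyst review

print(reconcile("phishing", 0.45, "malware"))  # -> uncertain
```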
### Does the LLM send data externally?

No. All LLM inference runs locally via llama.cpp (through the llama-cpp-python bindings); no data leaves your machine.
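
Inference goes through the llama-cpp-python bindings against a GGUF file on disk, roughly like this (the model path and prompt are placeholders):

```python
from llama_cpp import Llama

# Load a local GGUF model; nothing is sent over the network.
llm = Llama(model_path="models/llama-2-7b-chat.Q5_K_S.gguf", n_ctx=2048, verbose=False)

response = llm(
    "Classify this incident: repeated failed logins followed by a successful "
    "login from a new country.",
    max_tokens=64,
    temperature=0.0,
)
print(response["choices"][0]["text"])
```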
## Dataset Generation

### How long does generation take?
- Without LLM: ~2.5 hours for 50k incidents
- With 30% LLM: ~10 hours for 50k incidents
- Performance varies by hardware
### Can I resume interrupted generation?

Yes! The system checkpoints progress; just rerun the launcher and generation resumes where it left off.
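
The checkpoint format is project-specific, but the resume pattern looks roughly like this (the file name and fields are illustrative):

```python
import json
from pathlib import Path

CHECKPOINT = Path("output/checkpoint.json")  # illustrative location

def load_progress() -> int:
    """Return the number of incidents already generated, or 0 on a fresh run."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["generated"]
    return 0

def save_progress(generated: int) -> None:
    CHECKPOINT.write_text(json.dumps({"generated": generated}))

start = load_progress()  # rerunning the launcher picks up from here
```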
### What if I run out of disk space?

The dataset is roughly 2 MB per 1,000 incidents, so a 50,000-incident run takes about 100 MB. Plan accordingly for larger generations.
## LLM Integration

### Which model should I use?

Llama-2-7B-Chat (Q5_K_S) is recommended for its balance of quality and speed.
### Do I need a GPU?
No, but it helps. The system works on CPU, just slower.
### Why is LLM inference slow?
Local LLM inference is computationally intensive. Use quantized models and GPU acceleration when possible.
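
If your llama-cpp-python build has GPU support (CUDA or Metal), you can offload layers at load time; a sketch with a placeholder model path:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU when llama-cpp-python was
# built with GPU support; on a CPU-only build the setting is ignored.
llm = Llama(
    model_path="models/llama-2-7b-chat.Q5_K_S.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
)
```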
## CLI & UI

### How do I enable LLM in the CLI?

Use the `--llm-second-opinion` flag.
### Can I process multiple incidents at once?

Yes! Use `--input-file` for bulk mode.
### How do I get JSON output?

Add the `--json` flag.
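
The JSON output can then be piped into other tooling; a minimal consumer, assuming hypothetical field names (check the actual output for the real schema):

```python
import json
import sys

# Read the CLI's JSON output from stdin; the field names below are assumptions.
result = json.load(sys.stdin)
print(result.get("label"), result.get("confidence"))
```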
## Troubleshooting

### "Model files not found"

Run a notebook or a test first; they trigger the automatic dataset download.

### "LLM import failed"

Install llama-cpp-python: `pip install llama-cpp-python`
"Out of memory during generation"¶
Reduce LLM rewrite probability or chunk size.
"Unexpected label in LLM output"¶
This is normal - guardrails map variations to canonical labels.
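
The guardrail is essentially a normalization step that maps free-form LLM text onto the canonical label set; a simplified sketch (the labels and aliases shown are illustrative, not the project's actual taxonomy):

```python
CANONICAL = {"phishing", "malware", "ddos", "insider_threat", "benign"}

ALIASES = {
    "phish": "phishing",
    "spear phishing": "phishing",
    "ransomware": "malware",
    "denial of service": "ddos",
}

def normalize_label(raw: str) -> str:
    """Map an LLM-produced label onto the canonical set, else flag it."""
    label = raw.strip().lower().rstrip(".")
    if label in CANONICAL:
        return label
    return ALIASES.get(label, "uncertain")

print(normalize_label("Spear Phishing"))  # -> phishing
```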
## Contributing

### How can I contribute?

See the Contributing Guide for details.
### Where do I report bugs?

Open an issue on the project's issue tracker.
### Can I add new features?
Yes! Fork, develop, test, and submit a pull request.
Still have questions? Open a Discussion.