NLP-Driven Incident Triage
Educational cybersecurity incident triage platform demonstrating intelligent classification through Natural Language Processing, LLM enhancement, and uncertainty-aware predictions.
- Quick Start
Get up and running in minutes with our streamlined setup guide
- CLI Tool
Powerful command-line interface for incident classification
- Web Interface
Interactive Streamlit UI with visual analytics and bulk processing
- Dataset Generation
Create synthetic SOC datasets with LLM enhancement
Overview
This is an educational and research platform that demonstrates cybersecurity incident triage through Natural Language Processing. It shows how analyst-style narratives can be converted into structured incident categories using a transparent, reproducible ML workflow.
Educational Project - Not Production IR Tooling
This project is designed for education, research, and portfolio demonstration.
It is not a drop-in replacement for enterprise incident response systems and should not be deployed unsupervised in a live SOC environment.
✨ Key Features
- LLM-Enhanced Generation
Local llama.cpp models for privacy-first intelligent dataset creation with sanitization and caching
- Uncertainty-Aware Classification
TF-IDF + Logistic Regression with configurable thresholds and intelligent fallback handling
- Second-Opinion Engine
LLM assistance for uncertain cases with JSON guardrails and hallucination prevention
- Interactive Analytics
Streamlit UI with real-time classification, bulk analysis, and visual threat intelligence
- Production Monitoring
Real-time progress tracking, ETA calculation, and resource efficiency metrics
- Research-Grade Dataset
100k synthetic incidents with MITRE ATT&CK enrichment and realistic noise
🏗️ Architecture Overview
graph TB
A[Data Generation] -->|LLM Rewriter| B[Synthetic Dataset]
B --> C[Preprocessing Pipeline]
C --> D[TF-IDF Vectorization]
D --> E{Baseline Classifier}
E -->|High Confidence| F[Direct Classification]
E -->|Low Confidence| G[LLM Second Opinion]
G --> H[Final Decision]
F --> H
H --> I[CLI Output]
H --> J[Streamlit UI]
H --> K[JSON Export]
style E fill:#9c27b0,stroke:#7b1fa2,color:#fff
style G fill:#ff9800,stroke:#f57c00,color:#fff
style H fill:#4caf50,stroke:#388e3c,color:#fff
🎯 Use Cases
- Learn NLP techniques for cybersecurity
- Understand uncertainty-aware classification
- Explore MITRE ATT&CK framework integration
- Study SOC automation concepts
- Prototype triage automation ideas
- Experiment with LLM-enhanced generation
- Test classification algorithms
- Develop synthetic security datasets
- Demonstrate ML engineering skills
- Showcase end-to-end project development
- Highlight production-grade tooling
- Present interactive visualizations
🚀 Quick Examples
CLI Classification
# Basic incident analysis
nlp-triage "User reported suspicious email with attachment"
# JSON output for scripting
nlp-triage --json "Multiple failed login attempts detected"
# LLM-assisted bulk processing
nlp-triage --llm-second-opinion \
--input-file incidents.txt \
--output-file results.jsonl
Dataset Generation
# Quick generation (1000 incidents)
python generator/generate_cyber_incidents.py --n-events 1000
# Production with monitoring
./generator/launch_generator.sh 50000 my_dataset
./generator/monitor_generation.sh my_dataset --watch
Streamlit UI
📊 What's Inside
| Component | Description |
|---|---|
| Dataset | 100k synthetic SOC incidents with multi-perspective narratives |
| Models | TF-IDF vectorizer + Logistic Regression baseline |
| CLI | Rich-formatted command-line interface with uncertainty logic |
| UI | Streamlit web application with visual analytics |
| Notebooks | 9 Jupyter notebooks covering full ML pipeline |
| Generator | LLM-enhanced synthetic data creation with monitoring |
| Tests | Comprehensive pytest suite with CI/CD |
| Docs | MkDocs Material documentation site |
🎓 Learning Path
New to the project? Follow this recommended learning path:
- Getting Started - Set up environment and run first predictions
- CLI Usage - Master the command-line interface
- Dataset Generation - Understand the synthetic data
- Modeling & Evaluation - Deep dive into the ML pipeline
- Notebooks Overview - Explore interactive analysis
- Development Guide - Contribute to the project
🔬 Technical Highlights
Shared Preprocessing Pipeline
- Consistent text cleaning across training and inference
- Unicode normalization and punctuation cleanup
- TF-IDF feature extraction with 5k feature limit
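The cleaning step described above can be sketched with the standard library alone. This is an illustration, not the project's actual code: the function name `clean_text` and the exact set of punctuation kept are assumptions chosen for the example.

```python
import re
import unicodedata

# Hypothetical patterns: keep characters that often carry meaning in SOC text
# (IP addresses, URLs, email addresses) and drop other punctuation.
_PUNCT_RUN = re.compile(r"[^\w\s.:/@-]+")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode, lowercase, and strip noisy punctuation.

    Hypothetical helper; the project's real cleaning function may differ.
    """
    # NFKC folds compatibility forms (full-width letters, ligatures)
    # into their canonical equivalents.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Replace runs of unwanted punctuation with a single space.
    text = _PUNCT_RUN.sub(" ", text)
    # Collapse repeated whitespace and trim the ends.
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("Ｕser  reported!!! suspicious e-mail from bob@example.com"))
```

Calling one function like this from both the training and inference paths is what keeps the features consistent. With scikit-learn, one option would be to pass it as `preprocessor=clean_text` to `TfidfVectorizer` together with `max_features=5000`, though whether the project wires it up that way is not stated here.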
Uncertainty-Aware Predictions
- Configurable confidence thresholds
- Intelligent `uncertain` fallback for ambiguous cases
- Scenario-driven behavior matching SOC reality
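A minimal sketch of the threshold logic, assuming the classifier exposes per-class probabilities as a mapping. The function name `triage_label`, the default threshold of 0.5, and the example category names are all hypothetical and only serve the illustration.

```python
from typing import Mapping, Tuple

UNCERTAIN_LABEL = "uncertain"


def triage_label(
    probabilities: Mapping[str, float], threshold: float = 0.5
) -> Tuple[str, float]:
    """Return (label, confidence), falling back to 'uncertain' below threshold.

    Hypothetical helper; the threshold default is an assumption.
    """
    if not probabilities:
        return UNCERTAIN_LABEL, 0.0
    # Pick the class with the highest predicted probability.
    best_label, best_prob = max(probabilities.items(), key=lambda item: item[1])
    # Refuse to commit when the top class is not confident enough.
    if best_prob < threshold:
        return UNCERTAIN_LABEL, best_prob
    return best_label, best_prob


if __name__ == "__main__":
    print(triage_label({"phishing": 0.82, "malware": 0.18}))
    print(triage_label({"phishing": 0.41, "malware": 0.35, "benign": 0.24}))
```

Returning the confidence alongside the label lets a caller decide what to do with an `uncertain` result, for example routing it to a human analyst or to an LLM second opinion.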
LLM Integration
- Privacy-first local inference (llama.cpp)
- JSON parsing guardrails
- SOC keyword validation
- Deterministic rationale generation
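One way to read the guardrails listed above is: never trust raw model output, extract a JSON object from it, and reject any label outside a known vocabulary. The sketch below follows that reading. The function name, the `label` and `rationale` field names, and the label set are assumptions, not the project's actual schema.

```python
import json
from typing import FrozenSet, Optional

# Hypothetical label vocabulary used only for this example.
ALLOWED_LABELS: FrozenSet[str] = frozenset({"phishing", "malware", "access_abuse"})


def parse_second_opinion(
    raw: str, allowed: FrozenSet[str] = ALLOWED_LABELS
) -> Optional[dict]:
    """Extract and validate a JSON object from raw LLM output.

    Returns None when the output is unusable, so the caller can keep
    the baseline prediction instead.
    """
    # Models often wrap JSON in chatty text; slice from the first '{'
    # to the last '}'.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("label")
    # Reject labels outside the known set to guard against
    # hallucinated categories.
    if not isinstance(label, str) or label.strip().lower() not in allowed:
        return None
    return {
        "label": label.strip().lower(),
        "rationale": str(payload.get("rationale", "")),
    }


if __name__ == "__main__":
    print(parse_second_opinion('Sure! {"label": "Phishing", "rationale": "credential lure"}'))
    print(parse_second_opinion('{"label": "alien_invasion"}'))
```

Returning `None` rather than raising keeps the second opinion strictly optional: a malformed or off-vocabulary answer leaves the baseline classification untouched.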
Production-Grade Tooling
- Checkpoint-based resumable generation
- Real-time progress monitoring
- Resource efficiency tracking
- Comprehensive error handling
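Resumable generation and ETA reporting can be sketched as follows. This is a simplified illustration under stated assumptions: the checkpoint is a small JSON file with a single `completed` counter, and the ETA uses average throughput so far. The function names and file layout are hypothetical and do not describe the generator's real checkpoint format.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the number of events already generated, or 0 if none."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return int(json.load(fh)["completed"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        # Missing or unreadable checkpoint: start from scratch.
        return 0


def save_checkpoint(path: str, completed: int) -> None:
    """Write the checkpoint atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"completed": completed}, fh)
    # os.replace swaps the file in one step, so a crash mid-write
    # cannot leave a half-written checkpoint behind.
    os.replace(tmp_path, path)


def eta_seconds(completed: int, total: int, elapsed: float) -> float:
    """Estimate remaining seconds from the average throughput so far."""
    if completed <= 0:
        return float("inf")
    rate = completed / elapsed if elapsed > 0 else float("inf")
    return (total - completed) / rate


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    save_checkpoint(ckpt, 250)
    print(load_checkpoint(ckpt), eta_seconds(250, 1000, 50.0))
```

A generation loop would call `load_checkpoint` once at startup, skip that many events, and call `save_checkpoint` every batch, so an interrupted run resumes from the last saved batch.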
📚 Documentation Structure
- User Guide
Learn how to use the tools and interfaces
CLI Usage
Streamlit UI
Dataset Generation
Configuration
- Technical Deep Dive
Understand the architecture and implementation
Architecture
Model Information
Modeling & Evaluation
LLM Integration
- Development
Contribute to the project
Development Guide
Testing
API Reference
Contributing
- Reference
Additional information and resources
🌟 Project Goals
This project aims to demonstrate:
✅ End-to-end ML pipeline from data generation to deployment
✅ Uncertainty-aware classification for real-world ambiguity
✅ Privacy-first LLM integration for enhanced intelligence
✅ Production-grade monitoring and observability
✅ Interactive visualizations and analytics
✅ Comprehensive documentation and testing
🤝 Contributing
Contributions are welcome, including:
- 🐛 Bug reports
- 💡 Feature requests
- 📖 Documentation improvements
- 🔧 Code contributions
See our Contributing Guide to get started.
📄 License
This project is licensed under the Apache License 2.0. See LICENSE for details.