NLP-Driven Incident Triage

An educational cybersecurity incident triage platform that demonstrates incident classification with Natural Language Processing, LLM enhancement, and uncertainty-aware predictions.

Quick Start

Get up and running in minutes with our streamlined setup guide

Getting Started

CLI Tool

Powerful command-line interface for incident classification

CLI Usage

Web Interface

Interactive Streamlit UI with visual analytics and bulk processing

UI Guide

Dataset Generation

Create synthetic SOC datasets with LLM enhancement

Data & Generator

Overview

An educational/research platform demonstrating intelligent cybersecurity incident triage through Natural Language Processing. This project showcases how analyst-style narratives can be converted into structured incident categories using a transparent, reproducible ML workflow.

Educational Project - Not Production IR Tooling

This project is designed for education, research, and portfolio demonstration.
It is not a drop-in replacement for enterprise incident response systems and should not be deployed unsupervised in a live SOC environment.

✨ Key Features

LLM-Enhanced Generation

Local llama.cpp models for privacy-first intelligent dataset creation with sanitization and caching

Uncertainty-Aware Classification

TF-IDF + Logistic Regression with configurable thresholds and intelligent fallback handling

Second-Opinion Engine

LLM assistance for uncertain cases with JSON guardrails and hallucination prevention

Interactive Analytics

Streamlit UI with real-time classification, bulk analysis, and visual threat intelligence

Production Monitoring

Real-time progress tracking, ETA calculation, and resource efficiency metrics

Research-Grade Dataset

100k synthetic incidents with MITRE ATT&CK enrichment and realistic noise

🏗️ Architecture Overview

graph TB
    A[Data Generation] -->|LLM Rewriter| B[Synthetic Dataset]
    B --> C[Preprocessing Pipeline]
    C --> D[TF-IDF Vectorization]
    D --> E{Baseline Classifier}
    E -->|High Confidence| F[Direct Classification]
    E -->|Low Confidence| G[LLM Second Opinion]
    G --> H[Final Decision]
    F --> H
    H --> I[CLI Output]
    H --> J[Streamlit UI]
    H --> K[JSON Export]

    style E fill:#9c27b0,stroke:#7b1fa2,color:#fff
    style G fill:#ff9800,stroke:#f57c00,color:#fff
    style H fill:#4caf50,stroke:#388e3c,color:#fff
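
The routing shown in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the function names (`triage`, `toy_baseline`, `toy_second_opinion`), the 0.6 threshold, the label names, and the result fields are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical signatures: a baseline returns (label, confidence);
# a second-opinion function receives the text and the baseline label.
Baseline = Callable[[str], Tuple[str, float]]
SecondOpinion = Callable[[str, str], str]


def triage(text: str, baseline: Baseline, second_opinion: SecondOpinion,
           threshold: float = 0.6) -> Dict[str, object]:
    """Route one incident narrative through the two paths in the diagram."""
    label, confidence = baseline(text)
    if confidence >= threshold:
        # High confidence: accept the baseline label directly.
        return {"label": label, "confidence": confidence, "source": "baseline"}
    # Low confidence: let the second-opinion step make the final decision.
    final = second_opinion(text, label)
    return {"label": final, "confidence": confidence, "source": "second_opinion"}


# Stand-in components so the sketch runs without a trained model or an LLM.
def toy_baseline(text: str) -> Tuple[str, float]:
    return ("phishing", 0.9) if "email" in text.lower() else ("malware", 0.4)


def toy_second_opinion(text: str, baseline_label: str) -> str:
    return "access_abuse" if "login" in text.lower() else baseline_label


high = triage("Suspicious email with attachment", toy_baseline, toy_second_opinion)
low = triage("Multiple failed login attempts", toy_baseline, toy_second_opinion)
```

Passing the baseline and the second-opinion step as callables keeps the routing logic testable without loading either model.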

🎯 Use Cases

Education

Learn NLP techniques for cybersecurity
Understand uncertainty-aware classification
Explore MITRE ATT&CK framework integration
Study SOC automation concepts

Research

Prototype triage automation ideas
Experiment with LLM-enhanced generation
Test classification algorithms
Develop synthetic security datasets

Portfolio

Demonstrate ML engineering skills
Showcase end-to-end project development
Highlight production-grade tooling
Present interactive visualizations

🚀 Quick Examples

CLI Classification

# Basic incident analysis
nlp-triage "User reported suspicious email with attachment"

# JSON output for scripting
nlp-triage --json "Multiple failed login attempts detected"

# LLM-assisted bulk processing
nlp-triage --llm-second-opinion \
  --input-file incidents.txt \
  --output-file results.jsonl
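
A JSONL results file can be summarized from a script. The sketch below assumes each line is one JSON object with a `label` field; that field name and the sample records are assumptions for illustration, so check the actual output of `nlp-triage` before relying on them.

```python
import json
from collections import Counter
from typing import Iterable


def count_labels(lines: Iterable[str], key: str = "label") -> Counter:
    """Count values of `key` across JSONL lines, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record.get(key, "<missing>")] += 1
    return counts


# Inline sample standing in for results.jsonl; the field names are assumed.
sample = [
    '{"label": "phishing", "confidence": 0.91}',
    '{"label": "uncertain", "confidence": 0.42}',
    '',
    '{"label": "phishing", "confidence": 0.77}',
]
summary = count_labels(sample)
```

With a real file, pass the open file handle instead of the sample list, for example `with open("results.jsonl", encoding="utf-8") as fh: count_labels(fh)`.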

Dataset Generation

# Quick generation (1000 incidents)
python generator/generate_cyber_incidents.py --n-events 1000

# Production with monitoring
./generator/launch_generator.sh 50000 my_dataset
./generator/monitor_generation.sh my_dataset --watch

Streamlit UI

# Launch interactive interface
streamlit run ui_premium.py

📊 What's Inside

Component	Description
Dataset	100k synthetic SOC incidents with multi-perspective narratives
Models	TF-IDF vectorizer + Logistic Regression baseline
CLI	Rich-formatted command-line interface with uncertainty logic
UI	Streamlit web application with visual analytics
Notebooks	9 Jupyter notebooks covering the full ML pipeline
Generator	LLM-enhanced synthetic data creation with monitoring
Tests	Comprehensive pytest suite with CI/CD
Docs	MkDocs Material documentation site

🎓 Learning Path

New to the project? Follow this recommended learning path:

Getting Started - Set up environment and run first predictions
CLI Usage - Master the command-line interface
Dataset Generation - Understand the synthetic data
Modeling & Evaluation - Deep dive into the ML pipeline
Notebooks Overview - Explore interactive analysis
Development Guide - Contribute to the project

🔬 Technical Highlights

Shared Preprocessing Pipeline

Consistent text cleaning across training and inference
Unicode normalization and punctuation cleanup
TF-IDF feature extraction with 5k feature limit
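
A shared cleaning function is what keeps training and inference consistent: both paths call the same code before vectorization. The sketch below is an assumption about what such a function might look like, not the project's implementation; the name `clean_text`, the punctuation mapping, and the lowercasing step are illustrative choices. The TF-IDF step itself is not shown.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode, tidy punctuation, and collapse whitespace."""
    # NFKC folds compatibility characters (full-width letters, ligatures, etc.).
    text = unicodedata.normalize("NFKC", text)
    # Map curly quotes and dashes to plain ASCII equivalents.
    text = text.translate(str.maketrans({
        "\u2018": "'", "\u2019": "'",
        "\u201c": '"', "\u201d": '"',
        "\u2013": "-", "\u2014": "-",
    }))
    # Collapse runs of the same punctuation mark ("!!!" becomes "!").
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)
    # Collapse whitespace and lowercase for a stable vocabulary (assumed step).
    return re.sub(r"\s+", " ", text).strip().lower()


cleaned = clean_text("User  reported \u201csuspicious\u201d e\uff4dail!!!  ")
```

Here the full-width `ｍ`, the curly quotes, the repeated exclamation marks, and the extra spaces are all normalized before any features are extracted.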

Uncertainty-Aware Predictions

Configurable confidence thresholds
Explicit "uncertain" fallback for ambiguous cases
Scenario-driven behavior matching SOC reality
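
The threshold idea can be shown without any ML library: given per-class probabilities, return the top class only when it clears a configurable bar. This is a simplified sketch; the function name, the 0.5 default, and the label names are assumptions, and the project's real fallback logic may be more involved.

```python
from typing import Dict, Tuple


def predict_with_uncertainty(probabilities: Dict[str, float],
                             threshold: float = 0.5) -> Tuple[str, float]:
    """Return the top label, or "uncertain" when its probability is below threshold."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    label, score = max(probabilities.items(), key=lambda item: item[1])
    if score < threshold:
        return "uncertain", score
    return label, score


confident = predict_with_uncertainty({"phishing": 0.82, "malware": 0.12, "benign": 0.06})
ambiguous = predict_with_uncertainty({"phishing": 0.41, "malware": 0.38, "benign": 0.21})
strict = predict_with_uncertainty({"phishing": 0.82, "malware": 0.18}, threshold=0.9)
```

Raising the threshold trades coverage for precision: the `strict` call rejects a prediction that the default setting would accept.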

LLM Integration

Privacy-first local inference (llama.cpp)
JSON parsing guardrails
SOC keyword validation
Deterministic rationale generation
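
One way to guard against malformed or invented LLM output is to parse the reply strictly and accept only labels from a known set. The sketch below illustrates that idea; the label set, the `label` field name, and the function name are hypothetical, and the allow-list is a simplified stand-in for the project's SOC keyword validation, which is not specified here.

```python
import json
import re
from typing import Optional

# Hypothetical label set; the project's real categories may differ.
ALLOWED_LABELS = {"phishing", "malware", "access_abuse", "data_exfiltration"}


def parse_second_opinion(raw_reply: str) -> Optional[str]:
    """Extract a validated label from an LLM reply, or None if it cannot be trusted."""
    # Models often wrap JSON in extra prose, so take the outermost {...} span.
    match = re.search(r"\{.*\}", raw_reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("label")
    if not isinstance(label, str):
        return None
    label = label.strip().lower()
    # Reject anything outside the known set (guards against invented categories).
    return label if label in ALLOWED_LABELS else None


ok = parse_second_opinion('Sure! {"label": "Phishing", "rationale": "credential lure"}')
invented = parse_second_opinion('{"label": "quantum_attack"}')
broken = parse_second_opinion('{"label": "malware"')
```

Returning `None` on any failure lets the caller fall back to the baseline result instead of trusting an unparseable reply.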

Production-Grade Tooling

Checkpoint-based resumable generation
Real-time progress monitoring
Resource efficiency tracking
Comprehensive error handling
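
Checkpoint-based resumption can be sketched as: persist progress after each unit of work, and on startup continue from whatever was saved. The code below is an illustration under assumptions, not the generator's actual code; the checkpoint file layout, the function names, and the `stop_after` parameter (used only to simulate an interruption) are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, List


def generate_with_checkpoint(total: int, checkpoint_path: str,
                             make_event: Callable[[int], dict],
                             stop_after: int = -1) -> List[dict]:
    """Generate `total` events, persisting progress so a rerun resumes where it stopped."""
    events: List[dict] = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            events = json.load(fh)["events"]
    for index in range(len(events), total):
        if stop_after >= 0 and index >= stop_after:
            break  # Simulates an interruption for the demo below.
        events.append(make_event(index))
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written checkpoint. A real run would checkpoint every N events.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump({"events": events}, fh)
        os.replace(tmp_path, checkpoint_path)
    return events


calls: List[int] = []


def fake_event(index: int) -> dict:
    calls.append(index)  # Records which indices were actually generated.
    return {"id": index, "text": f"incident {index}"}


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    first = generate_with_checkpoint(5, path, fake_event, stop_after=2)
    resumed = generate_with_checkpoint(5, path, fake_event)
```

The second call regenerates nothing from the first run: `calls` contains each index exactly once.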

📚 Documentation Structure

User Guide

Learn how to use the tools and interfaces

CLI Usage
Streamlit UI
Dataset Generation
Configuration

Technical Deep Dive

Understand the architecture and implementation

Architecture
Model Information
Modeling & Evaluation
LLM Integration

Development

Contribute to the project

Development Guide
Testing
API Reference
Contributing

Reference

Additional information and resources

Limitations & Safety
MITRE Attribution
FAQ
Glossary

🌟 Project Goals

This project aims to demonstrate:

✅ End-to-end ML pipeline from data generation to deployment
✅ Uncertainty-aware classification for real-world ambiguity
✅ Privacy-first LLM integration for enhanced intelligence
✅ Production-grade monitoring and observability
✅ Interactive visualizations and analytics
✅ Comprehensive documentation and testing

🤝 Contributing

Contributions are welcome, including:

🐛 Bug reports
💡 Feature requests
📖 Documentation improvements
🔧 Code contributions

See our Contributing Guide to get started.

📄 License

This project is licensed under the Apache License 2.0. See LICENSE for details.
