NLP-Driven Incident Triage

Educational cybersecurity incident triage platform demonstrating intelligent classification through Natural Language Processing, LLM enhancement, and uncertainty-aware predictions.

  • Quick Start

Get up and running in minutes with our streamlined setup guide

Getting Started

  • CLI Tool

Powerful command-line interface for incident classification

CLI Usage

  • Web Interface

Interactive Streamlit UI with visual analytics and bulk processing

UI Guide

  • Dataset Generation

Create synthetic SOC datasets with LLM enhancement

Data & Generator


Overview

This is an educational and research platform that demonstrates cybersecurity incident triage with Natural Language Processing. It shows how analyst-style narratives can be converted into structured incident categories using a transparent, reproducible ML workflow.

Educational Project - Not Production IR Tooling

This project is designed for education, research, and portfolio demonstration.
It is not a drop-in replacement for enterprise incident response systems and should not be deployed unsupervised in a live SOC environment.


✨ Key Features

  • LLM-Enhanced Generation

Local llama.cpp models for privacy-first intelligent dataset creation with sanitization and caching

  • Uncertainty-Aware Classification

TF-IDF + Logistic Regression with configurable thresholds and intelligent fallback handling

  • Second-Opinion Engine

LLM assistance for uncertain cases with JSON guardrails and hallucination prevention

  • Interactive Analytics

Streamlit UI with real-time classification, bulk analysis, and visual threat intelligence

  • Production Monitoring

Real-time progress tracking, ETA calculation, and resource efficiency metrics

  • Research-Grade Dataset

100k synthetic incidents with MITRE ATT&CK enrichment and realistic noise


🏗️ Architecture Overview

graph TB
    A[Data Generation] -->|LLM Rewriter| B[Synthetic Dataset]
    B --> C[Preprocessing Pipeline]
    C --> D[TF-IDF Vectorization]
    D --> E{Baseline Classifier}
    E -->|High Confidence| F[Direct Classification]
    E -->|Low Confidence| G[LLM Second Opinion]
    G --> H[Final Decision]
    F --> H
    H --> I[CLI Output]
    H --> J[Streamlit UI]
    H --> K[JSON Export]

    style E fill:#9c27b0,stroke:#7b1fa2,color:#fff
    style G fill:#ff9800,stroke:#f57c00,color:#fff
    style H fill:#4caf50,stroke:#388e3c,color:#fff

🎯 Use Cases

  • Learn NLP techniques for cybersecurity
  • Understand uncertainty-aware classification
  • Explore MITRE ATT&CK framework integration
  • Study SOC automation concepts
  • Prototype triage automation ideas
  • Experiment with LLM-enhanced generation
  • Test classification algorithms
  • Develop synthetic security datasets
  • Demonstrate ML engineering skills
  • Showcase end-to-end project development
  • Highlight production-grade tooling
  • Present interactive visualizations

🚀 Quick Examples

CLI Classification

# Basic incident analysis
nlp-triage "User reported suspicious email with attachment"

# JSON output for scripting
nlp-triage --json "Multiple failed login attempts detected"

# LLM-assisted bulk processing
nlp-triage --llm-second-opinion \
  --input-file incidents.txt \
  --output-file results.jsonl

Dataset Generation

# Quick generation (1000 incidents)
python generator/generate_cyber_incidents.py --n-events 1000

# Production with monitoring
./generator/launch_generator.sh 50000 my_dataset
./generator/monitor_generation.sh my_dataset --watch

Streamlit UI

# Launch interactive interface
streamlit run ui_premium.py

📊 What's Inside

| Component | Description |
|-----------|-------------|
| Dataset | 100k synthetic SOC incidents with multi-perspective narratives |
| Models | TF-IDF vectorizer + Logistic Regression baseline |
| CLI | Rich-formatted command-line interface with uncertainty logic |
| UI | Streamlit web application with visual analytics |
| Notebooks | 9 Jupyter notebooks covering full ML pipeline |
| Generator | LLM-enhanced synthetic data creation with monitoring |
| Tests | Comprehensive pytest suite with CI/CD |
| Docs | MkDocs Material documentation site |

🎓 Learning Path

New to the project? Follow this recommended learning path:

  1. Getting Started - Set up environment and run first predictions
  2. CLI Usage - Master the command-line interface
  3. Dataset Generation - Understand the synthetic data
  4. Modeling & Evaluation - Deep dive into the ML pipeline
  5. Notebooks Overview - Explore interactive analysis
  6. Development Guide - Contribute to the project

🔬 Technical Highlights

Shared Preprocessing Pipeline

  • Consistent text cleaning across training and inference
  • Unicode normalization and punctuation cleanup
  • TF-IDF feature extraction with 5k feature limit
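
The cleaning steps listed above can be illustrated with a short, standard-library-only sketch. The function name `clean_text`, the specific punctuation rules, and the lowercasing step are assumptions made for illustration, not the project's actual implementation. The point is that one function is called in both training and inference, so the vectorizer always sees text in the same form.

```python
import re
import unicodedata

# Hypothetical mapping of typographic quotes and dashes to ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})
_PUNCT_RUN = re.compile(r"([!?.,;:\-])\1+")  # runs of the same punctuation mark
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Illustrative shared cleaner, meant to be called at training and inference time."""
    # NFKC folds compatibility characters (full-width letters, non-breaking spaces).
    text = unicodedata.normalize("NFKC", text)
    # Replace curly quotes and long dashes with plain ASCII.
    text = text.translate(_QUOTE_MAP)
    # Collapse repeated punctuation such as "!!!" into a single mark.
    text = _PUNCT_RUN.sub(r"\1", text)
    # Collapse whitespace runs and trim the ends.
    text = _WHITESPACE.sub(" ", text).strip()
    return text.lower()


if __name__ == "__main__":
    print(clean_text("  Multiple   failed\u00a0LOGIN attempts!!!  "))
```

The cleaned strings would then go to a TF-IDF vectorizer capped at 5,000 features. If the project uses scikit-learn (the source does not name the library), that cap would correspond to `TfidfVectorizer(max_features=5000)`.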

Uncertainty-Aware Predictions

  • Configurable confidence thresholds
  • Fallback to an "uncertain" outcome for ambiguous cases
  • Scenario-driven behavior matching SOC reality
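
A confidence threshold with a fallback can be sketched in a few lines. The label name `uncertain`, the default threshold of 0.5, and the example category names are assumptions for illustration. The sketch takes class probabilities (such as those a Logistic Regression model produces) as a plain dictionary so that it stays independent of any particular library.

```python
from typing import Dict, Tuple

UNCERTAIN_LABEL = "uncertain"  # hypothetical fallback label


def classify_with_threshold(
    probs: Dict[str, float], threshold: float = 0.5
) -> Tuple[str, float]:
    """Return (label, confidence), falling back when the top class is too weak."""
    if not probs:
        raise ValueError("probs must not be empty")
    # Pick the class with the highest probability.
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    # Below the threshold, refuse to commit to a category.
    if confidence < threshold:
        return UNCERTAIN_LABEL, confidence
    return label, confidence


if __name__ == "__main__":
    print(classify_with_threshold({"phishing": 0.82, "malware": 0.18}))
    print(classify_with_threshold(
        {"phishing": 0.41, "malware": 0.39, "access_abuse": 0.20}
    ))
```

Returning the confidence alongside the fallback label lets a caller decide what to do next, for example routing the case to an LLM second opinion or to a human analyst.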

LLM Integration

  • Privacy-first local inference (llama.cpp)
  • JSON parsing guardrails
  • SOC keyword validation
  • Deterministic rationale generation
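
The idea behind JSON parsing guardrails can be sketched as follows. The reply schema (`label`, `rationale`), the function name `parse_llm_reply`, and the allowed label set are hypothetical. The project's real schema and validation rules may differ. The sketch rejects anything that is not valid JSON or that names a category outside a known set, which is one simple way to stop an invented label from reaching the final decision.

```python
import json
from typing import Optional

# Hypothetical closed set of categories the classifier knows about.
ALLOWED_LABELS = {"phishing", "malware", "access_abuse"}


def parse_llm_reply(raw: str) -> Optional[dict]:
    """Return a validated {'label', 'rationale'} dict, or None if unusable."""
    # Models often wrap JSON in extra prose, so take the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    if not isinstance(label, str):
        return None
    label = label.strip().lower()
    # Reject any category the model made up.
    if label not in ALLOWED_LABELS:
        return None
    rationale = data.get("rationale", "")
    if not isinstance(rationale, str):
        rationale = ""
    return {"label": label, "rationale": rationale}


if __name__ == "__main__":
    print(parse_llm_reply('Sure! {"label": "Phishing", "rationale": "credential lure"}'))
    print(parse_llm_reply('{"label": "alien_invasion"}'))
```

When the function returns `None`, a caller would typically keep the baseline classifier's result rather than trust the LLM output.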

Production-Grade Tooling

  • Checkpoint-based resumable generation
  • Real-time progress monitoring
  • Resource efficiency tracking
  • Comprehensive error handling
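
Checkpoint-based resumable generation can be illustrated with a minimal sketch. The file layout, the `next_index` field, and the placeholder incident text are all assumptions. The project's generator scripts are not reproduced here. The sketch records the next index to generate in a small JSON file and writes it atomically, so an interrupted run can pick up where the last checkpoint left off.

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> int:
    """Return the next index to generate, or 0 when no checkpoint exists."""
    if not path.exists():
        return 0
    return int(json.loads(path.read_text())["next_index"])


def save_checkpoint(path: Path, next_index: int) -> None:
    """Write the checkpoint to a temp file, then rename it into place."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump({"next_index": next_index}, fh)
    os.replace(tmp, path)  # atomic rename: never leaves a half-written file


def generate(n_events: int, out_path: Path, ckpt_path: Path, every: int = 100) -> int:
    """Append events up to n_events, resuming from the checkpoint. Returns rows written."""
    start = load_checkpoint(ckpt_path)
    written = 0
    with out_path.open("a") as out:
        for i in range(start, n_events):
            # Placeholder record; a real generator would build an incident here.
            out.write(json.dumps({"id": i, "text": f"placeholder incident {i}"}) + "\n")
            written += 1
            if (i + 1) % every == 0:
                out.flush()
                save_checkpoint(ckpt_path, i + 1)
    save_checkpoint(ckpt_path, max(start, n_events))
    return written


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    print(generate(250, workdir / "events.jsonl", workdir / "ckpt.json"))
```

One limitation of this sketch: rows written after the last checkpoint but before a crash would be generated again on resume, so a fuller implementation would need to truncate or deduplicate the output file.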

📚 Documentation Structure

  • User Guide

Learn how to use the tools and interfaces

CLI Usage
Streamlit UI
Dataset Generation
Configuration

  • Technical Deep Dive

Understand the architecture and implementation

Architecture
Model Information
Modeling & Evaluation
LLM Integration

  • Development

Contribute to the project

Development Guide
Testing
API Reference
Contributing

  • Reference

Additional information and resources

Limitations & Safety
MITRE Attribution
FAQ
Glossary


🌟 Project Goals

This project aims to demonstrate:

✅ End-to-end ML pipeline from data generation to deployment
✅ Uncertainty-aware classification for real-world ambiguity
✅ Privacy-first LLM integration for enhanced intelligence
✅ Production-grade monitoring and observability
✅ Interactive visualizations and analytics
✅ Comprehensive documentation and testing


🤝 Contributing

Contributions are welcome, including:

  • 🐛 Bug reports
  • 💡 Feature requests
  • 📖 Documentation improvements
  • 🔧 Code contributions

See our Contributing Guide to get started.


📄 License

This project is licensed under the Apache License 2.0. See LICENSE for details.