System Architecture

Overview

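The NLP-Driven Incident Triage system is organized into four layers: data, processing, intelligence, and interface. A TF-IDF and Logistic Regression baseline classifies each incident, and low-confidence predictions are routed to a local LLM for a second opinion before the final decision is returned.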

Component Diagram

graph TB
    subgraph "Data Layer"
        A[Synthetic Generator] -->|LLM Enhancement| B[Dataset CSV]
        B --> C[Checkpoint System]
    end

    subgraph "Processing Layer"
        D[Preprocessing] --> E[TF-IDF Vectorization]
        E --> F[Logistic Regression]
    end

    subgraph "Intelligence Layer"
        G[Baseline Classifier] --> H{Confidence Check}
        H -->|High| I[Direct Classification]
        H -->|Low| J[LLM Second Opinion]
        J --> K[Final Decision]
        I --> K
    end

    subgraph "Interface Layer"
        L[CLI Tool]
        M[Streamlit UI]
        N[JSON API]
    end

    B --> D
    F --> G
    K --> L
    K --> M
    K --> N

Components

Data Generation

Generator: Creates synthetic incidents with MITRE enrichment
LLM Rewriter: Enhances narratives using llama.cpp
Checkpoint System: Enables resumable generation
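Resumable generation can be sketched roughly as follows. This is a minimal illustration and not the project's actual checkpoint code: the function names (`make_incident`, `save_checkpoint`, `generate`), the CSV columns, and the JSON checkpoint layout are all assumptions made for the example.

```python
import csv
import json
import os
from pathlib import Path


def make_incident(index: int) -> dict:
    """Stand-in for the real generator (hypothetical fields)."""
    return {"id": index, "description": f"Synthetic incident {index}"}


def save_checkpoint(ckpt_path: Path, next_index: int) -> None:
    """Write the checkpoint to a temp file, then swap it into place."""
    tmp_path = ckpt_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps({"next_index": next_index}), encoding="utf-8")
    os.replace(tmp_path, ckpt_path)


def generate(total: int, csv_path: Path, ckpt_path: Path, every: int = 100) -> int:
    """Append incidents to csv_path, resuming from ckpt_path if it exists.

    Returns the number of rows written during this call.
    """
    start = 0
    if ckpt_path.exists():
        start = json.loads(ckpt_path.read_text(encoding="utf-8"))["next_index"]

    write_header = not csv_path.exists()
    written = 0
    with csv_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["id", "description"])
        if write_header:
            writer.writeheader()
        for i in range(start, total):
            writer.writerow(make_incident(i))
            written += 1
            # Persist progress every `every` rows and at the very end.
            if (i + 1) % every == 0 or i + 1 == total:
                fh.flush()
                save_checkpoint(ckpt_path, i + 1)
    return written
```

One caveat of this sketch: rows written after the last checkpoint but before a crash would be generated again on resume, so a real implementation would likely deduplicate by ID or truncate the CSV back to the checkpointed row count.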

Processing Pipeline

Text Cleaning: Unicode normalization, lowercasing, punctuation handling
TF-IDF: Bag-of-words with bigrams (~5k features)
Classifier: Logistic Regression with class balancing
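The three stages above map naturally onto a scikit-learn pipeline. The sketch below is illustrative only: the NFKC normalization form, the choice to replace punctuation with spaces, `max_iter=1000`, and the toy training data are assumptions, and `clean_text` and `build_pipeline` are hypothetical names rather than the project's own.

```python
import re
import unicodedata

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def clean_text(text: str) -> str:
    """Normalize Unicode, lowercase, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_pipeline():
    """TF-IDF over unigrams and bigrams, then balanced Logistic Regression."""
    vectorizer = TfidfVectorizer(
        preprocessor=clean_text,  # shared cleaning step
        ngram_range=(1, 2),       # single words plus bigrams
        max_features=5000,        # roughly the ~5k features mentioned above
    )
    classifier = LogisticRegression(class_weight="balanced", max_iter=1000)
    return make_pipeline(vectorizer, classifier)


# Toy data, invented purely for this example.
texts = [
    "User reported a phishing email with a suspicious link",
    "Credential harvesting page sent by email",
    "Ransomware payload encrypted the file server",
    "Trojan detected on an employee workstation",
]
labels = ["phishing", "phishing", "malware", "malware"]

model = build_pipeline()
model.fit(texts, labels)

# One probability per class, in the order given by model.classes_.
probs = model.predict_proba(["Suspicious email asking for credentials"])[0]
```

Passing the cleaning function as the vectorizer's `preprocessor` keeps training and inference on the same cleaning path, which is the consistency property the shared pipeline is meant to provide.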

Intelligence Layer

Uncertainty Detection: Configurable confidence thresholds
LLM Integration: Local llama.cpp for second opinions
Guardrails: JSON parsing, keyword validation, label normalization
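The guardrails listed above might look roughly like the sketch below. The label set, the keyword lists, the expected `{"label": ...}` reply shape, and every function name here are assumptions for illustration; the project's real prompt format and validation rules may differ.

```python
import json
import re
from typing import Optional

# Hypothetical label set and keyword hints, invented for this example.
ALLOWED_LABELS = {"phishing", "malware", "data_exfiltration"}
KEYWORDS = {
    "phishing": {"email", "link", "credential"},
    "malware": {"ransomware", "trojan", "payload"},
    "data_exfiltration": {"upload", "transfer", "exfiltration"},
}


def normalize_label(raw: str) -> Optional[str]:
    """Map variants like 'Data Exfiltration' onto a canonical label."""
    label = re.sub(r"[\s\-]+", "_", raw.strip().lower())
    return label if label in ALLOWED_LABELS else None


def parse_llm_reply(reply: str) -> Optional[str]:
    """Pull a JSON object out of free-form LLM text and return its label."""
    # Greedy match from the first '{' to the last '}' in the reply.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("label")
    return normalize_label(label) if isinstance(label, str) else None


def keyword_supported(label: str, incident_text: str) -> bool:
    """Check that at least one keyword for the label appears in the text."""
    words = set(re.findall(r"\w+", incident_text.lower()))
    return bool(KEYWORDS[label] & words)


def second_opinion(reply: str, incident_text: str, fallback: str) -> str:
    """Accept the LLM label only if it parses and passes keyword validation."""
    label = parse_llm_reply(reply)
    if label is None or not keyword_supported(label, incident_text):
        return fallback
    return label
```

Falling back to the baseline label whenever parsing or validation fails means a malformed LLM reply cannot make the result worse than the baseline alone.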

Interfaces

CLI: Rich-formatted terminal interface
Streamlit UI: Interactive web dashboard
JSON Output: Scriptable batch processing

Data Flow

Generation: Synthetic incidents created with optional LLM enhancement
Storage: CSV format with checkpointing for large datasets
Preprocessing: Shared cleaning pipeline ensures consistency
Vectorization: TF-IDF transforms text to numerical features
Classification: Baseline model produces probability distribution
Uncertainty Handling: Low-confidence cases routed to LLM
Output: Results delivered via CLI, UI, or JSON
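The last three steps of this flow (classification, uncertainty handling, output) can be sketched as a single routing function. The 0.6 threshold, the result fields, and the stub `fake_baseline` and `fake_llm` callables are assumptions for illustration; in the real system the baseline would be the trained classifier and the LLM call would go through llama.cpp.

```python
import json
from typing import Callable, Dict

Probabilities = Dict[str, float]


def triage(
    text: str,
    baseline: Callable[[str], Probabilities],
    llm: Callable[[str], str],
    threshold: float = 0.6,  # assumed default; the real value is configurable
) -> dict:
    """Use the baseline label when confident, otherwise ask the LLM."""
    probs = baseline(text)
    label, confidence = max(probs.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return {"label": label, "confidence": confidence, "source": "baseline"}
    return {"label": llm(text), "confidence": confidence, "source": "llm"}


def fake_baseline(text: str) -> Probabilities:
    """Stub classifier: confident only when the word 'phishing' appears."""
    if "phishing" in text.lower():
        return {"phishing": 0.9, "malware": 0.1}
    return {"phishing": 0.55, "malware": 0.45}


def fake_llm(text: str) -> str:
    """Stub LLM that always answers 'malware'."""
    return "malware"


incidents = [
    "Phishing email reported by finance",
    "Unusual process spawned on a workstation",
]
results = [triage(text, fake_baseline, fake_llm) for text in incidents]

# JSON output suitable for scripted batch processing.
print(json.dumps(results, indent=2))
```

Recording the `source` of each decision alongside the baseline confidence makes it possible to see later how often the LLM path was taken and to tune the threshold accordingly.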

Technology Stack

Python 3.11+: Core language
scikit-learn: ML framework
llama-cpp-python: Local LLM inference
Streamlit: Web UI framework
Rich: Terminal formatting
pytest: Testing framework
MkDocs Material: Documentation

See Model Information for ML details.