# Modeling & Evaluation

## Text Representation
The baseline models use TF–IDF over cleaned incident descriptions:
- Unigrams + bigrams
- English stopword removal
- `min_df` and `max_df` thresholds to drop extremely rare / frequent terms
- Max feature cap (e.g., 5,000 terms)
Cleaning is shared between training and inference via `triage.preprocess.clean_description`.
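A minimal sketch of a vectorizer with these settings, assuming scikit-learn's `TfidfVectorizer`; the exact `min_df`/`max_df` values below are illustrative, not the project's:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams + bigrams
    stop_words="english",  # built-in English stopword list
    min_df=5,              # illustrative: drop terms seen in fewer than 5 documents
    max_df=0.9,            # illustrative: drop terms seen in over 90% of documents
    max_features=5000,     # cap the vocabulary at 5,000 terms
)
```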
## Models compared
Notebook `08_Model_Comparison` trains and compares several models:
- `logreg_baseline` – Logistic Regression (multi-class, one-vs-rest)
- `linear_svm` – Linear SVM classifier
- `random_forest` – Random Forest over TF–IDF features
On the synthetic test set (~20k rows), these models all achieve around 92% accuracy, with:
- High precision/recall for clear-cut classes (`phishing`, `malware`, `web_attack`, `data_exfiltration`)
- Lower but still strong performance on more ambiguous narratives (`benign_activity`, overlapping `policy_violation` scenarios)
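A rough sketch of what the comparison loop might look like, assuming scikit-learn; `X_train`/`X_test` (TF–IDF matrices) and `y_train`/`y_test` (event-type labels) are hypothetical variables standing in for the notebook's data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

models = {
    "logreg_baseline": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)  # hypothetical TF-IDF features + labels
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy={accuracy:.3f}")
```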
## Scenario-based evaluation

Notebook `07_Scenario_Based_Evaluation`:
- Tests the model on hand-crafted narratives that look like real tickets or alerts
- Compares expected `event_type` vs model prediction
- Surfaces realistic errors:
    - Some benign operations labeled as `web_attack` or `policy_violation`
    - Some policy violations near data movement labeled as `data_exfiltration`
    - Borderline authentication issues split between `access_abuse` and `benign_activity`
This helps validate that the model is learning semantically meaningful patterns rather than just memorizing templates.
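A minimal sketch of this kind of check, assuming `load_vectorizer_and_model()` returns a `(vectorizer, model)` pair (the return shape is an assumption, and the scenarios below are invented examples, not the notebook's):

```python
from triage.model import load_vectorizer_and_model
from triage.preprocess import clean_description

# Invented scenarios for illustration: (narrative, expected event_type).
SCENARIOS = [
    ("Employee clicked a link in an email spoofing the IT helpdesk", "phishing"),
    ("Nightly backup job copied the finance share to archive storage", "benign_activity"),
]

vectorizer, model = load_vectorizer_and_model()  # assumed (vectorizer, model) tuple

for narrative, expected in SCENARIOS:
    features = vectorizer.transform([clean_description(narrative)])
    predicted = model.predict(features)[0]
    status = "ok" if predicted == expected else "MISMATCH"
    print(f"[{status}] expected={expected} predicted={predicted}")
```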
## Training artifacts & reproducibility
- Notebooks 02–04 export `models/vectorizer.joblib` and `models/baseline_logreg.joblib`.
- `triage.model.load_vectorizer_and_model()` (used by the CLI and tests) expects those filenames, so keep them consistent.
- To experiment with alternate classifiers, either:
    - Update `notebooks/08_model_comparison.ipynb` and drop extra `.joblib` files next to the baseline, or
    - Edit `src/triage/model.py` / `src/triage/cli.py` to point at the new artifact names.
- The CLI always calls `predict_proba`, so stick to classifiers that expose that method or wrap them with `CalibratedClassifierCV` (see the sketch below).
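For example, a sketch of wrapping a `LinearSVC` (which lacks `predict_proba`) so it stays CLI-compatible; `X_train`/`y_train` are hypothetical training data:

```python
import joblib
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# LinearSVC exposes only decision_function; calibration adds predict_proba.
calibrated_svm = CalibratedClassifierCV(LinearSVC(), cv=5)
calibrated_svm.fit(X_train, y_train)  # hypothetical TF-IDF features + labels

# Reusing the baseline filename keeps load_vectorizer_and_model() working unchanged.
joblib.dump(calibrated_svm, "models/baseline_logreg.joblib")
```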
## Takeaways
- TF–IDF + simple linear models can perform surprisingly well on structured, synthetic incident narratives.
- For real SOC deployment, you would likely:
- Incorporate structured features (severity, log source, time of day, etc.)
- Move to transformer-based embeddings
- Tighten evaluation on true production tickets
When experimenting, rerun the `pytest` suite to confirm the refreshed artifacts still contain the expected class labels.
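One possible shape for such a check (a hypothetical test; the label set is inferred from the classes mentioned on this page and may not match the dataset exactly):

```python
from triage.model import load_vectorizer_and_model

# Labels inferred from this page; adjust if the dataset defines others.
EXPECTED_LABELS = {
    "phishing", "malware", "web_attack", "data_exfiltration",
    "benign_activity", "policy_violation", "access_abuse",
}

def test_artifact_class_labels():
    _, model = load_vectorizer_and_model()
    assert set(model.classes_) == EXPECTED_LABELS
```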