Notebooks Overview
The project includes 11 comprehensive Jupyter notebooks covering the complete machine learning pipeline from getting started through production-ready hybrid models. Each notebook has been enhanced with professional visualizations, comprehensive markdown analysis, and practical insights.
Notebook Sequence
0. 00_getting_started_tutorial.ipynb - Interactive Getting Started Guide ⭐ START HERE
Purpose: Beginner-friendly interactive tutorial introducing AlertSage fundamentals for new users and contributors.
Key Features:
- Environment Setup: Verification of Python, packages, models, and dataset
- Model Loading: Load pre-trained TF-IDF vectorizer and baseline logistic regression
- First Prediction: Step-by-step walkthrough of single incident analysis
- Batch Processing: Analyze 30 diverse incidents across all 10 event types
- 4 Interactive Visualizations:
    - Class distribution bar chart
    - Confidence score histogram with uncertainty threshold
    - Confidence by event type box plots
    - Confusion matrix heatmap
- Uncertainty Analysis: Understanding confidence thresholds (50%, 60%, 75%)
- LLM Integration: Conceptual overview of ML+LLM hybrid approach
- 3 Hands-On Exercises: Custom incident analysis, threshold experimentation, problematic case identification
- Next Steps Guide: Links to advanced notebooks, CLI usage, Streamlit UI, and documentation
Learning Outcomes: Understand incident triage workflow, interpret confidence scores, create visualizations, recognize when LLM assistance is needed, practice with real scenarios.
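For orientation, here is a minimal sketch of the first-prediction step, assuming the pre-trained artifacts are saved as models/tfidf_vectorizer.joblib and models/baseline_logreg.joblib (the file names are assumptions; adjust to your checkout):

```python
import joblib

# Load the pre-trained artifacts produced by notebooks 02-03
# (file names are assumptions; adjust to your repository layout)
vectorizer = joblib.load("models/tfidf_vectorizer.joblib")
model = joblib.load("models/baseline_logreg.joblib")

incident = "Multiple failed SSH logins followed by a successful root login from a new IP"
X = vectorizer.transform([incident])

probs = model.predict_proba(X)[0]
label = model.classes_[probs.argmax()]
print(f"Predicted: {label} (confidence {probs.max():.1%})")
```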
1. 01_explore_dataset.ipynb - Dataset Exploration & Quality Assessment
Purpose: Comprehensive exploratory data analysis (EDA) of the synthetic cybersecurity incident dataset.
Key Features:
- Event Type Distribution: Color-coded bar charts showing class balance
- Temporal Analysis: Incident distribution across 2024 with seasonal patterns
- Severity Assessment: Stacked bar charts showing severity levels per event type
- Log Source Analysis: Detection system coverage visualizations
- Geographic Distribution: Source/destination country analysis
- MITRE ATT&CK Mapping: Technique distribution across incident classes
Learning Outcomes: Understand dataset balance, temporal patterns, severity biasing, and log source diversity.
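A condensed EDA sketch covering the first two feature areas, assuming the generated dataset lives at data/incidents.csv with event_type and timestamp columns (path and column names are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Dataset path and column names are assumptions; adjust to the generated CSV
df = pd.read_csv("data/incidents.csv", parse_dates=["timestamp"])

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df["event_type"].value_counts().plot.bar(ax=axes[0], title="Event type distribution")
df["timestamp"].dt.month.value_counts().sort_index().plot.bar(
    ax=axes[1], title="Incidents per month"
)
plt.tight_layout()
plt.show()
```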
2. 02_prepare_text_and_features.ipynb - Feature Engineering & Text Preprocessing
Purpose: Transform raw text descriptions into TF-IDF feature matrices.
Key Features:
- Text cleaning pipeline (lowercase, punctuation removal)
- TF-IDF vectorization (max_features=3000, min_df=2)
- Feature sparsity analysis and vocabulary composition charts
- Stratified 80/20 train/test split
- Artifact export (vectorizer, matrices) to models/
Learning Outcomes: Master TF-IDF sparse matrices, scikit-learn pipelines, and joblib serialization.
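The core of the notebook in a few lines, using the TF-IDF settings quoted above (the dataset path and column names are assumptions):

```python
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/incidents.csv")  # path is an assumption

# Stratified 80/20 split, then vectorize (fit on training text only)
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["description"], df["event_type"],
    test_size=0.2, stratify=df["event_type"], random_state=42,
)
vectorizer = TfidfVectorizer(max_features=3000, min_df=2)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

print(f"Train matrix: {X_train.shape[0]} docs x {X_train.shape[1]} terms ({X_train.nnz} non-zeros)")
joblib.dump(vectorizer, "models/tfidf_vectorizer.joblib")
```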
3. 03_baseline_model.ipynb - Logistic Regression Baseline
Purpose: Train and evaluate baseline Logistic Regression classifier.
Key Features:
- LogisticRegression with multinomial objective
- Dual confusion matrices (raw counts + normalized percentages)
- Per-class performance bar charts
- Model persistence for CLI/UI
Performance: 92-95% overall accuracy; strong on malware and data_exfiltration, with most errors coming from web_attack/access_abuse confusion.
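Continuing from the split sketched under notebook 02, the baseline fit and evaluation reduce to roughly:

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

# scikit-learn uses the multinomial objective by default for multiclass targets
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, normalize="true", xticks_rotation=45)
joblib.dump(clf, "models/baseline_logreg.joblib")  # persisted for the CLI/UI
```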
4. 04_model_interpretability.ipynb - Feature Importance Analysis
Purpose: Explain model decisions through coefficient analysis.
Key Features:
- Top predictive terms per class (coefficient heatmaps)
- Weight distribution histograms
- Domain knowledge validation (sensible vs spurious correlations)
Learning Outcomes: Understand linear model coefficients as feature importance, validate model reasoning.
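Because the model is linear, per-class coefficients map directly to vocabulary terms. A sketch of the top-terms listing, continuing from the fitted vectorizer and classifier above:

```python
import numpy as np

feature_names = vectorizer.get_feature_names_out()
for class_idx, class_name in enumerate(clf.classes_):
    # Five strongest positive coefficients for this class
    top = np.argsort(clf.coef_[class_idx])[-5:][::-1]
    print(f"{class_name}: {', '.join(feature_names[i] for i in top)}")
```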
5. 05_inference_and_cli.ipynb - Prediction Workflow & CLI Testing
Purpose: Demonstrate inference and validate CLI consistency.
Key Features:
- Notebook inference with probability outputs
- CLI testing (python -m triage.cli)
- Interactive session with Ctrl+C handling
- Prediction comparison (notebook vs CLI)
Learning Outcomes: Master joblib loading, predict_proba() output, command-line packaging.
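A sketch of the parity check between notebook inference and the CLI; the artifact paths and the CLI's argument-passing convention are assumptions:

```python
import subprocess
import joblib

vectorizer = joblib.load("models/tfidf_vectorizer.joblib")  # paths are assumptions
clf = joblib.load("models/baseline_logreg.joblib")

incident = "Outbound transfer of 2 GB to an unknown external host over port 443"
probs = clf.predict_proba(vectorizer.transform([incident]))[0]
print(dict(zip(clf.classes_, probs.round(3))))

# Run the same incident through the CLI and compare the outputs
# (passing the incident as a positional argument is an assumption)
subprocess.run(["python", "-m", "triage.cli", incident], check=True)
```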
6. 06_model_visualization_and_insights.ipynb - Performance Analysis
Purpose: Comprehensive visualization and results interpretation.
Key Features:
- Probability distribution histograms
- Per-class metrics grouped bar charts
- Confidence calibration analysis
- Comprehensive markdown: Overall performance, confusion patterns, feature validation, deployment readiness, future enhancements
Learning Outcomes: Interpret performance in business context, assess calibration, identify improvement opportunities.
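The calibration check boils down to comparing predicted confidence with observed accuracy, one class at a time. A minimal sketch, continuing from the baseline fit:

```python
import numpy as np
from sklearn.calibration import calibration_curve

target = "malware"  # repeat for each class of interest
y_true = (np.asarray(y_test) == target).astype(int)
y_prob = clf.predict_proba(X_test)[:, list(clf.classes_).index(target)]

frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positive):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```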
7. 07_scenario_based_evaluation.ipynb - Edge Case Testing
Purpose: Validate model on curated test scenarios.
Key Features:
- Handcrafted incident scenarios (clear-cut, ambiguous, adversarial, novel phrasing)
- Expected vs predicted comparison
- Failure analysis with confidence scoring
Learning Outcomes: Understand model limitations, vocabulary gaps, misleading confidence.
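The scenario harness is essentially a labeled list run through the model; the scenarios below are illustrative stand-ins for the curated set:

```python
scenarios = [  # (text, expected label) pairs; examples are illustrative
    ("Endpoint AV flagged a trojan after a USB drive was connected", "malware"),
    ("Contractor accessed HR records outside business hours", "access_abuse"),
    ("SQL keywords detected in web form parameters", "web_attack"),
]
for text, expected in scenarios:
    probs = clf.predict_proba(vectorizer.transform([text]))[0]
    predicted = clf.classes_[probs.argmax()]
    mark = "OK  " if predicted == expected else "MISS"
    print(f"[{mark}] expected={expected} predicted={predicted} conf={probs.max():.2f}")
```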
8. 08_model_comparison.ipynb - Multi-Model Benchmarking
Purpose: Compare Logistic Regression, Linear SVM, and Random Forest.
Key Features:
- Side-by-side performance metrics
- Enhanced confusion matrices (dual plots + comparative heatmaps)
- Model agreement analysis
- Training time comparison
- Section 8.6 comprehensive analysis: Performance ranking, per-class winners, deployment recommendation
Learning Outcomes: Algorithm selection tradeoffs, when accuracy gains don't justify complexity, ensemble opportunities.
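A minimal version of the benchmarking loop, continuing from the TF-IDF split (hyperparameters are illustrative):

```python
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "LinearSVM": LinearSVC(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, est in models.items():
    start = time.perf_counter()
    est.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_test, est.predict(X_test))
    print(f"{name:>20}: accuracy={acc:.3f}, train_time={elapsed:.1f}s")
```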
9. 09_operational_decision_support.ipynb - Uncertainty & Thresholds
Purpose: Operationalize model with uncertainty quantification.
Key Features:
- Uncertainty metrics (confidence, entropy, top-2 gap)
- ROC/PR curves per class
- Confidence calibration diagrams
- SOC analyst escalation framework (auto-triage >0.9, review 0.7-0.9, escalate <0.7)
Learning Outcomes: Quantify prediction uncertainty, set confidence thresholds, design human-in-the-loop workflows.
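The three uncertainty metrics and the escalation tiers fit in a few lines, continuing from the baseline model:

```python
import numpy as np
from scipy.stats import entropy

probs = clf.predict_proba(X_test)
confidence = probs.max(axis=1)
prediction_entropy = entropy(probs, axis=1)  # higher = more uncertain
sorted_probs = np.sort(probs, axis=1)
top2_gap = sorted_probs[:, -1] - sorted_probs[:, -2]  # margin between top two classes

# Escalation tiers from the framework above
auto = (confidence > 0.9).mean()
review = ((confidence >= 0.7) & (confidence <= 0.9)).mean()
escalate = (confidence < 0.7).mean()
print(f"auto-triage {auto:.0%}, analyst review {review:.0%}, escalate {escalate:.0%}")
```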
10. 10_hybrid_model.ipynb - Text + Metadata Feature Fusion (NEW)
Purpose: Combine TF-IDF with structured metadata.
Key Features:
- Enhanced preprocessor with detailed text summaries
- Feature engineering: TF-IDF (3000) + structured (severity, log_source, protocol, ports, is_true_positive)
- ColumnTransformer for heterogeneous features
- 4-model comparison (TF-IDF only, metadata only, hybrid LogReg, hybrid RF)
- Performance decomposition and interpretation
Learning Outcomes: Master ColumnTransformer, understand feature fusion tradeoffs, design multi-input pipelines.
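A minimal sketch of the fusion pipeline; the column names (description, severity, log_source, protocol, src_port, dst_port, is_true_positive) and the df_train/df_test splits are assumptions based on the feature list above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Route each column type to the right transformer; unlisted columns are dropped
preprocessor = ColumnTransformer([
    ("text", TfidfVectorizer(max_features=3000, min_df=2), "description"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["severity", "log_source", "protocol"]),
    ("num", "passthrough", ["src_port", "dst_port", "is_true_positive"]),
])
hybrid = Pipeline([
    ("features", preprocessor),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# df_train / df_test are assumed pandas splits of the raw incident table
hybrid.fit(df_train, df_train["event_type"])
print(f"Hybrid accuracy: {hybrid.score(df_test, df_test['event_type']):.3f}")
```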
Getting Started
```bash
# Install dependencies
pip install -e ".[dev]"
pip install jupyterlab

# Launch Jupyter Lab
jupyter lab notebooks/
```
Recommended Reading Order
New users: 00 → 01 → 02 → 03 → 05 → 06-09 → 10
SOC analysts: 00 → 01 → 05 → 09
ML engineers: 00 → 03 → 04 → 08 → 10
Contributors: 00 → 01 → 02 → 03
Notebook Enhancements
✅ Professional Visualizations: Custom colormaps, dual confusion matrices, grouped bar charts
✅ Comprehensive Markdown: Results interpretation, deployment readiness, future enhancements
✅ Bug Fixes: Notebook 01 data alignment, notebook 05 duplicate output and Ctrl+C handling, notebook 09 histogram kde parameter, notebook 10 enhanced preprocessor outputs
✅ Code Quality: Consistent styling, reproducible seeds (random_state=42), modular cells
Common Issues & Solutions
"Model file not found": Run notebooks 02-03 to generate artifacts
"Dataset not found": Generate with ./launch_generator.sh or download pre-generated
Memory issues: Load subset with pd.read_csv(..., nrows=10000)
Plots not displaying: Add %matplotlib inline to first cell
Kernel crashes: Reduce max_features in TF-IDF vectorizer
Best Practices
- Run notebooks in order the first time (later notebooks depend on earlier artifacts)
- Clear outputs before committing (jupyter nbconvert --clear-output)
- Set random seeds for reproducibility
- Regenerate artifacts if the dataset changes
- Test on a subset before running the full 100K dataset
Additional Resources
- Dataset Details: Data & Generator Guide
- Production Scripts: Production Generation Guide