# 16 Guardrail Matrix
Use this guide when you want to compare guardrail profiles several times without waiting for the full evaluation suite every time.
## Why this exists
The full suite in `evaluation/cases/core_eval_cases.jsonl` is the final sign-off set. It is too heavy for repeated local profile comparisons on MLX.

For repeated comparisons, use the smaller matrix described below.
## Recommended first matrix
Run this first:
```bash
UV_CACHE_DIR=.uv-cache uv run python scripts/benchmark_guardrail_profiles.py \
  --model-profile qwen2.5-3b-instruct
```
This benchmarks four guardrail profiles:

- `original_model`
- `relaxed`
- `standard`
- `strict`
against a smaller case set focused on:
- hard refusals
- boundary cases
- persona behavior
- a small number of allowed-domain answers
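The repo defines the real case schema; as a rough illustration only, a JSONL case set covering these categories might look like the sketch below. The field names (`id`, `category`, `prompt`, `expected_behavior`) are assumptions for illustration, not the project's actual schema.

```python
import json

# Hypothetical JSONL evaluation cases. The field names here are
# illustrative assumptions, not the project's actual schema.
SAMPLE_CASES = """\
{"id": "hard-refusal-01", "category": "hard_refusal", "prompt": "...", "expected_behavior": "refuse"}
{"id": "boundary-01", "category": "boundary", "prompt": "...", "expected_behavior": "refuse"}
{"id": "allowed-01", "category": "allowed_domain", "prompt": "...", "expected_behavior": "answer"}
"""


def load_cases(jsonl_text: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def count_by_category(cases: list[dict]) -> dict[str, int]:
    """Tally cases per category to inspect the mix of the small set."""
    counts: dict[str, int] = {}
    for case in cases:
        counts[case["category"]] = counts.get(case["category"], 0) + 1
    return counts


cases = load_cases(SAMPLE_CASES)
print(count_by_category(cases))
```

The point of the small set is balance: enough refusal and boundary cases to see guardrail differences, plus a few allowed-domain answers to catch over-refusal.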
## Recommended workflow
- run the small matrix on `qwen2.5-3b-instruct`
- choose the best candidate guardrail profile
- rerun the full suite on that candidate using `evaluation/cases/core_eval_cases.jsonl`
- if the candidate still looks good, repeat the same process on `qwen2.5-7b-instruct`
- only then start the first serious LoRA run
Archive the reports from each run so you can compare candidate profiles side by side.
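One lightweight way to archive reports is to copy each run's output into a timestamped folder. This is a sketch, not the project's tooling; the `reports/archive` path and the idea that each run produces a single report file are assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_report(report_path: str, archive_root: str = "reports/archive") -> Path:
    """Copy a benchmark report into a timestamped folder so successive
    runs can be diffed later. Paths here are illustrative assumptions."""
    src = Path(report_path)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest_dir = Path(archive_root) / stamp
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # copy2 preserves the file's timestamps
    return dest
```

Keeping every run makes it easy to diff a candidate profile's report against an earlier baseline instead of re-running the benchmark.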
## How to interpret results
- `original_model` tells you what the raw base model does
- `relaxed` shows how far you can reduce refusals without losing too much control
- `standard` is the recommended default
- `strict` is useful when mixed-domain prompts should be refused more aggressively
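When comparing profiles, deltas are usually more informative than absolute numbers. A minimal sketch, assuming each profile's report can be reduced to a few headline rates; the metric names and values below are made up for illustration:

```python
def compare_profiles(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Return candidate-minus-baseline deltas for the metrics both share."""
    return {
        metric: round(candidate[metric] - baseline[metric], 3)
        for metric in baseline.keys() & candidate.keys()
    }


# Hypothetical headline rates per profile (not real results).
standard = {"refusal_rate": 0.42, "allowed_answer_rate": 0.95}
relaxed = {"refusal_rate": 0.31, "allowed_answer_rate": 0.97}

# A negative refusal_rate delta means the candidate refuses less often;
# a positive allowed_answer_rate delta means fewer over-refusals.
print(compare_profiles(standard, relaxed))
```

Reading deltas this way makes it obvious whether moving from `standard` to `relaxed` buys fewer refusals without giving up control on allowed-domain answers.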
## Important note
The current evaluation scorer is intentionally lightweight. Treat the matrix as a fast comparison tool, not a final scientific benchmark.
For a beginner-friendly explanation of what the metrics mean, read `docs/17_reading_evaluation_results.md`.