Evaluation Comparison: matrix_3b_standard.json -> matrix_3b_strict.json

  • Before: evaluation/reports/matrix_3b_standard.json
  • After: evaluation/reports/matrix_3b_strict.json

Chart source (Mermaid xychart-beta); the first bar series is Before, the second is After:

xychart-beta
    title "Before vs After Metrics"
    x-axis ["Pass Rate", "Restriction", "Citation", "Hallucination Proxy", "Domain Alignment"]
    y-axis "Score" 0 --> 1
    bar [0.750, 0.750, 0.000, 0.500, 0.750]
    bar [0.562, 0.562, 0.000, 0.312, 0.562]

Chart source (Mermaid xychart-beta) for case outcomes:

xychart-beta
    title "Case Outcomes"
    x-axis ["Improved", "Regressed", "Stable Failures"]
    y-axis "Cases" 0 --> 10
    bar [0, 3, 4]

| Metric | Before | After | Delta | Direction |
|---|---|---|---|---|
| pass_rate | 0.750 | 0.562 | -0.188 | worsened |
| restriction_compliance | 0.750 | 0.562 | -0.188 | worsened |
| citation_coverage | 0.000 | 0.000 | +0.000 | flat |
| hallucination_proxy | 0.500 | 0.312 | -0.188 | improved |
| domain_alignment | 0.750 | 0.562 | -0.188 | worsened |
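
The Delta and Direction columns can be reproduced with a small helper. This is a minimal sketch, not the report generator's actual code: `direction`, `LOWER_IS_BETTER`, and `METRICS` are hypothetical names, and treating `hallucination_proxy` as the only lower-is-better metric is an assumption inferred from the table, where its negative delta is labeled "improved".

```python
# Assumption: hallucination_proxy is the only metric where lower is better,
# inferred from the table labeling its negative delta as "improved".
LOWER_IS_BETTER = {"hallucination_proxy"}


def direction(metric: str, before: float, after: float, eps: float = 1e-9) -> str:
    """Label a metric change as 'improved', 'worsened', or 'flat'."""
    delta = after - before
    if abs(delta) < eps:
        return "flat"
    got_better = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
    return "improved" if got_better else "worsened"


# Before/after values copied from the metric table in this report.
METRICS = {
    "pass_rate": (0.750, 0.562),
    "restriction_compliance": (0.750, 0.562),
    "citation_coverage": (0.000, 0.000),
    "hallucination_proxy": (0.500, 0.312),
    "domain_alignment": (0.750, 0.562),
}

if __name__ == "__main__":
    for name, (before, after) in METRICS.items():
        delta = after - before
        print(f"{name} {before:.3f} {after:.3f} {delta:+.3f} "
              f"{direction(name, before, after)}")
```

Keeping the lower-is-better metrics in one explicit set makes the sign convention easy to audit, which matters here because the same -0.188 delta is labeled "worsened" for three metrics and "improved" for one.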

Case-level changes

  • Improved cases: none
  • Regressed cases: matrix::boundary::gaming-infra, matrix::boundary::movies-sre, matrix::persona::challenge-weak-plan
  • Stable failures: matrix::boundary::celebrity-saas, matrix::boundary::sports-sponsorship, matrix::boundary::trivia-backend, matrix::regulated::sports-tax
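
The three buckets above can be derived by comparing per-case pass/fail results from the two runs. The sketch below assumes each report can be reduced to a mapping from case id to a boolean "passed" flag; the actual JSON layout of the report files is not shown here, so `classify_cases` and its input shape are hypothetical. The case id `matrix::example::always-passes` is invented for illustration only.

```python
def classify_cases(
    before: dict[str, bool], after: dict[str, bool]
) -> dict[str, list[str]]:
    """Bucket cases present in both runs by how their pass/fail status changed."""
    buckets: dict[str, list[str]] = {
        "improved": [],
        "regressed": [],
        "stable_failures": [],
    }
    for case_id in sorted(before.keys() & after.keys()):
        passed_before, passed_after = before[case_id], after[case_id]
        if passed_after and not passed_before:
            buckets["improved"].append(case_id)
        elif passed_before and not passed_after:
            buckets["regressed"].append(case_id)
        elif not passed_before and not passed_after:
            buckets["stable_failures"].append(case_id)
        # Cases that pass in both runs are intentionally not listed.
    return buckets


# Illustrative subset: two ids from this report plus one hypothetical id.
example_before = {
    "matrix::boundary::gaming-infra": True,
    "matrix::boundary::celebrity-saas": False,
    "matrix::example::always-passes": True,  # hypothetical case id
}
example_after = {
    "matrix::boundary::gaming-infra": False,
    "matrix::boundary::celebrity-saas": False,
    "matrix::example::always-passes": True,
}

if __name__ == "__main__":
    print(classify_cases(example_before, example_after))
```

Only cases present in both runs are compared, so a case added or removed between runs never shows up as a regression or an improvement.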

Reading guide

  • A real improvement should raise pass_rate and preserve or improve restriction_compliance.
  • If hallucination_proxy drops only because the model refuses more, treat that as suspicious rather than automatically better.
  • Regressed cases matter more than average deltas when they are in hard-refusal or regulated-domain prompts.
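
The reading guide can be turned into a rough automated check. This is a sketch under stated assumptions rather than part of the evaluation tooling: `review` is a hypothetical helper, the report does not include a refusal rate, so "hallucination_proxy fell while pass_rate did not rise" is used as a stand-in signal for a refusal-driven drop, and which case-id prefixes count as hard-refusal or regulated-domain is left to the caller.

```python
def review(
    before: dict[str, float],
    after: dict[str, float],
    regressed: list[str],
    high_risk_prefixes: tuple[str, ...],
) -> dict[str, object]:
    """Apply the three reading-guide rules to a before/after comparison."""
    pass_rate_up = after["pass_rate"] > before["pass_rate"]
    restriction_held = (
        after["restriction_compliance"] >= before["restriction_compliance"]
    )
    hallucination_down = (
        after["hallucination_proxy"] < before["hallucination_proxy"]
    )
    return {
        # Rule 1: pass_rate rises and restriction_compliance does not fall.
        "real_improvement": pass_rate_up and restriction_held,
        # Rule 2 (heuristic): a lower hallucination_proxy without a higher
        # pass_rate deserves a manual look at refusals.
        "suspicious_hallucination_drop": hallucination_down and not pass_rate_up,
        # Rule 3: regressions in caller-designated high-risk groups.
        "high_risk_regressions": [
            case_id for case_id in regressed
            if case_id.startswith(high_risk_prefixes)
        ],
    }


# Metric values and regressed case ids copied from this report.
BEFORE = {"pass_rate": 0.750, "restriction_compliance": 0.750,
          "hallucination_proxy": 0.500}
AFTER = {"pass_rate": 0.562, "restriction_compliance": 0.562,
         "hallucination_proxy": 0.312}
REGRESSED = [
    "matrix::boundary::gaming-infra",
    "matrix::boundary::movies-sre",
    "matrix::persona::challenge-weak-plan",
]

if __name__ == "__main__":
    # Assumption: boundary and regulated cases are the high-risk groups.
    print(review(BEFORE, AFTER, REGRESSED,
                 ("matrix::boundary::", "matrix::regulated::")))
```

With this report's numbers the check returns no real improvement and flags the hallucination_proxy drop for review; under the assumed prefixes, the two boundary regressions are listed as high risk.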