v0.2 Agent-Quality Eval Harness — design sketch

Status: design sketch (2026-05-05). Not yet implemented; this document is the input to a v0.2.0 design discussion.

Why this exists: v0.1.5–0.1.9 each surfaced bugs that fall into three buckets. The E2E smoke CI gate (v0.1.10) catches cross-component bugs automatically. The per-fix integration test discipline (CLAUDE.md, CONTRIBUTING.md) catches unit/integration bugs through process. Neither catches agent-quality bugs. Those need a different testing model: a corpus with ground-truth labels, agent runs measured against the labels, metrics tracked over time.

This document sketches that harness.


What "agent-quality bugs" look like

Concrete cases from v0.1.5–0.1.9 shakedowns that the existing test infrastructure does NOT catch:

  • KSI-SVC-PRR drift across releases. v0.1.7 over-classified as partial when the agent reached too liberally; v0.1.8 swung to evidence_layer_inapplicable (under-classified, treating a CMK-backed S3 fixture as procedural-only). Same KSI, different miss, both wrong vs. ground truth.
  • KSI-SCR-MIT "mixed" reasoning bug. v0.1.7 agent rationale literally said "release-deploy and security-scan workflows have 'mixed' pin_state," then concluded not_implemented. Reasoning bug — "mixed" should map to partial. Fixed in v0.1.8 via prompt nudge but no metric tracked the regression risk.
  • F5 manifest cross-wiring. v0.1.8 KSI-AFR-FSI narrative quoted PagerDuty (which belongs to KSI-INR-RIR's manifest); KSI-INR-RIR did NOT quote PagerDuty even though that's its own manifest. Cross-wiring of which manifest belongs to which KSI in narrative drafting.
  • sha256-only citation pollution. 30 of 60 v0.1.8 narratives still cite evidence by sha256:abc12345... inline, despite resource_name being available in the cited evidence's content. Reviewability issue with no metric.
  • Bucket-by-bucket variance not surfaced in rationales. F1 in v0.1.8: agent classified KSI-SVC-VRI / KSI-CNA-MAT as partial (correct verdict) but never named individual S3 buckets in rationale (qualitative gap; reviewer can't tell which buckets are non-compliant).

What unifies these: the verdict-vs-ground-truth is wrong, OR the verdict is right but the rationale is unreviewable. Neither dimension is measured today. Every prompt change risks regressing one or both, silently.


What the harness does

                  ┌────────────────────────┐
                  │  corpus/<fixture>/     │
                  │    infra/terraform/    │
                  │    .github/workflows/  │
                  │    .efterlev/manifests/│
                  │    GROUND_TRUTH.yaml   │  ← human-authored labels
                  └─────────────┬──────────┘
                         ┌──────▼───────┐
                         │ harness      │
                         │ runs Efterlev│
                         │ pipeline     │
                         └──────┬───────┘
                  ┌────────────────────────┐
                  │ run-output/<ts>/       │
                  │   gap-report.json      │
                  │   documentation.json   │
                  │   poam.md              │
                  └─────────────┬──────────┘
                         ┌──────▼──────┐
                         │ metrics     │
                         │ computer    │
                         └──────┬──────┘
                  ┌────────────────────────┐
                  │ metrics/<ts>.json      │
                  │ + delta vs prior run   │
                  └────────────────────────┘

Per-fixture, the harness:

  1. Wipes any prior .efterlev/ workspace.
  2. Runs efterlev init → boundary set → scan → agent gap → agent document → poam.
  3. Loads the fixture's GROUND_TRUTH.yaml and the run's emitted reports.
  4. Computes metrics (defined below).
  5. Emits a metrics record + a delta vs the prior recorded run on that fixture.

Across the corpus, the harness aggregates:

  • Mean / per-fixture metrics (precision, recall, etc.)
  • Confusion matrix (expected status × actual status)
  • Per-KSI quality scores
  • Trend lines over the last N runs
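
The confusion-matrix aggregate can be sketched as a counter over (expected, actual) status pairs. This is a minimal sketch, not the harness's implementation: the function name and the "missing" placeholder for KSIs absent from a run are assumptions, and the inputs are assumed to be plain ksi_id → status mappings already extracted from GROUND_TRUTH.yaml and gap-report.json.

```python
from collections import Counter


def confusion_matrix(expected: dict[str, str], actual: dict[str, str]) -> Counter:
    """Count (expected_status, actual_status) pairs over the labelled KSIs."""
    matrix: Counter = Counter()
    for ksi_id, expected_status in expected.items():
        # Assumption: a labelled KSI missing from the run lands in a "missing" column.
        actual_status = actual.get(ksi_id, "missing")
        matrix[(expected_status, actual_status)] += 1
    return matrix


# Illustrative data only, shaped like the v0.1.7/v0.1.8 misses described earlier.
expected = {"KSI-SVC-PRR": "partial", "KSI-SCR-MIT": "partial", "KSI-SVC-VRI": "partial"}
actual = {
    "KSI-SVC-PRR": "evidence_layer_inapplicable",
    "KSI-SCR-MIT": "not_implemented",
    "KSI-SVC-VRI": "partial",
}
matrix = confusion_matrix(expected, actual)
for (exp, act), count in sorted(matrix.items()):
    print(f"{exp:>10} -> {act:<28} {count}")
```

Summing the same counter across fixtures would give the corpus-level matrix.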


Corpus structure

eval-corpus/ (separate repo OR an evals/ subdir under Efterlev; design decision below).

eval-corpus/
  fixtures/
    govnotes-v1/                  # The current Tier 1 fixture (synthetic mid-journey)
      infra/terraform/...
      .github/workflows/...
      .efterlev/manifests/...
      GROUND_TRUTH.yaml
    aws-iam-canonical/            # AWS IAM canonical patterns (sanitized)
      ...
    eks-private-cluster/          # EKS pattern with module composition
      ...
    encryption-mixed/             # 6 buckets, varied encryption posture
      ...
    iam-dynamic-policy/           # aws_iam_policy_document data sources
      ...
    boundary-stale-evidence/      # post-boundary-set re-scan scenario
      ...
  metrics/
    <fixture>/<ts>.json           # Per-run metrics record
  README.md                       # Maintainer guidance
  GROUND_TRUTH_FORMAT.md          # YAML schema docs

GROUND_TRUTH.yaml schema (per-fixture):

# eval-corpus/fixtures/govnotes-v1/GROUND_TRUTH.yaml
fixture_id: govnotes-v1
description: |
  Synthetic mid-journey FedRAMP boundary fixture; deliberate gaps at
  bucket-by-bucket variance, mixed log-group retention, dev_sandbox
  scoped out-of-boundary.
authored_by: maintainer@efterlev.com
authored_at: 2026-04-30
revision: 3

# Expected KSI classifications. The agent's gap report's
# ksi_classifications[].status is compared against this.
expected_classifications:
  KSI-SVC-VRI: partial
  KSI-SVC-PRR: partial            # CMK posture across 6 buckets
  KSI-CNA-MAT: partial             # Public-access blocks mixed
  KSI-MLA-LET: partial             # Log retention variance
  KSI-IAM-MFA: partial             # Some policies enforce, some don't
  KSI-SCR-MON: partial             # SBOM via syft, no CVE scan
  KSI-SCR-MIT: partial             # Mixed action pin posture
  KSI-CMT-RMV: partial             # terraform apply + aws s3 sync mix
  # ... (full 60-KSI matrix)

# Expected resource-name mentions in rationales (for narrative-quality
# metric). Each rationale for the listed KSI MUST mention at least one
# of the named resources, with an exact-string match.
expected_rationale_resources:
  KSI-SVC-VRI:
    - app_uploads          # Should be cited for fully-compliant
    - legacy_export        # Should be cited as a gap
    - temp_data_pipeline   # Should be cited as the worst gap
  KSI-MLA-LET:
    - vpc_flow_logs
    - experiments
    - integrations

# Expected manifest-narrative quoting (per-procedural-KSI). The doc
# agent's narrative for the listed KSI MUST quote at least one
# substring from the named manifest's `statement:` field.
expected_manifest_quoting:
  KSI-AFR-FSI:
    - "security@govnotes.fed"      # PagerDuty belongs to KSI-INR-RIR's manifest (see F5)
  KSI-CED-RGT:
    - "KnowBe4"
  KSI-INR-RIR:
    - "PagerDuty"

# POAM scope expectations.
expected_poam:
  excluded_count_min: 1            # At least N items excluded as out_of_boundary
  excluded_count_max: 6
  must_not_mention:
    - dev_scratch                  # OOB resources should NOT appear in POAM rationale

Metrics

Five metrics, each producing a 0–1 score. Track per-fixture and aggregate across the corpus.

M1. Status precision (per-KSI)

hits / (hits + over_classifications). A "hit" is expected_classifications[ksi_id] == actual.status. An over-classification is expected != actual where the actual is more positive (implemented when expected is partial, etc.).

M2. Status recall (per-KSI)

hits / (hits + under_classifications). An under-classification is expected != actual where the actual is more negative (evidence_layer_inapplicable when expected is partial, etc.). KSI-SVC-PRR's v0.1.7→v0.1.8 drift would surface as a recall regression.
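
M1 and M2 both need a notion of "more positive" and "more negative", so a sketch has to assume a total ordering over statuses. The ordering in STATUS_RANK below is an assumption: the definitions above only establish that implemented is more positive than partial and that evidence_layer_inapplicable is more negative than partial. The function name and the choice to return 1.0 on an empty denominator are also assumptions.

```python
# Assumed ordering, most negative to most positive.
STATUS_RANK = {
    "evidence_layer_inapplicable": 0,
    "not_implemented": 1,
    "partial": 2,
    "implemented": 3,
}


def status_precision_recall(expected: dict[str, str], actual: dict[str, str]) -> tuple[float, float]:
    """Return (M1 precision, M2 recall) over the labelled KSIs.

    Raises KeyError if the run omitted a labelled KSI or used an unknown status.
    """
    hits = over = under = 0
    for ksi_id, expected_status in expected.items():
        diff = STATUS_RANK[actual[ksi_id]] - STATUS_RANK[expected_status]
        if diff == 0:
            hits += 1
        elif diff > 0:
            over += 1  # actual is more positive than expected
        else:
            under += 1  # actual is more negative than expected
    # Assumption: with nothing to measure, report a perfect score rather than divide by zero.
    precision = hits / (hits + over) if (hits + over) else 1.0
    recall = hits / (hits + under) if (hits + under) else 1.0
    return precision, recall


expected = {
    "KSI-SVC-PRR": "partial",
    "KSI-SCR-MIT": "partial",
    "KSI-SVC-VRI": "partial",
    "KSI-IAM-MFA": "partial",
}
actual = {
    "KSI-SVC-PRR": "evidence_layer_inapplicable",  # under-classified
    "KSI-SCR-MIT": "not_implemented",              # under-classified
    "KSI-SVC-VRI": "partial",                      # hit
    "KSI-IAM-MFA": "implemented",                  # over-classified
}
precision, recall = status_precision_recall(expected, actual)
print(f"M1 precision={precision:.2f}  M2 recall={recall:.2f}")
```

With one hit, one over-classification, and two under-classifications, this illustrative run scores 1/2 on precision and 1/3 on recall.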

M3. Resource-naming rate (narrative quality)

rationales_naming_at_least_one_expected_resource / rationales_with_expected_resources. Pre-v0.1.10 baseline: 30/60 narratives still embed sha256:abc12345... inline. Target: ≥80%.

M4. Manifest-quoting accuracy

narratives_correctly_quoting_their_manifest / narratives_with_expected_quoting. Catches the F5 cross-wiring bug (v0.1.8 KSI-AFR-FSI narrative quoted PagerDuty when it should have quoted security@govnotes.fed only).
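
M3 and M4 share a shape: for each labelled KSI, does the text contain at least one expected substring? A minimal sketch under that reading follows. The function name and the input shape (ksi_id → rationale or narrative text) are assumptions, and the sample texts are invented for illustration.

```python
def mention_rate(texts: dict[str, str], expected: dict[str, list[str]]) -> float:
    """Fraction of labelled KSIs whose text contains at least one expected substring."""
    if not expected:
        return 1.0  # Assumption: nothing labelled means nothing to fail.
    passed = sum(
        1
        for ksi_id, needles in expected.items()
        # Exact-string match; a KSI with no text at all counts as a miss.
        if any(needle in texts.get(ksi_id, "") for needle in needles)
    )
    return passed / len(expected)


# M3: resource names expected in gap-report rationales.
rationales = {
    "KSI-SVC-VRI": "app_uploads is encrypted with a CMK; legacy_export is not.",
    "KSI-MLA-LET": "Evidence sha256:abc12345... shows mixed retention.",
}
expected_rationale_resources = {
    "KSI-SVC-VRI": ["app_uploads", "legacy_export", "temp_data_pipeline"],
    "KSI-MLA-LET": ["vpc_flow_logs", "experiments", "integrations"],
}
m3 = mention_rate(rationales, expected_rationale_resources)

# M4: manifest substrings expected in documentation narratives.
narratives = {
    "KSI-CED-RGT": "Annual training is delivered through KnowBe4.",
    "KSI-INR-RIR": "Incidents are routed to the on-call rotation.",
}
expected_manifest_quoting = {"KSI-CED-RGT": ["KnowBe4"], "KSI-INR-RIR": ["PagerDuty"]}
m4 = mention_rate(narratives, expected_manifest_quoting)

print(f"M3={m3:.2f}  M4={m4:.2f}")
```

Read this way, the check only confirms that a narrative quotes its own manifest. Flagging a quote borrowed from another KSI's manifest would likely need a second, negative check.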

M5. POAM scope discipline

Two binary checks, averaged so the score stays in 0–1 like the other metrics: (1) no expected_poam.must_not_mention substring appears in any POAM rationale; (2) expected_poam.excluded_count_min ≤ actual excluded count ≤ expected_poam.excluded_count_max. Catches boundary-leak regressions.
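
A sketch of M5, assuming the leak check and the excluded-count check are averaged so the result stays on the same 0–1 scale as M1–M4. The function name is hypothetical, and so are the inputs: how POAM rationales and the excluded count get extracted from poam.md is not specified here.

```python
def poam_scope_score(
    poam_rationales: list[str],
    excluded_count: int,
    must_not_mention: list[str],
    count_min: int,
    count_max: int,
) -> float:
    """Average of two binary checks: no out-of-boundary leak, excluded count in range."""
    no_leak = not any(
        term in rationale for rationale in poam_rationales for term in must_not_mention
    )
    count_ok = count_min <= excluded_count <= count_max
    # Assumption: equal weighting of the two checks, giving 0, 0.5, or 1.
    return (int(no_leak) + int(count_ok)) / 2


# Illustrative run: the excluded count is in range, but an out-of-boundary name leaked.
m5 = poam_scope_score(
    poam_rationales=[
        "Enable CMK encryption on legacy_export.",
        "dev_scratch bucket lacks a public-access block.",
    ],
    excluded_count=2,
    must_not_mention=["dev_scratch"],
    count_min=1,
    count_max=6,
)
print(f"M5={m5}")
```
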

Composite score: mean(M1, M2, M3, M4, M5). Per-run; per-fixture; aggregate. A regression on any single metric of >5% from the prior run blocks the merge.
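
The composite and the merge gate can be sketched as follows. Two assumptions are baked in and would need confirming: ">5%" is read as an absolute drop of more than 0.05 on the 0–1 scale, and metrics records are flat mappings keyed "M1" through "M5". Function names are hypothetical.

```python
from statistics import mean

METRIC_NAMES = ("M1", "M2", "M3", "M4", "M5")


def composite(metrics: dict[str, float]) -> float:
    """Unweighted mean of the five metrics."""
    return mean(metrics[name] for name in METRIC_NAMES)


def regressions(
    prior: dict[str, float], current: dict[str, float], threshold: float = 0.05
) -> dict[str, float]:
    """Return {metric: delta} for every metric that dropped by more than `threshold`."""
    deltas = {name: current[name] - prior[name] for name in METRIC_NAMES}
    return {name: delta for name, delta in deltas.items() if delta < -threshold}


# Illustrative records, not measured data.
prior = {"M1": 0.9, "M2": 0.9, "M3": 0.5, "M4": 0.8, "M5": 1.0}
current = {"M1": 0.9, "M2": 0.7, "M3": 0.6, "M4": 0.78, "M5": 1.0}

blocked = regressions(prior, current)
print(f"composite: {composite(prior):.3f} -> {composite(current):.3f}")
print("blocks merge:" if blocked else "ok:", sorted(blocked))
```

Here M2 drops by about 0.2 and trips the gate, while the smaller M4 dip stays under the threshold. A noise-aware version would widen the threshold per metric.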


Implementation surface

evals/                                    # (or eval-corpus/ as a sibling repo)
  cli.py                                  # `python -m evals run --corpus eval-corpus`
  metrics.py                              # M1-M5 implementations
  ground_truth.py                         # YAML loader + Pydantic models
  diff.py                                 # Run-vs-prior delta
  report.py                               # Markdown / HTML rendering for dashboards
tests/test_evals_*.py                     # Unit tests for metric functions
.github/workflows/eval-quality.yml        # Scheduled (daily?) run + dashboard upload

Existing infrastructure to reuse:

  • scripts/e2e_smoke.py: already lays down a synthetic fixture and runs the pipeline. Generalize: take a fixture path and load it instead of the hardcoded fixture.
  • tests/test_e2e_smoke.py: pytest wrapper pattern; eval CI uses the same shape.
  • efterlev provenance verify: sanity-check the run's store before computing metrics.

Net-new code: ~600 LOC (corpus loader + metric functions + diff + report). Plus ~6-10 fixtures (~2-3 hours each to author + ground-truth-label).


Cost model

Per-run cost (one corpus pass):

  • Sonnet 4.6: ~$0.30/fixture × 8 fixtures = ~$2.40 per pass.
  • Opus 4.7: ~$2/fixture × 8 fixtures = ~$16 per pass. Run weekly only.

Schedule:

  • Pre-merge gate on agent-prompt-touching PRs: Sonnet only (~$2.40/PR). The gate could be more selective still: only PRs that touch src/efterlev/agents/*_prompt.md or change agent class definitions.
  • Weekly full pass (Sonnet + Opus): ~$18/week.
  • Trend analysis quarterly.

Total monthly cost at 5 prompt-touching PRs/week plus the weekly full pass: 5 × $2.40 + ~$18 ≈ $30/week, or roughly $120-130/month before manual debugging runs. Material but not crippling.


What's needed before we can build this

The blockers are corpus, not code:

  1. Sanitized real-customer Terraform. govnotes is synthetic. Sanitized real Terraform from a friendly customer (or a foundation reference architecture that genuinely hits the patterns we care about) would be ~5x more useful than synthetic. Want at least 2-3 real-shaped fixtures.
  2. Ground-truth labeler with FedRAMP expertise. The maintainer can label ~80% of cases; the remaining 20% need someone who's done a 3PAO walkthrough or an ATO submission. The eval is only as good as its labels.
  3. A baseline. Before any prompt change, capture metrics on the current fixture set. Becomes the regression floor.

These are sequenceable:

  • Phase 1 (week of v0.2.0 cut): ship the harness + 3 synthetic fixtures (govnotes-v1, encryption-mixed, iam-dynamic-policy). Baseline-from-zero. Internal-only. Status: shipped at v0.1.43 and extended in subsequent releases to 5 synthetic fixtures (added runtime-monitoring + serverless-mixed).
  • Phase 2 (during v0.2.x): add 2-3 sanitized real fixtures as customer relationships permit. Solicit ground-truth labeling help. Status: shipped as "Phase 2 lite" in v0.1.63-v0.1.67 with three vendored real-shape fixtures from terraform-aws-modules/* (vpc, s3-bucket, lambda) per the 2026-05-11 strategic-arc decision (memory: project_efterlev_strategic_arc_2026-05.md). The "lite" framing reflects the path picked: public open-source corpora hand-graded by the maintainer (vs design-partner real-customer boundaries). 67 conservative Claude-drafted labels have accumulated; a maintainer review pass with ANTHROPIC_API_KEY available is the next step.
  • Phase 3 (v0.3.0): pre-merge eval gate becomes required for prompt-touching PRs. Public dashboard. Status: still deferred, pending Phase 2 lite labeled-trend data accumulating across enough runs to establish a real noise floor.


Decision on corpus location

Two options:

(a) evals/ inside the Efterlev repo. Pros: single-source-of-truth; reviewers see eval changes in the same PR as prompt changes. Cons: corpus could grow large; PRs that don't touch evals carry unnecessary diff weight.

(b) eval-corpus as a sibling repo. Pros: clean separation; eval contributors don't need full-Efterlev clone. Cons: cross-repo PRs are harder; risk of corpus drifting from prompts.

Recommendation: (a), with a .gitattributes rule marking evals/fixtures/** as linguist-vendored=true (so GitHub excludes it from language stats) and evals/metrics/** in .gitignore (the metrics are runtime artifacts, not source). Single-source-of-truth wins for v0.2.


Open questions

  1. Determinism. Even on Sonnet 4.6, two consecutive runs against the same fixture can produce different rationales. How tight should the metric tolerance be? A ±5% per-metric per-run noise floor seems realistic; the >5% regression block needs to factor that in.
  2. Multi-model coverage. Should the eval run against Sonnet AND Opus? Different models will produce different rationale styles; baseline-per-model. Cost doubles.
  3. Ground truth for "evidence_layer_inapplicable." This status is the agent's "I can't see this from IaC" answer. Hard to label in YAML — the right answer depends on whether the workspace has manifests for the procedural KSIs. Maybe expected_classifications carries <status> OR <status>|<status> to express acceptable alternatives.
  4. Versioning. When FRMR catalog updates (KSI ids change, statuses redefined), the fixtures' ground-truth needs migration. Codify the migration process.
  5. Privacy posture for sanitized real fixtures. What's the redaction protocol? Account IDs, IPs, resource names tied to customer brand — all need scrubbing.
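
For open question 3, the `<status>|<status>` idea could be handled with a small parser so that a hit means "actual is one of the acceptable statuses". This is a sketch of that option only; the function names are hypothetical and the syntax itself is still undecided.

```python
def parse_expected(value: str) -> frozenset[str]:
    """Split 'partial|evidence_layer_inapplicable' into a set of acceptable statuses."""
    return frozenset(part.strip() for part in value.split("|") if part.strip())


def is_hit(expected_value: str, actual_status: str) -> bool:
    """A hit is any actual status listed among the acceptable alternatives."""
    return actual_status in parse_expected(expected_value)


# A single status behaves exactly as it does today.
print(is_hit("partial", "partial"))
# Alternatives let a label accept either answer when manifests may be absent.
print(is_hit("partial | evidence_layer_inapplicable", "evidence_layer_inapplicable"))
print(is_hit("partial|evidence_layer_inapplicable", "implemented"))
```

One consequence to settle before adopting this: with several acceptable statuses, M1/M2 would need a rule for whether a miss counts as over- or under-classification.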

Scope checkpoint

This document is not a v0.1.x deliverable. It's an input to the v0.2.0 planning conversation.

v0.1.x finishes when the manual-loop-per-patch cycle stops surfacing release-blocker bugs. Based on v0.1.7-0.1.9 trend (release-blockers: 1 → 0 → 0), we're close. Maybe v0.1.10 + v0.1.11 + the E2E CI gate land us there.

v0.2.0 starts when we're ready to invest in agent quality. The eval harness is the first thing v0.2 needs.


References

  • v0.1.5-0.1.9 shakedown reports (in conversation history; not committed).
  • tests/test_e2e_smoke.py — pattern for harness invocation.
  • scripts/e2e_smoke.py — full-pipeline driver to generalize.
  • DECISIONS 2026-05-04 (out-of-boundary evidence visibility) and 2026-05-05 (manifests in boundary, narrative variance signal) — semantics the eval harness must respect.
  • docs/dual_horizon_plan.md — broader strategic context.