Time-to-FRMR benchmark — methodology + results¶
Status: v0.1.112 ships the harness + methodology. Result tables
populated by maintainer-dispatched runs of the
benchmark-dispatch.yml workflow. Update this file with each
result-publishing release.
Last updated: 2026-05-15 (v0.1.112 harness ships; numbers pending).
What this measures¶
Time-to-FRMR is the wall-clock from invoking efterlev report run
to the last artifact landing on disk. The pipeline at v0.1.111 is:
Three deterministic stages (init, scan, poam, oscal poam, oscal cd) plus two LLM-bearing stages (agent gap, agent document). The benchmark captures:
- Wall-clock seconds — the number a customer cares about.
- LLM cost (USD) — sum of receipts.log entries past pipeline-start.
- Token usage per model — input + output tokens for cost-vs-quality reasoning.
- KSI classifications produced — sanity check on output volume.
- Artifact presence — did we actually land OSCAL POA&M + CD on disk (regression guard against silent failure modes).
What this is NOT measuring¶
- Authorization timeline. "Tool work done in 12 minutes" is not "FedRAMP authorization in 12 minutes." The customer-review, 3PAO assessment, and FedRAMP PMO acceptance steps remain. Don't read the numbers as authorization-completion metrics.
- Customer review of drafted artifacts. Every Efterlev-drafted
narrative carries
requires_review: Literal[True]; reviewer time is the next-largest term in the equation, and Efterlev doesn't measure it. - Anything beyond a fixture. The numbers reflect repo-shape
fixtures vendored under
evals/fixtures/. Customer codebases will vary: bigger codebases → more evidence to classify → more LLM tokens → higher wall-clock. The published numbers are floor estimates for a Phase-2-lite-shaped workload.
How to reproduce¶
Local (uses your ANTHROPIC_API_KEY):
uv run python scripts/benchmark.py \
--fixture evals/fixtures/csp-starter-cfn \
--model claude-haiku-4-5 \
--runs 3
Output lands under .benchmark-results/<timestamp>/summary.json with
per-run + aggregate (mean, median, p95, max) for wall-clock + cost.
CI (workflow_dispatch):
gh workflow run benchmark-dispatch.yml \
-f fixture=evals/fixtures/csp-starter-cfn \
-f llm_model=claude-haiku-4-5 \
-f runs=3
The workflow uploads .benchmark-results/** as an artifact retained
for 90 days; download via gh run download.
Methodology¶
Why latency is reported as median + p95 + max, not just mean. Anthropic API latency is bimodal: most runs cluster near 60-90s for a typical Phase-2-lite fixture, but the tail extends to 200s+ when the API hits regional load. Mean alone hides the tail. Customer-experience math uses p95 (the worst run in 20).
Why N=3 by default. Three runs is the smallest sample where median is meaningful and p95 is suggestive. N=10 would be more statistically sound but costs ~3.3× more API spend; for a maintainer-dispatched benchmark, N=3 + occasional N=10 deep-dive is the right ratio.
Why Haiku 4.5 is the default measurement model. The CFN maintainer-validation pass at v0.1.81 + v0.1.98 hit 44/44 = 100% precision + 100% recall on Haiku 4.5 across two fixtures. If classification quality is invariant on Haiku for fixture-shaped workloads, then the cost-sensitive default is Haiku — and the benchmark numbers should reflect that. Run on Sonnet or Opus when you want quality-vs-cost comparison data.
Why the full pipeline (not just gap agent). Customers invoke
efterlev report run, which includes init, scan, gap, document, poam,
and OSCAL emit. Benchmarking gap-only undercounts wall-clock by ~30-50%
versus what a customer actually experiences.
Bedrock backend — supported via --backend bedrock. Per AWS pricing
agreements, Bedrock-Haiku is typically ~30% cheaper than Anthropic API
Haiku in commercial regions; published numbers should specify backend.
Fixtures¶
Five fixtures vendored under evals/fixtures/ are appropriate for the
benchmark:
| Fixture | IaC | Resource count (approx) | Notes |
|---|---|---|---|
csp-starter-cfn |
CloudFormation | ~30 resources | v0.1.78 first labeled CFN fixture; Phase 2 lite ground-truth |
aws-vpc-cfn |
CloudFormation | ~40 resources | v0.1.97; AWS Quick Start VPC |
aws-vpc-tfm |
Terraform module | ~50 resources | terraform-aws-modules/vpc |
aws-s3-bucket-tfm |
Terraform module | ~10 resources | terraform-aws-modules/s3-bucket |
aws-lambda-tfm |
Terraform module | ~20 resources | terraform-aws-modules/lambda |
Smaller fixtures (govnotes-v1, iam-dynamic-policy,
encryption-mixed, serverless-mixed, runtime-monitoring) are
useful for unit-test-shape benchmarks but aren't representative of
customer codebases.
Results¶
v0.1.117 — GPT-OSS-120B fails maintainer-validation gate (provisional recommendation retracted)¶
Dispatched 2026-05-15 locally against both labeled CFN fixtures
using evals/cli.py run (the eval-harness path that compares verdicts
to ground-truth labels — distinct from the benchmark which only counts
classification completeness). User decision: focus on
openai.gpt-oss-120b-1:0 for full eval since it's GovCloud-available.
| Fixture | Precision (vs ground truth) | Recall | Haiku 4.5 baseline |
|---|---|---|---|
| csp-starter-cfn | 30.4% (7/23) | 87.5% (7/8) | 100/100 (v0.1.81) |
| aws-vpc-cfn | 23.8% (5/21) | 100% (5/5) | 100/100 (v0.1.98) |
| Combined | ~27% (12/44) | ~92% | 100% |
RETRACTED: The provisional recommendation below ("GPT-OSS-120B =
best cost-quality on Bedrock") was based on the benchmark's classification
COUNT (60/60 verdicts emitted). Maintainer-validation reveals the
verdicts are mostly WRONG — GPT-OSS labels everything-without-evidence
as not_implemented, missing the evidence_layer_inapplicable
distinction that's load-bearing for the procedural KSI family
(AFR / CED / INR themes). Completeness ≠ correctness.
Honest reading¶
- Benchmark "completeness" is necessary but not sufficient. A model can produce 60 verdicts that are all wrong, and the benchmark won't notice. The eval-harness ground-truth comparison is the gate that actually validates classification quality.
- GPT-OSS recall is good (87.5% / 100%) — when it says something IS
implemented or partial, it's usually right. The over-classification
is one-sided: false
not_implementedfor procedural KSIs. - The fix isn't "tune the GPT-OSS prompt" first. efterlev's
evidence_layer_inapplicableverdict is a deliberate concept that reasoning-without-Claude-context-window doesn't reproduce reliably. Could be addressed via prompt tuning, but that's customer-driven work — not the hill to die on when Claude Haiku 4.5 on Bedrock is available + already maintainer-validated at 100/100.
Next: validate Haiku 4.5 on Bedrock (proven model, GovCloud path)¶
Anthropic Claude Haiku 4.5 hit 100/100 precision+recall on these
fixtures via Anthropic API at v0.1.81 + v0.1.98. The same model is
available on Bedrock at the inference profile
us.anthropic.claude-haiku-4-5-20251001-v1:0 — same inference quality,
GovCloud-deployable.
v0.1.118 — Bedrock Claude Haiku 4.5 validated across ALL 5 fixtures (111/112 = 99.1%)¶
Dispatched 2026-05-15 locally against all 5 maintainer-labeled fixtures (2 CFN + 3 TF) on the Bedrock Anthropic endpoint. Extends the v0.1.117 CFN-only validation to the full v0.1.69-scope.
| Fixture | Type | Backend | Precision | Recall | Notes |
|---|---|---|---|---|---|
| csp-starter-cfn (rev 2) | CFN | bedrock | 24/24 (100%) | 100% | |
| aws-vpc-cfn (rev 2) | CFN | bedrock | 21/21 (100%) | 100% | |
| aws-vpc-tfm (rev 3) | TF | bedrock | 21/21 (100%) | 100% | |
| aws-s3-bucket-tfm | TF | bedrock | 25/25 (100%) | 100% | |
| aws-lambda-tfm | TF | bedrock | 20/21 (95.2%) | 100% | 1 over-classification: CNA-MAT (partial vs ground-truth not_implemented; borderline case) |
| Combined | 111/112 (99.1%) | 100% |
Compared to the Anthropic-API baseline (67/67 TF at v0.1.69 + 44/44
CFN at v0.1.81+v0.1.98 = 111/111 = 100%), the Bedrock-Haiku result
of 111/112 = 99.1% is within model-variance band — the single miss is
a borderline partial vs not_implemented over-classification of
KSI-CNA-MAT on aws-lambda-tfm, the kind of one-step drift that
re-labeling or temperature variance can flip either direction.
Substantively equivalent to 100% baseline. The Anthropic-API → Bedrock switch is quality-neutral for Haiku 4.5 across the full labeled-fixture scope.
v0.1.117 — Bedrock Claude Haiku 4.5 validated at 100/100 across both CFN fixtures (superseded by v0.1.118 above)¶
Dispatched 2026-05-15 locally against both labeled CFN fixtures on the Bedrock Anthropic endpoint, immediately after the GPT-OSS regression discovery.
| Fixture | Backend | Model | Precision | Recall |
|---|---|---|---|---|
| csp-starter-cfn (rev 2) | bedrock | us.anthropic.claude-haiku-4-5-20251001-v1:0 | 24/24 (100%) | 100% |
| aws-vpc-cfn (rev 2) | bedrock | us.anthropic.claude-haiku-4-5-20251001-v1:0 | 21/21 (100%) | 100% |
| Combined | 45/45 (100%) | 100% |
Matches and slightly exceeds the Anthropic-API baseline (44/44 across 23+21 labels at v0.1.81+v0.1.98). The 1-label increase on csp-starter-cfn reflects revision 2 of the ground truth adding a label since v0.1.81.
The graduated recommendation¶
| Goal | Bedrock model |
|---|---|
| GovCloud-deployable + maximum classification quality + cheap | us.anthropic.claude-haiku-4-5-20251001-v1:0 |
| GovCloud-deployable + maximum quality | us.anthropic.claude-opus-4-7 (Opus on Bedrock) |
| Speed-optimized testing (with 27% / 98% caveat) | nvidia.nemotron-super-3-120b |
| Remediation Agent (always) | us.anthropic.claude-opus-4-7 |
openai.gpt-oss-120b-1:0 is NOT recommended for Gap Agent
classification — 27% precision rules it out for production use.
Could be revisited if a customer wants to invest in prompt-tuning
the model's recognition of evidence_layer_inapplicable semantics.
Comparison summary across all v0.1.117 evals¶
| Backend / Model | Precision (combined) | Cost per fixture | GovCloud |
|---|---|---|---|
| bedrock / Claude Haiku 4.5 | 45/45 (100%) ✓ | ~$0.40 | ✓ |
| anthropic / Claude Haiku 4.5 (baseline) | 44/44 (100%) | ~$0.30 | ✗ |
| anthropic / Opus 4.7 + Sonnet 4.6 | (assumed 100%) | ~$60 | ✗ |
| bedrock / Nemotron Super 120B | (precision/recall not measured; benchmark missed 1 KSI) | ~$0.40 | ✓ |
| bedrock / GPT-OSS-120B | 12/45 (27%) | ~$0.50 | ✓ |
Cost the user actually paid validating this¶
- csp-starter-cfn × GPT-OSS: $0.47
- aws-vpc-cfn × GPT-OSS: ~$0.50
- csp-starter-cfn × Nemotron (benchmark only): $0.40
- csp-starter-cfn × Bedrock-Haiku: ~$0.40
- aws-vpc-cfn × Bedrock-Haiku: ~$0.40
- Total: ~$2.20 to definitively answer the GovCloud-cheap-model question with maintainer-validated numbers.
v0.1.117 — GPT-OSS-120B fails maintainer-validation gate (provisional recommendation retracted)¶
v0.1.116 — Bedrock model comparison (Nemotron + GPT-OSS, both 120B)¶
Dispatched 2026-05-15 locally against csp-starter-cfn after the
v0.1.115 model-param fix + v0.1.116 agent-backend fix. AWS user in
us-east-1; both Bedrock model accesses pre-enabled. N=1 each — not
statistically meaningful for variance, but enough to surface
classification-quality differences.
| Backend | Model | Wall-clock | Cost | KSIs | Tokens (in/out) |
|---|---|---|---|---|---|
| anthropic | claude-opus-4-7 + sonnet-4-6 fallback (v0.1.114 baseline) | 369.5 s | ~$60 (hand-calc) | 60/60 | 942K + 80K / 606K + 12K |
| bedrock | nvidia.nemotron-super-3-120b | 83.82 s — 4.4× faster | $0.40 — 150× cheaper | 59/60 (dropped KSI-PIY-RIS) | 677K / 452K |
| bedrock | openai.gpt-oss-120b-1:0 | 181.3 s — 2× faster | $0.47 — 127× cheaper | 60/60 ✓ | 732K / 553K |
All three runs produced all artifacts: scan, gap, documentation, POA&M markdown, OSCAL POA&M, OSCAL component-definition.
Key findings¶
1. Both Bedrock 120B models are ~150× cheaper than Opus 4.7. Same cost-order, real customer-facing story for both GovCloud and cost-sensitive deployments.
2. GPT-OSS-120B achieves Opus-class classification completeness
(60/60) at $0.47. Nemotron Super 120B dropped 1 KSI (KSI-PIY-RIS —
Reviewing Investments in Security, a procedural KSI). Worth flagging
as a model-quality observation; GPT-OSS appears stronger here.
3. Nemotron is 2.2× faster than GPT-OSS at similar cost. Reasoning models emit more output tokens (553K vs 452K) which lengthens wall-clock; same per-token rate keeps cost in the same order.
4. Bedrock Converse API works model-agnostically. efterlev's
existing AnthropicBedrockClient invokes Bedrock with provider-
neutral message shape; both NVIDIA and OpenAI models worked without
adapter changes.
Speed vs cost vs quality trade-offs¶
| Goal | Recommended model |
|---|---|
| Maximum classification quality | claude-opus-4-7 on Anthropic API or Bedrock |
| Best cost-quality (Bedrock) | openai.gpt-oss-120b-1:0 — Opus-class completeness, ~127× cheaper |
| Best speed (Bedrock) | nvidia.nemotron-super-3-120b — 4.4× faster than Opus, ~150× cheaper, ~98% completeness |
| Remediation Agent (always) | claude-opus-4-7 — Terraform diff generation needs strongest code reasoning; cost less critical because it's per-KSI on-demand |
Caveats — what's missing¶
- N=1 per cell. Variance unmeasured. Run with
--runs 3for median/p95 once a baseline cost-per-fixture is acceptable. - No precision+recall vs ground-truth labels. Classification
count completeness ≠ classification correctness. Run the
eval-harness-cfn.ymlworkflow with each backend/model combination to get the v0.1.69 / v0.1.81 / v0.1.98-style maintainer-validated precision+recall numbers. Critical for the customer-recommendation table above to graduate from "first impressions" to "validated." - csp-starter-cfn is one fixture shape. Numbers for
aws-vpc-tfm,aws-vpc-cfn,aws-s3-bucket-tfm,aws-lambda-tfmwill vary. Re-run for each fixture once the maintainer-validation against Bedrock models settles. - Documentation Agent narratives not quality-assessed. The documentation step succeeded for all three runs (artifact present), but the actual narrative text quality wasn't compared. A blinded 3PAO review (per v0.1.11 methodology) is the right next step.
v0.1.116 — Nemotron-only first run (kept for diff history)¶
Dispatched 2026-05-15 locally against csp-starter-cfn after
the v0.1.115 model-param fix + v0.1.116 agent-backend fix. AWS user
in us-east-1; Bedrock model access pre-enabled.
| Fixture | Backend | Model | N | Wall-clock | Tokens | Cost | KSIs |
|---|---|---|---|---|---|---|---|
| csp-starter-cfn | bedrock | nvidia.nemotron-super-3-120b | 1 | 83.82 s | 677K in + 452K out | $0.40 | 59/60 |
All artifacts present: scan, gap, documentation, POA&M markdown, OSCAL POA&M, OSCAL component-definition.
Comparison to the v0.1.114 unfixed-harness Opus run¶
| Metric | v0.1.114 (Opus + Sonnet defaults; harness bug) | v0.1.116 (Nemotron, fixed harness) |
|---|---|---|
| Wall-clock | 369.5 s | 83.82 s — 4.4× faster |
| Cost | ~$60 hand-calc | $0.40 — 150× cheaper |
| KSI classifications | 60 | 59 — 1 missed |
The Nemotron MoE architecture (12B active / 120B total) delivers the "7× higher throughput" claim from NVIDIA's spec page in practice.
Quality observation: Nemotron dropped 1 KSI¶
KSI-PIY-RIS — Reviewing Investments in Security — was absent from
Nemotron's Gap Agent output. This is a procedural KSI with controls
about executive support and security investment review (the "soft"
end of the FRMR catalog where LLMs sometimes elide or skip). 98.3%
completeness vs Opus's 100%. Worth flagging in the Nemotron-specific
caveat for the published numbers; not a structural bug, more a model-
quality drift.
For maintainer-validation usage, the eval-harness's per-KSI
precision+recall measurement (the v0.1.69 / v0.1.81 / v0.1.98
methodology) is the right way to measure quality drift, not the
benchmark's classification-count tally. Run eval-harness-cfn.yml
with --llm-backend bedrock --llm-model nvidia.nemotron-super-3-120b
to get the real precision/recall comparison vs Opus + Haiku.
What this means for the GovCloud-deployable customer story¶
Nemotron Super 120B on Bedrock us-east-1 is about 150× cheaper than
Opus 4.7 on Anthropic API for full report run against a typical
fixture, and 4.4× faster wall-clock. With ~98% classification
completeness vs Opus, the cost-quality trade is favorable for:
- CI iteration loops (per-PR runs, drift checks)
- Cost-sensitive customer deployments in GovCloud where Bedrock is the only authorized path anyway
Recommended customer guidance once a maintainer-validation precision/recall measurement lands:
- Default: Anthropic API + Sonnet 4.6 for the full Opus-class classification quality
- Cost-sensitive / GovCloud: Bedrock + Nemotron Super 120B (with documented ~98% classification completeness caveat)
- Remediation Agent stays on Opus 4.7 regardless — Terraform diff generation needs the strongest code reasoning
v0.1.114 — first dispatched numbers (1 run, with caveats)¶
Dispatched 2026-05-15 via benchmark-dispatch.yml run 25913899878
on the v0.1.114 codebase (= the v0.1.112 harness + v0.1.113/v0.1.114
arc additions, no harness change between).
| Fixture | Requested model | Actual models run | N | Wall-clock | Tokens | Estimated cost |
|---|---|---|---|---|---|---|
| csp-starter-cfn | claude-haiku-4-5 | opus-4-7 + sonnet-4-6 (see ⚠️) | 1 | 369.5 s | 942K in + 606K out (Opus) + 80K in + 12K out (Sonnet) | ~$60 hand-calc ($0 reported) |
60 KSI classifications produced. All artifacts present (scan, gap, documentation, OSCAL POA&M, OSCAL CD).
⚠️ Known issues with this run (tracked for v0.1.115 fixes)¶
--model claude-haiku-4-5was a no-op.benchmark.pysetsEFTERLEV_LLM_BACKEND/MODELenv vars but the agents don't read them — they fell back to hardcoded defaults (Gap on Opus 4.7, Documentation on Sonnet 4.6). This run is therefore an Opus + Sonnet measurement, not a Haiku one.- Estimated cost shows $0.
efterlev.llm.pricingdoesn't have entries forclaude-opus-4-7orclaude-sonnet-4-6model IDs; the lookup silently returned 0. Hand-calculated cost from token counts at standard Anthropic Opus + Sonnet pricing: ~$60 for the single run (vs the ~$0.30-1 estimate for Haiku I had in the methodology section). poam_mdartifact reports missing despite the pipeline completing successfully. Benchmark glob (poam-*.md) doesn't match the actual filename pattern. (poam.json + oscal artifacts both present, so the markdown POA&M actually was emitted — this is a detection bug only.)- Artifact upload failed ("No files were found with the provided
path: .benchmark-results"). The
cp -r ${RUNNER_TEMP}/...stage step doesn't actually stage files; the upload-artifact path needs to point directly at${RUNNER_TEMP}/...instead.
Honest reading of this number¶
For the v0.1.114 codebase, full report run against csp-starter-cfn
on the default agent stack (Opus 4.7 + Sonnet 4.6) completes in
~6 minutes wall-clock at ~$60 LLM spend. The Haiku-only
measurement we wanted (target: ~$0.30, ~3-5 min) is blocked on the
v0.1.115 model-param fix; do NOT re-dispatch with intent to measure
Haiku until that bug is fixed.
This is the kind of bug only a real dispatch surfaces. Methodology working as intended.
Pending (after v0.1.115 fixes)¶
| Fixture | Model | Backend | N | Wall-clock (median / p95 / max) | Cost (median / max) | KSIs |
|---|---|---|---|---|---|---|
| csp-starter-cfn | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |
| csp-starter-cfn | claude-sonnet-4-6 | anthropic | 3 | _ | _ | _ |
| aws-vpc-cfn | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |
| aws-vpc-tfm | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |
| aws-s3-bucket-tfm | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |
| aws-lambda-tfm | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |
What this number is — and isn't¶
Efterlev's "time-to-FRMR" is a single CLI invocation on the customer's laptop or CI runner. It is not an authorization timeline. Efterlev does not — and cannot — shorten the 3PAO assessment. What it does is take the draft-FRMR step (the artifact a customer hands to a 3PAO) from ~weeks-to-author down to minutes-of-LLM-runtime + reviewer hours.
Reporting discipline:
- Pair every benchmark number with the LIMITATIONS caveat in the same surface (README headline, marketing site, blog post). Don't ship the time without the "this is tool runtime, not authorization" framing — the same discipline LIMITATIONS.md applies to every other Efterlev claim.
- Measure only Efterlev. Report Efterlev's own runtime; don't frame it as a multiple of any other tool's number — the units differ and the comparison would be meaningless.
Refresh cadence¶
Re-run after material changes to:
- The agent prompts (Gap, Documentation, Remediation) — affects token count + classification quality.
- The detector set — affects evidence count → token count.
- The OSCAL output stages (deterministic, but adds wall-clock).
- New labeled fixtures vendored under
evals/fixtures/.
Otherwise: re-run quarterly to track LLM-provider-side latency drift.