Time-to-FRMR benchmark — methodology + results

Status: v0.1.112 ships the harness + methodology. Result tables populated by maintainer-dispatched runs of the benchmark-dispatch.yml workflow. Update this file with each result-publishing release.

Last updated: 2026-05-15 (v0.1.112 harness ships; numbers pending).


What this measures

Time-to-FRMR is the wall-clock time from invoking efterlev report run until the last artifact lands on disk. The pipeline at v0.1.111 is:

init → scan → agent gap → agent document → poam → oscal poam → oscal component-definition

Five deterministic stages (init, scan, poam, oscal poam, oscal cd) plus two LLM-bearing stages (agent gap, agent document). The benchmark captures:

  • Wall-clock seconds — the number a customer cares about.
  • LLM cost (USD) — sum of receipts.log entries past pipeline-start.
  • Token usage per model — input + output tokens for cost-vs-quality reasoning.
  • KSI classifications produced — sanity check on output volume.
  • Artifact presence — did we actually land OSCAL POA&M + CD on disk (regression guard against silent failure modes).
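
The capture list above amounts to timing one subprocess and then checking the output directory. A minimal sketch, assuming a hypothetical `timed_run` helper and made-up artifact glob patterns (the real harness lives in scripts/benchmark.py and may work differently). A throwaway Python child process stands in for `efterlev report run` so the example runs anywhere:

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def timed_run(cmd, out_dir, required):
    """Run cmd, measure wall-clock seconds, and check required artifact globs."""
    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    wall_clock_s = time.perf_counter() - start
    # An artifact counts as present if at least one file matches its glob.
    artifacts = {name: any(out_dir.glob(pattern)) for name, pattern in required.items()}
    return {
        "exit_code": proc.returncode,
        "wall_clock_s": round(wall_clock_s, 3),
        "artifacts": artifacts,
        "all_present": all(artifacts.values()),
    }


# Stand-in for the real pipeline: a child process that writes only one artifact.
WRITE_ONE = "import pathlib, sys; pathlib.Path(sys.argv[1], 'poam.json').write_text('{}')"
# Hypothetical glob patterns, not the harness's real ones.
REQUIRED = {"poam_json": "poam*.json", "oscal_cd": "component-definition*.json"}

with tempfile.TemporaryDirectory() as tmp:
    result = timed_run([sys.executable, "-c", WRITE_ONE, tmp], Path(tmp), REQUIRED)

print(result)
```

A run that exits 0 but leaves `all_present` false is the silent-failure case the artifact check is meant to catch.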

What this is NOT measuring

  • Authorization timeline. "Tool work done in 12 minutes" is not "FedRAMP authorization in 12 minutes." The customer-review, 3PAO assessment, and FedRAMP PMO acceptance steps remain. Don't read the numbers as authorization-completion metrics.
  • Customer review of drafted artifacts. Every Efterlev-drafted narrative carries requires_review: Literal[True]; reviewer time is the next-largest term in the equation, and Efterlev doesn't measure it.
  • Anything beyond a fixture. The numbers reflect repo-shape fixtures vendored under evals/fixtures/. Customer codebases will vary: bigger codebases → more evidence to classify → more LLM tokens → higher wall-clock. The published numbers are floor estimates for a Phase-2-lite-shaped workload.

How to reproduce

Local (uses your ANTHROPIC_API_KEY):

uv run python scripts/benchmark.py \
    --fixture evals/fixtures/csp-starter-cfn \
    --model claude-haiku-4-5 \
    --runs 3

Output lands under .benchmark-results/<timestamp>/summary.json with per-run + aggregate (mean, median, p95, max) for wall-clock + cost.

CI (workflow_dispatch):

gh workflow run benchmark-dispatch.yml \
    -f fixture=evals/fixtures/csp-starter-cfn \
    -f llm_model=claude-haiku-4-5 \
    -f runs=3

The workflow uploads .benchmark-results/** as an artifact retained for 90 days; download via gh run download.

Methodology

Why latency is reported as median + p95 + max, not just mean. Anthropic API latency is bimodal: most runs cluster near 60-90s for a typical Phase-2-lite fixture, but the tail extends to 200s+ when the API hits regional load. Mean alone hides the tail. Customer-experience math uses p95 (the worst run in 20).
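
To make the point about the tail concrete, here is a small sketch of the aggregation. The sample values are hypothetical, and the nearest-rank p95 is an assumption: the harness may compute its percentile differently.

```python
import math
import statistics


def summarize(samples):
    """Return mean, median, nearest-rank p95, and max for a list of seconds."""
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest sample with >= 95% of samples at or below it.
    rank = math.ceil(0.95 * len(ordered))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical wall-clock seconds: two typical runs and one tail run.
runs = [72.0, 85.0, 210.0]
stats = summarize(runs)
print(stats)
```

With these inputs the mean lands near 122 s, a value none of the runs is close to, while the median (85 s) and max (210 s) describe the typical and tail behaviour separately. Under the nearest-rank definition, p95 equals max at N=3.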

Why N=3 by default. Three runs is the smallest sample where median is meaningful and p95 is suggestive. N=10 would be more statistically sound but costs ~3.3× more API spend; for a maintainer-dispatched benchmark, N=3 + occasional N=10 deep-dive is the right ratio.

Why Haiku 4.5 is the default measurement model. The CFN maintainer-validation pass at v0.1.81 + v0.1.98 hit 44/44 = 100% precision + 100% recall on Haiku 4.5 across two fixtures. If classification quality is invariant on Haiku for fixture-shaped workloads, then the cost-sensitive default is Haiku — and the benchmark numbers should reflect that. Run on Sonnet or Opus when you want quality-vs-cost comparison data.

Why the full pipeline (not just gap agent). Customers invoke efterlev report run, which includes init, scan, gap, document, poam, and OSCAL emit. Benchmarking gap-only undercounts wall-clock by ~30-50% versus what a customer actually experiences.

Bedrock backend — supported via --backend bedrock. Per AWS pricing agreements, Bedrock-Haiku is typically ~30% cheaper than Anthropic API Haiku in commercial regions; published numbers should specify backend.

Fixtures

Five fixtures vendored under evals/fixtures/ are appropriate for the benchmark:

| Fixture | IaC | Resource count (approx.) | Notes |
| --- | --- | --- | --- |
| csp-starter-cfn | CloudFormation | ~30 | v0.1.78 first labeled CFN fixture; Phase 2 lite ground-truth |
| aws-vpc-cfn | CloudFormation | ~40 | v0.1.97; AWS Quick Start VPC |
| aws-vpc-tfm | Terraform module | ~50 | terraform-aws-modules/vpc |
| aws-s3-bucket-tfm | Terraform module | ~10 | terraform-aws-modules/s3-bucket |
| aws-lambda-tfm | Terraform module | ~20 | terraform-aws-modules/lambda |

Smaller fixtures (govnotes-v1, iam-dynamic-policy, encryption-mixed, serverless-mixed, runtime-monitoring) are useful for unit-test-shape benchmarks but aren't representative of customer codebases.

Results

v0.1.117 — GPT-OSS-120B fails maintainer-validation gate (provisional recommendation retracted)

Dispatched 2026-05-15 locally against both labeled CFN fixtures using evals/cli.py run (the eval-harness path that compares verdicts to ground-truth labels — distinct from the benchmark which only counts classification completeness). User decision: focus on openai.gpt-oss-120b-1:0 for full eval since it's GovCloud-available.

| Fixture | Precision (vs ground truth) | Recall | Haiku 4.5 baseline |
| --- | --- | --- | --- |
| csp-starter-cfn | 30.4% (7/23) | 87.5% (7/8) | 100/100 (v0.1.81) |
| aws-vpc-cfn | 23.8% (5/21) | 100% (5/5) | 100/100 (v0.1.98) |
| Combined | ~27% (12/44) | ~92% | 100% |
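
The combined row pools the raw counts across both fixtures rather than averaging the two percentages. A short sketch of that arithmetic, using only the counts from the table above (the `pooled` helper name is illustrative):

```python
# (correct, total) pairs copied from the per-fixture rows.
precision_counts = {"csp-starter-cfn": (7, 23), "aws-vpc-cfn": (5, 21)}
recall_counts = {"csp-starter-cfn": (7, 8), "aws-vpc-cfn": (5, 5)}


def pooled(counts):
    """Sum numerators and denominators across fixtures, then take the ratio."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct, total, 100 * correct / total


precision = pooled(precision_counts)
recall = pooled(recall_counts)
print(f"precision: {precision[0]}/{precision[1]} = {precision[2]:.1f}%")
print(f"recall:    {recall[0]}/{recall[1]} = {recall[2]:.1f}%")
```

Pooling gives 12/44 for precision and 12/13 for recall, which round to the ~27% and ~92% shown in the combined row.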

RETRACTED: The provisional recommendation below ("GPT-OSS-120B = best cost-quality on Bedrock") was based on the benchmark's classification COUNT (60/60 verdicts emitted). Maintainer-validation reveals the verdicts are mostly WRONG — GPT-OSS labels everything-without-evidence as not_implemented, missing the evidence_layer_inapplicable distinction that's load-bearing for the procedural KSI family (AFR / CED / INR themes). Completeness ≠ correctness.

Honest reading

  • Benchmark "completeness" is necessary but not sufficient. A model can produce 60 verdicts that are all wrong, and the benchmark won't notice. The eval-harness ground-truth comparison is the gate that actually validates classification quality.
  • GPT-OSS recall is good (87.5% / 100%) — when it says something IS implemented or partial, it's usually right. The over-classification is one-sided: false not_implemented for procedural KSIs.
  • The fix isn't "tune the GPT-OSS prompt" first. efterlev's evidence_layer_inapplicable verdict is a deliberate concept that reasoning-without-Claude-context-window doesn't reproduce reliably. Could be addressed via prompt tuning, but that's customer-driven work — not the hill to die on when Claude Haiku 4.5 on Bedrock is available + already maintainer-validated at 100/100.
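
The gap between completeness and correctness can be shown with a toy example. The KSI IDs below are hypothetical and the `agreement` score is a deliberate simplification, not the eval harness's precision/recall definition. Only the verdict names come from this document.

```python
# Hypothetical ground-truth labels for four KSIs.
ground_truth = {
    "KSI-EXAMPLE-1": "implemented",
    "KSI-EXAMPLE-2": "partial",
    "KSI-EXAMPLE-3": "evidence_layer_inapplicable",
    "KSI-EXAMPLE-4": "evidence_layer_inapplicable",
}

# Hypothetical model output: every KSI gets a verdict, but procedural ones
# are collapsed into not_implemented.
model_output = {
    "KSI-EXAMPLE-1": "implemented",
    "KSI-EXAMPLE-2": "partial",
    "KSI-EXAMPLE-3": "not_implemented",
    "KSI-EXAMPLE-4": "not_implemented",
}


def completeness(expected, emitted):
    """Fraction of expected KSIs that received any verdict at all."""
    return sum(1 for ksi in expected if ksi in emitted) / len(expected)


def agreement(expected, emitted):
    """Fraction of expected KSIs whose verdict matches the ground truth."""
    return sum(1 for ksi, verdict in expected.items() if emitted.get(ksi) == verdict) / len(expected)


print(completeness(ground_truth, model_output))
print(agreement(ground_truth, model_output))
```

Here completeness is 1.0 while agreement is 0.5: a count-based benchmark sees a perfect run, and only the comparison against labels exposes the misclassified procedural KSIs.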

Next: validate Haiku 4.5 on Bedrock (proven model, GovCloud path)

Anthropic Claude Haiku 4.5 hit 100/100 precision+recall on these fixtures via Anthropic API at v0.1.81 + v0.1.98. The same model is available on Bedrock at the inference profile us.anthropic.claude-haiku-4-5-20251001-v1:0 — same inference quality, GovCloud-deployable.

v0.1.118 — Bedrock Claude Haiku 4.5 validated across ALL 5 fixtures (111/112 = 99.1%)

Dispatched 2026-05-15 locally against all 5 maintainer-labeled fixtures (2 CFN + 3 TF) on the Bedrock Anthropic endpoint. Extends the v0.1.117 CFN-only validation to the full v0.1.69-scope.

| Fixture | Type | Backend | Precision | Recall | Notes |
| --- | --- | --- | --- | --- | --- |
| csp-starter-cfn (rev 2) | CFN | bedrock | 24/24 (100%) | 100% | |
| aws-vpc-cfn (rev 2) | CFN | bedrock | 21/21 (100%) | 100% | |
| aws-vpc-tfm (rev 3) | TF | bedrock | 21/21 (100%) | 100% | |
| aws-s3-bucket-tfm | TF | bedrock | 25/25 (100%) | 100% | |
| aws-lambda-tfm | TF | bedrock | 20/21 (95.2%) | 100% | 1 over-classification: CNA-MAT (partial vs ground-truth not_implemented; borderline case) |
| Combined | | | 111/112 (99.1%) | 100% | |

Compared to the Anthropic-API baseline (67/67 TF at v0.1.69 + 44/44 CFN at v0.1.81+v0.1.98 = 111/111 = 100%), the Bedrock-Haiku result of 111/112 = 99.1% is within model-variance band — the single miss is a borderline partial vs not_implemented over-classification of KSI-CNA-MAT on aws-lambda-tfm, the kind of one-step drift that re-labeling or temperature variance can flip either direction.

Substantively equivalent to 100% baseline. The Anthropic-API → Bedrock switch is quality-neutral for Haiku 4.5 across the full labeled-fixture scope.

v0.1.117 — Bedrock Claude Haiku 4.5 validated at 100/100 across both CFN fixtures (superseded by v0.1.118 above)

Dispatched 2026-05-15 locally against both labeled CFN fixtures on the Bedrock Anthropic endpoint, immediately after the GPT-OSS regression discovery.

| Fixture | Backend | Model | Precision | Recall |
| --- | --- | --- | --- | --- |
| csp-starter-cfn (rev 2) | bedrock | us.anthropic.claude-haiku-4-5-20251001-v1:0 | 24/24 (100%) | 100% |
| aws-vpc-cfn (rev 2) | bedrock | us.anthropic.claude-haiku-4-5-20251001-v1:0 | 21/21 (100%) | 100% |
| Combined | | | 45/45 (100%) | 100% |

Matches the Anthropic-API baseline at 100% while covering one more label (45/45 here vs 44/44 across 23+21 labels at v0.1.81+v0.1.98). The extra label on csp-starter-cfn reflects revision 2 of the ground truth, which added a label after v0.1.81.

The graduated recommendation

| Goal | Bedrock model |
| --- | --- |
| GovCloud-deployable + maximum classification quality + cheap | us.anthropic.claude-haiku-4-5-20251001-v1:0 |
| GovCloud-deployable + maximum quality | us.anthropic.claude-opus-4-7 (Opus on Bedrock) |
| Speed-optimized testing (caveats: ~98% completeness; precision/recall not measured) | nvidia.nemotron-super-3-120b |
| Remediation Agent (always) | us.anthropic.claude-opus-4-7 |

openai.gpt-oss-120b-1:0 is NOT recommended for Gap Agent classification — 27% precision rules it out for production use. Could be revisited if a customer wants to invest in prompt-tuning the model's recognition of evidence_layer_inapplicable semantics.

Comparison summary across all v0.1.117 evals

| Backend / Model | Precision (combined) | Cost per fixture | GovCloud |
| --- | --- | --- | --- |
| bedrock / Claude Haiku 4.5 | 45/45 (100%) ✓ | ~$0.40 | |
| anthropic / Claude Haiku 4.5 (baseline) | 44/44 (100%) | ~$0.30 | |
| anthropic / Opus 4.7 + Sonnet 4.6 | (assumed 100%) | ~$60 | |
| bedrock / Nemotron Super 120B | (precision/recall not measured; benchmark missed 1 KSI) | ~$0.40 | |
| bedrock / GPT-OSS-120B | 12/44 (27%) | ~$0.50 | |

Cost the user actually paid validating this

  • csp-starter-cfn × GPT-OSS: $0.47
  • aws-vpc-cfn × GPT-OSS: ~$0.50
  • csp-starter-cfn × Nemotron (benchmark only): $0.40
  • csp-starter-cfn × Bedrock-Haiku: ~$0.40
  • aws-vpc-cfn × Bedrock-Haiku: ~$0.40
  • Total: ~$2.20 to definitively answer the GovCloud-cheap-model question with maintainer-validated numbers.

v0.1.116 — Bedrock model comparison (Nemotron + GPT-OSS, both 120B)

Dispatched 2026-05-15 locally against csp-starter-cfn after the v0.1.115 model-param fix + v0.1.116 agent-backend fix. AWS user in us-east-1; both Bedrock model accesses pre-enabled. N=1 each — not statistically meaningful for variance, but enough to surface classification-quality differences.

| Backend | Model | Wall-clock | Cost | KSIs | Tokens (in / out) |
| --- | --- | --- | --- | --- | --- |
| anthropic | claude-opus-4-7 + sonnet-4-6 fallback (v0.1.114 baseline) | 369.5 s | ~$60 (hand-calc) | 60/60 | 942K + 80K / 606K + 12K |
| bedrock | nvidia.nemotron-super-3-120b | 83.82 s (4.4× faster) | $0.40 (150× cheaper) | 59/60 (dropped KSI-PIY-RIS) | 677K / 452K |
| bedrock | openai.gpt-oss-120b-1:0 | 181.3 s (2× faster) | $0.47 (127× cheaper) | 60/60 ✓ | 732K / 553K |

All three runs produced all artifacts: scan, gap, documentation, POA&M markdown, OSCAL POA&M, OSCAL component-definition.
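
The "faster" and "cheaper" multipliers in the table are plain ratios against the v0.1.114 Opus + Sonnet baseline. A short sketch that recomputes them from the table's own figures. The ~$60 baseline is itself a hand-calculated estimate, so the cost ratios inherit that uncertainty.

```python
# Figures copied from the comparison table (N=1 per row).
baseline = {"wall_clock_s": 369.5, "cost_usd": 60.0}
candidates = {
    "nvidia.nemotron-super-3-120b": {"wall_clock_s": 83.82, "cost_usd": 0.40},
    "openai.gpt-oss-120b-1:0": {"wall_clock_s": 181.3, "cost_usd": 0.47},
}

ratios = {
    model: {
        "speedup": baseline["wall_clock_s"] / run["wall_clock_s"],
        "cost_ratio": baseline["cost_usd"] / run["cost_usd"],
    }
    for model, run in candidates.items()
}

for model, r in ratios.items():
    print(f"{model}: {r['speedup']:.1f}x faster, {r['cost_ratio']:.0f}x cheaper")

# Nemotron vs GPT-OSS wall-clock, the basis for the "2.2x faster" finding.
nemotron_vs_gpt_oss = (
    candidates["openai.gpt-oss-120b-1:0"]["wall_clock_s"]
    / candidates["nvidia.nemotron-super-3-120b"]["wall_clock_s"]
)
print(f"Nemotron vs GPT-OSS: {nemotron_vs_gpt_oss:.1f}x faster")
```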

Key findings

1. Both Bedrock 120B models are two orders of magnitude cheaper than Opus 4.7 (~150× for Nemotron, ~127× for GPT-OSS). Same cost-order, real customer-facing story for both GovCloud and cost-sensitive deployments.

2. GPT-OSS-120B achieves Opus-class classification completeness (60/60) at $0.47. Nemotron Super 120B dropped 1 KSI (KSI-PIY-RIS, "Reviewing Investments in Security", a procedural KSI). Worth flagging as a model-quality observation; GPT-OSS appears stronger here.

3. Nemotron is 2.2× faster than GPT-OSS at similar cost. Reasoning models emit more output tokens (553K vs 452K) which lengthens wall-clock; same per-token rate keeps cost in the same order.

4. Bedrock Converse API works model-agnostically. efterlev's existing AnthropicBedrockClient invokes Bedrock with a provider-neutral message shape; both NVIDIA and OpenAI models worked without adapter changes.

Speed vs cost vs quality trade-offs

| Goal | Recommended model |
| --- | --- |
| Maximum classification quality | claude-opus-4-7 on Anthropic API or Bedrock |
| Best cost-quality (Bedrock) | openai.gpt-oss-120b-1:0 (Opus-class completeness, ~127× cheaper) |
| Best speed (Bedrock) | nvidia.nemotron-super-3-120b (4.4× faster than Opus, ~150× cheaper, ~98% completeness) |
| Remediation Agent (always) | claude-opus-4-7 (Terraform diff generation needs strongest code reasoning; cost less critical because it's per-KSI on-demand) |

Caveats — what's missing

  • N=1 per cell. Variance unmeasured. Run with --runs 3 for median/p95 once a baseline cost-per-fixture is acceptable.
  • No precision+recall vs ground-truth labels. Classification count completeness ≠ classification correctness. Run the eval-harness-cfn.yml workflow with each backend/model combination to get the v0.1.69 / v0.1.81 / v0.1.98-style maintainer-validated precision+recall numbers. Critical for the customer-recommendation table above to graduate from "first impressions" to "validated."
  • csp-starter-cfn is one fixture shape. Numbers for aws-vpc-tfm, aws-vpc-cfn, aws-s3-bucket-tfm, aws-lambda-tfm will vary. Re-run for each fixture once the maintainer-validation against Bedrock models settles.
  • Documentation Agent narratives not quality-assessed. The documentation step succeeded for all three runs (artifact present), but the actual narrative text quality wasn't compared. A blinded 3PAO review (per v0.1.11 methodology) is the right next step.

v0.1.116 — Nemotron-only first run (kept for diff history)

Dispatched 2026-05-15 locally against csp-starter-cfn after the v0.1.115 model-param fix + v0.1.116 agent-backend fix. AWS user in us-east-1; Bedrock model access pre-enabled.

| Fixture | Backend | Model | N | Wall-clock | Tokens | Cost | KSIs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| csp-starter-cfn | bedrock | nvidia.nemotron-super-3-120b | 1 | 83.82 s | 677K in + 452K out | $0.40 | 59/60 |

All artifacts present: scan, gap, documentation, POA&M markdown, OSCAL POA&M, OSCAL component-definition.

Comparison to the v0.1.114 unfixed-harness Opus run

| Metric | v0.1.114 (Opus + Sonnet defaults; harness bug) | v0.1.116 (Nemotron, fixed harness) |
| --- | --- | --- |
| Wall-clock | 369.5 s | 83.82 s (4.4× faster) |
| Cost | ~$60 hand-calc | $0.40 (150× cheaper) |
| KSI classifications | 60 | 59 (1 missed) |

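The 4.4× wall-clock gain is directionally consistent with the "7× higher throughput" claim that NVIDIA's spec page makes for the Nemotron MoE architecture (12B active / 120B total), though one run against a different model is not a like-for-like throughput test.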

Quality observation: Nemotron dropped 1 KSI

KSI-PIY-RIS (Reviewing Investments in Security) was absent from Nemotron's Gap Agent output. This is a procedural KSI with controls about executive support and security investment review (the "soft" end of the FRMR catalog, where LLMs sometimes elide or skip items). That puts Nemotron at 98.3% completeness (59/60) vs Opus's 100%. Worth flagging in the Nemotron-specific caveat for the published numbers; this is model-quality drift rather than a structural bug.

For maintainer-validation usage, the eval-harness's per-KSI precision+recall measurement (the v0.1.69 / v0.1.81 / v0.1.98 methodology) is the right way to measure quality drift, not the benchmark's classification-count tally. Run eval-harness-cfn.yml with --llm-backend bedrock --llm-model nvidia.nemotron-super-3-120b to get the real precision/recall comparison vs Opus + Haiku.

What this means for the GovCloud-deployable customer story

Nemotron Super 120B on Bedrock us-east-1 is about 150× cheaper than Opus 4.7 on Anthropic API for full report run against a typical fixture, and 4.4× faster wall-clock. With ~98% classification completeness vs Opus, the cost-quality trade is favorable for:

  • CI iteration loops (per-PR runs, drift checks)
  • Cost-sensitive customer deployments in GovCloud where Bedrock is the only authorized path anyway

Recommended customer guidance once a maintainer-validation precision/recall measurement lands:

  • Default: Anthropic API + Sonnet 4.6 for the full Opus-class classification quality
  • Cost-sensitive / GovCloud: Bedrock + Nemotron Super 120B (with documented ~98% classification completeness caveat)
  • Remediation Agent stays on Opus 4.7 regardless — Terraform diff generation needs the strongest code reasoning

v0.1.114 — first dispatched numbers (1 run, with caveats)

Dispatched 2026-05-15 via benchmark-dispatch.yml run 25913899878 on the v0.1.114 codebase (= the v0.1.112 harness + v0.1.113/v0.1.114 arc additions, no harness change between).

| Fixture | Requested model | Actual models run | N | Wall-clock | Tokens | Estimated cost |
| --- | --- | --- | --- | --- | --- | --- |
| csp-starter-cfn | claude-haiku-4-5 | opus-4-7 + sonnet-4-6 (see ⚠️) | 1 | 369.5 s | 942K in + 606K out (Opus) + 80K in + 12K out (Sonnet) | ~$60 hand-calc ($0 reported) |

60 KSI classifications produced. All artifacts present (scan, gap, documentation, OSCAL POA&M, OSCAL CD).

⚠️ Known issues with this run (tracked for v0.1.115 fixes)

  1. --model claude-haiku-4-5 was a no-op. benchmark.py sets EFTERLEV_LLM_BACKEND/MODEL env vars but the agents don't read them — they fell back to hardcoded defaults (Gap on Opus 4.7, Documentation on Sonnet 4.6). This run is therefore an Opus + Sonnet measurement, not a Haiku one.
  2. Estimated cost shows $0. efterlev.llm.pricing doesn't have entries for claude-opus-4-7 or claude-sonnet-4-6 model IDs; the lookup silently returned 0. Hand-calculated cost from token counts at standard Anthropic Opus + Sonnet pricing: ~$60 for the single run (vs the ~$0.30-1 estimate for Haiku I had in the methodology section).
  3. poam_md artifact reports missing despite the pipeline completing successfully. Benchmark glob (poam-*.md) doesn't match the actual filename pattern. (poam.json + oscal artifacts both present, so the markdown POA&M actually was emitted — this is a detection bug only.)
  4. Artifact upload failed ("No files were found with the provided path: .benchmark-results"). The cp -r ${RUNNER_TEMP}/... stage step doesn't actually stage files; the upload-artifact path needs to point directly at ${RUNNER_TEMP}/... instead.
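
Issue 2 is the kind of failure a stricter lookup would surface immediately. The sketch below is not efterlev.llm.pricing: the table, the function name, and the per-million-token rates are all assumptions. The rates were chosen for illustration because they reproduce the ~$60 hand calculation from the token counts reported for this run.

```python
# ASSUMED USD rates per million tokens as (input, output). Not taken from the
# source document; verify against current provider pricing before relying on them.
PRICES_PER_MTOK = {
    "claude-opus-4-7": (15.00, 75.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}


def cost_usd(model, input_tokens, output_tokens):
    """Estimate cost, raising on unknown models instead of silently returning 0."""
    try:
        in_rate, out_rate = PRICES_PER_MTOK[model]
    except KeyError:
        raise KeyError(f"no pricing entry for model {model!r}") from None
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Token counts from the v0.1.114 run.
opus = cost_usd("claude-opus-4-7", 942_000, 606_000)
sonnet = cost_usd("claude-sonnet-4-6", 80_000, 12_000)
total = opus + sonnet
print(f"Opus ${opus:.2f} + Sonnet ${sonnet:.2f} = ${total:.2f}")
```

Raising on a missing entry turns a misleading $0 line in the summary into a visible failure at the point where the model ID is first priced.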

Honest reading of this number

For the v0.1.114 codebase, full report run against csp-starter-cfn on the default agent stack (Opus 4.7 + Sonnet 4.6) completes in ~6 minutes wall-clock at ~$60 LLM spend. The Haiku-only measurement we wanted (target: ~$0.30, ~3-5 min) is blocked on the v0.1.115 model-param fix; do NOT re-dispatch with intent to measure Haiku until that bug is fixed.

This is the kind of bug only a real dispatch surfaces. Methodology working as intended.

Pending (after v0.1.115 fixes)

| Fixture | Model | Backend | N | Wall-clock (median / p95 / max) | Cost (median / max) | KSIs |
| --- | --- | --- | --- | --- | --- | --- |
| csp-starter-cfn | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |
| csp-starter-cfn | claude-sonnet-4-6 | anthropic | 3 | _ | _ | _ |
| aws-vpc-cfn | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |
| aws-vpc-tfm | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |
| aws-s3-bucket-tfm | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |
| aws-lambda-tfm | claude-haiku-4-5 | anthropic | 3 | _ | _ | _ |

What this number is — and isn't

Efterlev's "time-to-FRMR" is a single CLI invocation on the customer's laptop or CI runner. It is not an authorization timeline. Efterlev does not — and cannot — shorten the 3PAO assessment. What it does is take the draft-FRMR step (the artifact a customer hands to a 3PAO) from ~weeks-to-author down to minutes-of-LLM-runtime + reviewer hours.

Reporting discipline:

  • Pair every benchmark number with the LIMITATIONS caveat in the same surface (README headline, marketing site, blog post). Don't ship the time without the "this is tool runtime, not authorization" framing — the same discipline LIMITATIONS.md applies to every other Efterlev claim.
  • Measure only Efterlev. Report Efterlev's own runtime; don't frame it as a multiple of any other tool's number — the units differ and the comparison would be meaningless.

Refresh cadence

Re-run after material changes to:

  • The agent prompts (Gap, Documentation, Remediation) — affects token count + classification quality.
  • The detector set — affects evidence count → token count.
  • The OSCAL output stages (deterministic, but adds wall-clock).
  • New labeled fixtures vendored under evals/fixtures/.

Otherwise: re-run quarterly to track LLM-provider-side latency drift.