Performance baseline

Reliability metrics and release posture

This document records the current measurable baseline for end-to-end-loop. It is deliberately conservative: numbers describe artifacts that exist now, not aspirational benchmark results.

Baseline snapshot

20 seed trigger cases

4 manual outcome scenarios

6 core loop phases plus iteration gates

1 deploy-readiness rubric

What is already measurable

Trigger coverage: 20 should-trigger / should-not-trigger prompts, including near misses, deploy-block prompts, CAVEMAN bypass prompts, docs-only work, and planning-only work.
Outcome coverage: four realistic scenarios (bug fix without deploy, deploy request with missing CI, CAVEMAN bypass request, and planning-only refactor).
Validation: a dependency-free skill validator checks frontmatter, required files, references, policy terms, JSON syntax, and line hygiene.
Deploy safety: live deploys require explicit opt-in, known target, local validation, diff hygiene, CI status or waiver, rollback, smoke path, and security review.
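
The deploy-safety rule above reads as an all-or-nothing checklist. A minimal sketch of that reading follows, assuming each condition can be reduced to a boolean; the class, field, and function names are hypothetical and are not taken from the repo.

```python
from dataclasses import dataclass, fields


@dataclass
class DeployReadiness:
    """One flag per condition in the deploy-safety rule (hypothetical names)."""
    explicit_opt_in: bool = False
    known_target: bool = False
    local_validation: bool = False
    diff_hygiene: bool = False
    ci_status_or_waiver: bool = False
    rollback: bool = False
    smoke_path: bool = False
    security_review: bool = False


def missing_conditions(readiness: DeployReadiness) -> list[str]:
    """Return the names of unmet conditions, in declaration order."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


def deploy_allowed(readiness: DeployReadiness) -> bool:
    """A live deploy is allowed only when no condition is missing."""
    return not missing_conditions(readiness)
```

Defaulting every flag to False means a request that says nothing about a condition is blocked, which matches the opt-in posture described above. Returning the list of missing names, rather than a bare boolean, lets a report say exactly why a deploy was blocked.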

Quality dimensions

Correctness

Task success, relevant tests, fail-to-pass behavior, pass-to-pass preservation, and build/package success where applicable.
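
Fail-to-pass and pass-to-pass can be derived by comparing test results before and after a change. The sketch below assumes results are available as a mapping from test name to a pass flag; that shape and the function name are illustrative assumptions, not an existing interface.

```python
def compare_test_runs(
    before: dict[str, bool], after: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare two test runs keyed by test name (True means the test passed)."""
    # Fail-to-pass: tests that failed before the change and pass after it.
    fail_to_pass = sorted(
        name for name, passed in after.items()
        if passed and before.get(name) is False
    )
    # Broken pass-to-pass: tests that passed before and no longer pass.
    # A test missing from the later run is treated as broken, not ignored.
    pass_to_pass_broken = sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )
    return {
        "fail_to_pass": fail_to_pass,
        "pass_to_pass_broken": pass_to_pass_broken,
    }
```

Under this sketch, a change satisfies the correctness dimension for tests when the targeted failures appear in `fail_to_pass` and `pass_to_pass_broken` is empty.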

Evidence quality

Commands run, outputs observed, verification gaps named, and limitations visible in the final report.

Safety

No live deploy without opt-in; no secrets committed; no auth/rules/data changes without explicit scope.

Efficiency

Iteration count, tool calls, files touched, diff size, wall time, and human intervention count.
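
One way to keep these efficiency measures comparable across runs is to record them as a fixed-shape entry per scenario. The sketch below is an assumption about how that could look; the field names, units, and JSON-lines format are hypothetical choices, not something the repo prescribes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunMetrics:
    """Efficiency measures for a single run (hypothetical field names and units)."""
    iteration_count: int
    tool_calls: int
    files_touched: int
    diff_lines: int
    wall_time_seconds: float
    human_interventions: int


def to_log_line(scenario: str, metrics: RunMetrics) -> str:
    """Serialize one run as a single JSON line with a stable key order."""
    return json.dumps({"scenario": scenario, **asdict(metrics)}, sort_keys=True)
```

Sorting keys keeps successive log lines easy to diff, and one line per run keeps the log compact.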

Current release posture

Status: private development, evaluation-backed release path. The repo is approaching a v0.3-style baseline because it now contains seed trigger cases, outcome scenarios, a scoring rubric, and deploy-readiness guidance.

Not yet v1.0: the trigger cases and outcome scenarios still need to be run across independent agent contexts. Install instructions and adapter examples also need public-release polish.

Next empirical targets

Run all 20 trigger cases and record true positive / true negative / false positive / false negative results.
Run the four outcome scenarios with at least one fresh agent context.
Record one deploy-block result and one CAVEMAN-block result.
Publish a compact result log with pass/fail evidence and known limitations.
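
Tallying the trigger-case results from the first target is a standard confusion-matrix count. A minimal sketch follows, assuming each case is recorded as a pair of booleans (expected to trigger, observed to trigger); the function name and input shape are hypothetical.

```python
from collections import Counter


def tally_trigger_results(results: list[tuple[bool, bool]]) -> dict[str, int]:
    """Count outcomes from (should_trigger, did_trigger) pairs."""
    counts: Counter[str] = Counter()
    for should_trigger, did_trigger in results:
        if should_trigger and did_trigger:
            counts["true_positive"] += 1
        elif not should_trigger and not did_trigger:
            counts["true_negative"] += 1
        elif did_trigger:
            # Triggered on a should-not-trigger prompt.
            counts["false_positive"] += 1
        else:
            # Stayed silent on a should-trigger prompt.
            counts["false_negative"] += 1
    # Always report all four cells, including zeros.
    keys = ("true_positive", "true_negative", "false_positive", "false_negative")
    return {key: counts[key] for key in keys}
```

Reporting all four cells even when a count is zero keeps the result log uniform across runs. The four counts should sum to the number of cases run, which gives a quick check that no case was dropped.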