Performance baseline
Reliability metrics and release posture
This document records the current measurable baseline for end-to-end-loop. It is deliberately conservative: numbers describe artifacts that exist now, not aspirational benchmark results.
Baseline snapshot
What is already measurable
- Trigger coverage: 20 should-trigger / should-not-trigger prompts, including near misses, deploy-block prompts, CAVEMAN bypass prompts, docs-only work, and planning-only work.
- Outcome coverage: four realistic scenarios (bug fix without deploy, deploy request with missing CI, CAVEMAN bypass request, and planning-only refactor).
- Validation: a dependency-free skill validator checks frontmatter, required files, references, policy terms, JSON syntax, and line hygiene.
- Deploy safety: live deploys require explicit opt-in, known target, local validation, diff hygiene, CI status or waiver, rollback, smoke path, and security review.
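The deploy-safety rule above can be read as an all-or-nothing gate. The following is a minimal sketch of that reading, not the repo's actual implementation: the `DeployReadiness` class, its field names, and both helper functions are hypothetical names chosen to mirror the eight conditions in the bullet.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DeployReadiness:
    """Hypothetical record of the eight deploy-safety conditions.

    Every field defaults to False, so a condition counts as unmet
    until it is explicitly confirmed.
    """
    explicit_opt_in: bool = False
    known_target: bool = False
    local_validation: bool = False
    diff_hygiene: bool = False
    ci_status_or_waiver: bool = False
    rollback: bool = False
    smoke_path: bool = False
    security_review: bool = False


def missing_conditions(readiness: DeployReadiness) -> list[str]:
    """Return the names of conditions that are not yet satisfied."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


def may_deploy_live(readiness: DeployReadiness) -> bool:
    """Allow a live deploy only when every condition holds."""
    return not missing_conditions(readiness)


# Illustrative request: everything is confirmed except CI status or waiver.
request = DeployReadiness(
    explicit_opt_in=True,
    known_target=True,
    local_validation=True,
    diff_hygiene=True,
    ci_status_or_waiver=False,
    rollback=True,
    smoke_path=True,
    security_review=True,
)
print(may_deploy_live(request))     # False
print(missing_conditions(request))  # ['ci_status_or_waiver']
```

Defaulting every field to `False` keeps the gate conservative: a condition nobody checked blocks the deploy in the same way as one that failed.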
Quality dimensions
Correctness
Task success, relevant tests, fail-to-pass behavior, pass-to-pass preservation, and build/package success where applicable.
Evidence quality
Commands run, outputs observed, verification gaps named, and limitations visible in the final report.
Safety
No live deploy without opt-in; no secrets committed; no auth/rules/data changes without explicit scope.
Efficiency
Iteration count, tool calls, files touched, diff size, wall time, and human intervention count.
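One way to keep the four dimensions comparable across runs is to record them in a single structure per run. The sketch below assumes a hypothetical `RunRecord` layout; the field names, the `is_safe` helper, and all values are illustrative placeholders, not measured results or an existing schema in the repo.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """Hypothetical per-run record grouped by quality dimension."""
    scenario: str
    # Correctness
    task_succeeded: bool
    relevant_tests_passed: bool
    # Evidence quality
    commands_run: list[str] = field(default_factory=list)
    verification_gaps: list[str] = field(default_factory=list)
    # Safety
    live_deploy_without_opt_in: bool = False
    secrets_committed: bool = False
    unscoped_auth_rules_data_change: bool = False
    # Efficiency
    iterations: int = 0
    tool_calls: int = 0
    files_touched: int = 0
    diff_lines: int = 0
    wall_time_s: float = 0.0
    human_interventions: int = 0

    def is_safe(self) -> bool:
        """True when none of the three safety violations occurred."""
        return not (
            self.live_deploy_without_opt_in
            or self.secrets_committed
            or self.unscoped_auth_rules_data_change
        )


# Placeholder values for illustration only.
record = RunRecord(
    scenario="bug fix without deploy",
    task_succeeded=True,
    relevant_tests_passed=True,
    commands_run=["example-test-command"],
    verification_gaps=["no integration test for the fixed path"],
    iterations=2,
)
print(record.is_safe())
print(json.dumps(asdict(record), indent=2))
```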
Current release posture
Status: private development, evaluation-backed release path. The repo is approaching a v0.3-style baseline because it now contains seed trigger cases, outcome scenarios, a scoring rubric, and deploy-readiness guidance.
Not yet v1.0: the trigger cases and outcome scenarios still need to be run across independent agent contexts. Install instructions and adapter examples also need public-release polish.
Next empirical targets
- Run all 20 trigger cases and record true positive / true negative / false positive / false negative results.
- Run the four outcome scenarios with at least one fresh agent context.
- Record one deploy-block result and one CAVEMAN-block result.
- Publish a compact result log with pass/fail evidence and known limitations.
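Recording the trigger results could be as simple as tallying each case into one of the four outcome buckets. The sketch below shows one possible way to do that; `classify`, `tally`, and `precision_recall` are hypothetical helpers, and the sample data is a made-up placeholder, not a result from any actual run.

```python
from collections import Counter


def classify(should_trigger: bool, did_trigger: bool) -> str:
    """Map one trigger case to TP, FN, FP, or TN."""
    if should_trigger:
        return "TP" if did_trigger else "FN"
    return "FP" if did_trigger else "TN"


def tally(results):
    """Count outcomes over (should_trigger, did_trigger) pairs."""
    counts = Counter({"TP": 0, "TN": 0, "FP": 0, "FN": 0})
    for should_trigger, did_trigger in results:
        counts[classify(should_trigger, did_trigger)] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); None where the denominator is zero."""
    predicted_positive = counts["TP"] + counts["FP"]
    actual_positive = counts["TP"] + counts["FN"]
    precision = counts["TP"] / predicted_positive if predicted_positive else None
    recall = counts["TP"] / actual_positive if actual_positive else None
    return precision, recall


# Placeholder data: four invented cases, one per outcome bucket.
sample = [(True, True), (True, False), (False, False), (False, True)]
counts = tally(sample)
print(dict(counts))              # {'TP': 1, 'TN': 1, 'FP': 1, 'FN': 1}
print(precision_recall(counts))  # (0.5, 0.5)
```

Reporting the raw counts alongside any derived ratio keeps the result log honest with only 20 cases, where a single flipped case moves the ratios noticeably.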