Performance baseline

Reliability metrics and release posture

This document records the current measurable baseline for end-to-end-loop. It is deliberately conservative: the numbers describe artifacts that exist today, not aspirational benchmark results.

Baseline snapshot

20 seed trigger cases
4 manual outcome scenarios
6 core loop phases plus iteration gates
1 deploy-readiness rubric

What is already measurable

Quality dimensions

Correctness

Task success, relevant tests, fail-to-pass behavior, pass-to-pass preservation, and build/package success where applicable.
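
As a minimal sketch of how the fail-to-pass and pass-to-pass checks could be computed, the snippet below compares two test runs. It assumes the common meaning of the terms (fail-to-pass: failed before the change and passes after; pass-to-pass: passed before and still passes). The function name `compare_runs` and the dict-of-booleans input shape are hypothetical and not part of the repo.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Classify tests by outcome before and after a change.

    Both arguments map a test name to True (passed) or False (failed).
    Only tests present in both runs are compared (an assumption of this sketch).
    """
    shared = sorted(before.keys() & after.keys())
    return {
        # Failed before the change, passes after it.
        "fail_to_pass": [t for t in shared if not before[t] and after[t]],
        # Passed before the change and still passes.
        "pass_to_pass": [t for t in shared if before[t] and after[t]],
        # Passed before the change, fails after it.
        "regressions": [t for t in shared if before[t] and not after[t]],
    }


# Illustrative data only, not recorded results.
example = compare_runs(
    {"test_a": False, "test_b": True, "test_c": True},
    {"test_a": True, "test_b": True, "test_c": False},
)
```

Under this reading, a change preserves pass-to-pass behavior only when the `regressions` list is empty.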

Evidence quality

Commands run, outputs observed, verification gaps named, and limitations visible in the final report.

Safety

No live deploy without opt-in; no secrets committed; no auth/rules/data changes without explicit scope.

Efficiency

Iteration count, tool calls, files touched, diff size, wall time, and human intervention count.
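
One possible way to capture these efficiency signals per run is a small record type, sketched below. The class name `EfficiencyRecord`, the field names, and the split of diff size into added and removed lines are assumptions for illustration; the repo may record them differently.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EfficiencyRecord:
    """Hypothetical per-run record of the efficiency dimensions listed above."""
    iteration_count: int
    tool_calls: int
    files_touched: int
    diff_lines_added: int
    diff_lines_removed: int
    wall_time_seconds: float
    human_interventions: int

    @property
    def diff_size(self) -> int:
        # Diff size is taken here as added plus removed lines (an assumption).
        return self.diff_lines_added + self.diff_lines_removed

    def to_json(self) -> str:
        # Serialize the stored fields plus the derived diff size.
        data = asdict(self)
        data["diff_size"] = self.diff_size
        return json.dumps(data, sort_keys=True)


# Placeholder values for illustration only, not measured results.
record = EfficiencyRecord(
    iteration_count=3,
    tool_calls=17,
    files_touched=4,
    diff_lines_added=60,
    diff_lines_removed=12,
    wall_time_seconds=95.5,
    human_interventions=0,
)
```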

Current release posture

Status: private development, evaluation-backed release path. The repo is approaching a v0.3-style baseline because it now contains seed trigger cases, outcome scenarios, a scoring rubric, and deploy-readiness guidance.

Not yet v1.0: the trigger cases and outcome scenarios still need to be run across independent agent contexts. Install instructions and adapter examples also need public-release polish.

Next empirical targets

  1. Run all 20 trigger cases and record true positive / true negative / false positive / false negative results.
  2. Run the four outcome scenarios in at least one fresh agent context.
  3. Record one deploy-block result and one CAVEMAN-block result.
  4. Publish a compact result log with pass/fail evidence and known limitations.
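
The tally described in the first target can be sketched as follows. Each trigger case is represented as a pair of booleans: whether the case should trigger, and whether it did. The names `tally_trigger_results` and `precision_recall` are hypothetical, and precision and recall are an optional summary that the targets above do not require.

```python
from typing import Iterable, Optional


def tally_trigger_results(results: Iterable[tuple[bool, bool]]) -> dict[str, int]:
    """Count outcomes from (should_trigger, did_trigger) pairs."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for should_trigger, did_trigger in results:
        if should_trigger and did_trigger:
            counts["tp"] += 1   # true positive: triggered as expected
        elif not should_trigger and not did_trigger:
            counts["tn"] += 1   # true negative: correctly stayed quiet
        elif did_trigger:
            counts["fp"] += 1   # false positive: triggered when it should not
        else:
            counts["fn"] += 1   # false negative: missed an expected trigger
    return counts


def precision_recall(counts: dict[str, int]) -> tuple[Optional[float], Optional[float]]:
    """Return (precision, recall); None where the denominator is zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Illustrative pairs only, not recorded results.
sample_counts = tally_trigger_results(
    [(True, True), (True, False), (False, False), (False, True)]
)
sample_precision, sample_recall = precision_recall(sample_counts)
```

Returning None for an undefined ratio keeps an empty or one-sided result log from being reported as a perfect or zero score.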