end-to-end-loop — evidence-backed AI coding agents

What it is

A self-learning delivery loop for AI coding agents. It keeps the agent inside a disciplined workflow until a code component or application change is understood, implemented, verified, tested, and safely handed off.

Who it is for

AI researchers, software builders, and power users who want reliable agentic development: less skipped context, fewer fake test claims, clearer deploy gates, and better operational reports.

What it changes

It turns “agent says it is done” into an auditable artifact: planned scope, changed files, commands run, results, risks, limitations, and the next recommended action.

Before

Agent work that feels done, until you check.

Edits happen before the repo is understood.
Tests are skipped, weakened, or claimed without proof.
Deploy becomes a reflex instead of a governed decision.
Final reports hide uncertainty, failed checks, and risk.

After

A loop that forces evidence before confidence.

Discovery happens before implementation.
Every green claim needs observed output.
Deploy requires explicit opt-in, passing CI, a rollback plan, and smoke and security checks.
Reports show changes, commands, risks, and next action.

Works with your agent stack

One core discipline, multiple agent adapters.

Codex, Hermes Agent, Claude Code, Cursor, AGENTS.md

The product

A delivery loop, not a prompt vibe.

Most coding-agent failures are boring and repeatable: edit too early, skip reproduction, forget tests, hide uncertainty, or deploy without a rollback story. end-to-end-loop makes those failure modes explicit gates.

01 Discover

Clarify outcome, constraints, repo state, side effects, credentials, and risks.

02 Plan

Define small steps, acceptance criteria, test strategy, and delivery target.

03 Execute

Make scoped changes through the required CAVEMAN/Cavekit lane for code-producing work.

04 Verify

Prove behavior with observed evidence: commands, tests, diff review, or manual checks.

05 Test & review

Run relevant automated checks, smoke paths, and security review proportional to risk.

06 Deliver / report

Commit, PR, artifact, readiness report, or approved deploy, with limitations named.
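
To make the gate idea concrete, here is a minimal Python sketch (3.9+) of the six stages as an ordered sequence where a stage can only be closed in order and only with evidence attached. The names Stage, DeliveryLoop, and complete are hypothetical and chosen for illustration; they are not the skill's actual interface.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The six stages of the loop, in the order the page lists them."""
    DISCOVER = 1
    PLAN = 2
    EXECUTE = 3
    VERIFY = 4
    TEST_AND_REVIEW = 5
    DELIVER = 6


@dataclass
class DeliveryLoop:
    """Hypothetical tracker: current stage plus the evidence that closed each one."""
    current: Stage = Stage.DISCOVER
    evidence: dict = field(default_factory=dict)  # Stage -> list of evidence notes
    finished: bool = False

    def complete(self, stage: Stage, evidence: list[str]) -> None:
        if self.finished:
            raise RuntimeError("loop already finished")
        # Gate 1: no skipping ahead (for example, editing before discovery).
        if stage != self.current:
            raise RuntimeError(
                f"cannot complete {stage.name} while in {self.current.name}"
            )
        # Gate 2: a stage closes only with at least one piece of observed evidence.
        if not evidence:
            raise ValueError(f"{stage.name} needs observed evidence")
        self.evidence[stage] = list(evidence)
        if stage is Stage.DELIVER:
            self.finished = True
        else:
            self.current = Stage(stage + 1)
```

With this shape, calling complete(Stage.EXECUTE, ...) on a fresh loop raises an error because Discover and Plan have not been closed yet, which is the "edit too early" failure turned into an explicit gate.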

Evidence-backed reports

“Done” means the proof is visible.

Every completed task should leave an audit trail that a human can inspect: changed files, commands run, pass/fail results, known limitations, and the next recommended action.

Changed:
- src/auth/session.ts
- tests/auth/session.test.ts

Verified:
- npm test -- session.test.ts: PASS
- npm run lint: PASS
- Manual smoke: login/logout checked

Risks:
- OAuth edge cases need staging coverage

Next:
- Add staging smoke before production deploy
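
A report like the one above could be produced from structured data instead of free text. The following is a sketch only, assuming Python 3.9+; the Check and Report classes and their field names are hypothetical. It renders the same four sections and treats a task as done only when every check passed and has captured output behind it.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    command: str  # what was run, e.g. "npm run lint"
    result: str   # "PASS" or "FAIL"
    output: str   # captured output or note showing the result was observed


@dataclass
class Report:
    changed: list[str] = field(default_factory=list)
    verified: list[Check] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)
    next_actions: list[str] = field(default_factory=list)

    def is_done(self) -> bool:
        """True only if at least one check exists and all passed with output."""
        return bool(self.verified) and all(
            c.result == "PASS" and c.output.strip() for c in self.verified
        )

    def render(self) -> str:
        """Render the Changed / Verified / Risks / Next layout as plain text."""
        sections = [
            ("Changed", self.changed),
            ("Verified", [f"{c.command}: {c.result}" for c in self.verified]),
            ("Risks", self.risks),
            ("Next", self.next_actions),
        ]
        blocks = []
        for title, items in sections:
            lines = [f"{title}:"] + [f"- {item}" for item in items]
            blocks.append("\n".join(lines))
        return "\n\n".join(blocks)
```

A PASS with an empty output field makes is_done() return False, so a green claim without proof does not count.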

Safety model

Deploy is not the default ending.

CAVEMAN hard gate: Code-producing execution must use the configured CAVEMAN/Cavekit lane or stop for an explicit exception.

Observed evidence: No claims of green without command output, CI result, diff review, smoke test, or approval record.

Live deploy opt-in: Production deploy requires explicit approval, green/applicable CI, rollback, smoke/security checks, and credentials approval.

Risk-based ceremony: Small docs changes stay light. Auth, data, dependencies, and deploy paths get stronger gates.
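
Risk-based ceremony can be sketched as a mapping from changed paths to a gate tier. The glob patterns and the tier names "light", "standard", and "strong" below are assumptions made for illustration, not the skill's actual configuration; a real project would define its own.

```python
from fnmatch import fnmatch

# Hypothetical patterns for paths that touch auth, data, dependencies, or deploy.
HIGH_RISK_PATTERNS = [
    "*auth*",
    "*migrations*",
    "*package.json",
    "*lock*",
    "*deploy*",
    ".github/workflows/*",
]

# Hypothetical patterns for documentation-only changes.
LIGHT_PATTERNS = ["*.md", "docs/*"]


def risk_tier(changed_files: list[str]) -> str:
    """Return the gate tier for a set of changed files."""
    # High risk is checked first, so ambiguous paths err toward stronger gates.
    if any(fnmatch(path, pat) for path in changed_files for pat in HIGH_RISK_PATTERNS):
        return "strong"
    # Light only when every changed file is documentation.
    if changed_files and all(
        any(fnmatch(path, pat) for pat in LIGHT_PATTERNS) for path in changed_files
    ):
        return "light"
    return "standard"
```

Under these assumed patterns, a change set containing only README.md is "light", while one that also touches src/auth/session.ts is "strong".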

Research and performance documents

Product evidence, not office plumbing.

Whitepaper: Toward a Universal End-to-End Loop Skill

The research thesis, design implications, limitations, and artifact architecture.

Performance baseline: Reliability metrics and release posture

Current measurable baseline: trigger coverage, outcome scenarios, validation gates, and gaps.

Evaluation protocol: How end-to-end-loop should be tested

Trigger accuracy, loop compliance, deploy safety, CAVEMAN behavior, and result schema.

Deploy readiness: When agents may ship live changes

A pass/fail checklist for external writes, CI, rollback, smoke paths, and approvals.
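
As an illustration of what a pass/fail deploy checklist might look like in code, the sketch below models each gate as a boolean that defaults to False, so deploy stays blocked until every item is explicitly satisfied. The field names are hypothetical and loosely follow the live deploy opt-in requirements listed under the safety model; the actual checklist lives in the deploy readiness document.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DeployReadiness:
    """Hypothetical checklist; every gate is closed unless explicitly opened."""
    explicit_approval: bool = False
    ci_green_or_not_applicable: bool = False
    rollback_plan: bool = False
    smoke_checks: bool = False
    security_checks: bool = False
    credentials_approved: bool = False


def missing_gates(readiness: DeployReadiness) -> list[str]:
    """Names of the checklist items that are not yet satisfied."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


def may_deploy(readiness: DeployReadiness) -> bool:
    """Pass only when no gate is missing."""
    return not missing_gates(readiness)
```

Returning the list of missing gates, instead of a bare boolean, lets the final report name exactly why a deploy was withheld.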

Current status

Private development, evaluation-backed release path.

Product baseline active

README, evaluation rubric, trigger cases, outcome scenarios, and deploy readiness docs are in place.

Public release later

The skill stays private until docs, metrics, evals, and install examples are strong enough.

Site focus

dev-boss.nl is now a lean product site for end-to-end-loop and its research/performance artifacts.