Field note · 2026-05-30 10:44 UTC

From First Slice to Full Pressure Test: Raising the Readiness Bar Across Real Repos

A narrow readiness slice can validate structure, but full trust comes from pressure-testing execution claims, mode symmetry, and agent safety across real repositories.

Tags: repo-readiness, pressure-testing, execution-governance, agent-safety

Bobai Kato

Overview

Most readiness rollouts start with a narrow, useful slice. Ours did too:

model one honest contributor path
validate the contract shape
add a CI matrix lane
prove one green run

That is the right start. It is not maturity.

A contract can be structurally valid and still fail where real users and agents operate: task mode branches, cross-OS behavior, runtime probes, transitive task effects, and workflow-level proof semantics.

So we made a hard call: we stopped treating “one green workflow” as proof and moved to full pressure testing across real repositories.

Why The Initial Slice Was Not Enough

The first slice answers a useful question:

Can this repo express one runnable path?

But pressure testing answers the one that actually matters:

Are the declared paths truthful, executable, and safe under real execution conditions?

In practice, the real gaps only surfaced when we executed broader lanes:

non-internal task dry-runs by mode
native vs container path symmetry
cross-OS workflow behavior
safe-task boundary enforcement through transitive dependencies
runtime proof for long-running workflows

Without that expansion, we would have shipped false confidence.

What We Expanded

We raised the bar in five concrete ways, and we enforced them as merge criteria.

1) From workflow-only checks to task-surface checks

We stopped treating workflow success as full proof. If the contract declares runnable tasks, those tasks must be exercised directly too.

Verify (bash):

ota tasks --use
ota run <task> --dry-run

2) From single-mode assumptions to explicit mode truth

If a task claims native and container support, both paths must be validated explicitly. No exceptions.

Verify (bash):

ota run <task> --mode native --dry-run
ota run <task> --mode container --dry-run
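To make the symmetry check repeatable rather than manual, the two dry-runs can be wrapped in a small loop. This is a minimal sketch: the `ota run ... --mode ... --dry-run` invocation is the one shown in this post, while the `check_modes` helper, its output format, and the idea of passing the claimed modes as arguments are illustrative assumptions, not part of Ota.

```bash
#!/bin/sh
# Sketch only: check_modes is a hypothetical helper, not an Ota command.
# Usage: check_modes TASK MODE [MODE...]
# Dry-runs TASK once per claimed mode and reports each result.
# Returns non-zero if any claimed mode fails its dry-run.
check_modes() {
  task="$1"
  shift
  failed=0
  for mode in "$@"; do
    if ota run "$task" --mode "$mode" --dry-run >/dev/null 2>&1; then
      echo "ok   $task ($mode)"
    else
      echo "FAIL $task ($mode)"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_modes build native container` in a CI step fails the lane as soon as one claimed mode is not runnable, which is the point: a mode claim that is never exercised is not a claim we can trust.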

3) From Linux-only confidence to cross-OS confidence

A path that works on one host can still fail on another due to runtime/tooling surfaces, shell behavior, or unsupported-host semantics. Matrix coverage needs explicit host intent, including unsupported-host proof when that is the correct contract behavior.
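Explicit host intent can be expressed as "run this command and compare the result with what the contract declares for this host". The sketch below assumes a hypothetical `expect_outcome` helper and treats any non-zero exit as "unsupported"; that is a simplification, since a real lane should match whatever specific signal the tool emits for an unsupported host, which this post does not specify.

```bash
#!/bin/sh
# Sketch only: expect_outcome is a hypothetical helper, not an Ota command.
# Usage: expect_outcome INTENT CMD [ARG...]
# INTENT is "supported" or "unsupported" (the declared behavior for this host).
# Assumption: a zero exit means supported, any non-zero exit means unsupported.
expect_outcome() {
  intent="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    actual=supported
  else
    actual=unsupported
  fi
  if [ "$actual" = "$intent" ]; then
    echo "pass: $intent as declared"
  else
    echo "fail: declared $intent, observed $actual"
    return 1
  fi
}
```

A matrix lane for a host the contract intentionally excludes would then run something like `expect_outcome unsupported ota run <task> --mode native --dry-run`, so that an unexpected success is reported as loudly as an unexpected failure.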

4) From direct task safety to transitive task safety

Agent-safe declarations are only reliable if reachable dependencies are also safe with respect to writable/protected boundaries and declared task effects. We now treat transitive safety as mandatory, not optional.
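The transitive rule is a reachability check: a task is only agent-safe if every task reachable through its dependencies is also safe. Here is a minimal sketch of that walk, assuming a hypothetical text graph (`task: dep dep`) and a hand-written list of directly safe tasks. Neither is an Ota file format or command; a real check would read both from the contract.

```bash
#!/bin/sh
# Sketch only: GRAPH and DIRECTLY_SAFE are hypothetical example data.
# Each GRAPH line is "task: dep dep ...".
GRAPH='
lint:
test: build
build: fetch
fetch:
deploy: build
'
# Tasks considered safe when looked at in isolation.
DIRECTLY_SAFE='lint test build'

# Print the direct dependencies of a task (space-separated, possibly empty).
deps_of() {
  printf '%s\n' "$GRAPH" | awk -v t="$1" -F': *' '$1 == t { print $2 }'
}

# Succeed if the task appears in DIRECTLY_SAFE.
is_directly_safe() {
  case " $DIRECTLY_SAFE " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Worklist traversal over everything reachable from the given task.
# Prints the first unsafe task reached and returns non-zero.
# Task names are assumed to contain no spaces or glob characters.
transitively_safe() {
  todo="$1"
  seen=" "
  while :; do
    set -- $todo                      # split the worklist into words
    [ "$#" -gt 0 ] || return 0        # nothing left: every reachable task is safe
    current="$1"
    shift
    todo="$*"
    case "$seen" in *" $current "*) continue ;; esac
    seen="$seen$current "
    if ! is_directly_safe "$current"; then
      echo "unsafe: $current"
      return 1
    fi
    todo="$todo $(deps_of "$current")"
  done
}
```

In this example data, `test` is safe on its own but reaches `fetch` through `build`, so `transitively_safe test` reports `unsafe: fetch`. That is exactly the case a direct-only check misses.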

5) From “green enough” to classification discipline

Every failure should be classified as one of:

Ota bug
Ota maturity gap
contract issue
repo baseline issue
CI environment limitation
intentional unsupported path

That prevents pressure tests from devolving into local workarounds and keeps the feedback loop product-grade.
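One way to keep that discipline mechanical is to refuse any failure record that does not carry one of the six labels. The sketch below uses the category names listed above; the `is_known_class` and `record_failure` helpers and the tab-separated record format are assumptions made for illustration.

```bash
#!/bin/sh
# Sketch only: both helpers are hypothetical, not Ota commands.
# Succeed only for one of the six failure categories.
is_known_class() {
  case "$1" in
    "Ota bug"|"Ota maturity gap"|"contract issue"|"repo baseline issue"|"CI environment limitation"|"intentional unsupported path")
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Usage: record_failure LANE LABEL
# Prints "LANE<TAB>LABEL" for a classified failure.
# Rejects anything unclassified with a message on stderr.
record_failure() {
  if is_known_class "$2"; then
    printf '%s\t%s\n' "$1" "$2"
  else
    echo "unclassified failure in lane '$1': '$2'" >&2
    return 1
  fi
}
```

With this in place, a label such as "flaky" is rejected instead of silently absorbed, which forces the triage decision at the moment the failure is logged.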

What Changed In Practice

Extending from first slice to full pressure testing changed outcomes immediately:

contracts became more scope-honest
mode claims became explicit and testable
CI lanes became more informative, less noisy
agent boundaries became stricter and less ambiguous
Ota product gaps surfaced early enough to fix before PR review

The key lesson is simple and opinionated:

A valid contract is a starting condition. A pressure-tested contract is an operational asset.

Suggested Pressure-Test Baseline

For teams adopting Ota now, this is the baseline I recommend for strong signal without unnecessary complexity:

Verify (bash):

ota validate
ota doctor
ota tasks --use
ota tasks --safe --use
ota execution topology --json

Then add matrix lanes that prove:

native dry-run coverage for declared runnable tasks
container dry-run coverage where mode support is declared
runtime proof for workflows that claim live readiness surfaces
explicit unsupported-host checks where host constraints are intentional

This is the minimum bar that turned “works in one path” into “trustworthy across real usage” for us.
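The baseline commands can be run as a single fail-fast gate so a lane stops at the first broken step. This is a sketch under assumptions: the five commands are the ones listed in this post, but the `run_baseline` wrapper, its ordering as a gate, and its log format are illustrative, and it assumes each command signals failure with a non-zero exit status.

```bash
#!/bin/sh
# Sketch only: run_baseline is a hypothetical wrapper, not an Ota command.
# Runs the baseline commands in order and stops at the first failure.
run_baseline() {
  while IFS= read -r cmd; do
    [ -n "$cmd" ] || continue
    echo "==> $cmd"
    # stdin is redirected so the command cannot consume the command list.
    $cmd </dev/null || { echo "baseline failed at: $cmd"; return 1; }
  done <<'EOF'
ota validate
ota doctor
ota tasks --use
ota tasks --safe --use
ota execution topology --json
EOF
}
```

Running `run_baseline` before the matrix lanes keeps the expensive dry-run and runtime checks from starting against a contract that does not even validate.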
