Engineering note · 2026-06-01 10:15 UTC

Pressure-testing Ota on OpenHands: from setup fragmentation to execution governance

How OpenHands pressure-testing hardened Ota’s contract and matrix behavior across native, container, and runtime-proof lanes.

Tags: pressure-testing, openhands, repo-readiness, execution-governance

Bobai Kato

Overview

OpenHands was a useful repo to pressure-test because it is not a toy. It has a Python backend, a frontend dev server, a standalone openhands-ui package, a documented Docker runtime path, and enough setup branching to create false confidence if you only read the docs.

You can usually get OpenHands working by reading Development.md carefully and filling the gaps from habit. That is not the bar. The bar is whether local development, CI, and an agent can choose the same path without guessing.

That was the real question behind this pressure test:

Can one contract make path selection explicit enough that a human, CI job, and coding agent all hit the same operating surface on purpose?

Why OpenHands was a strong test

OpenHands already has strong docs. The problem was not missing documentation. The problem was that the repo's real operating model was spread across too many places:

Development.md explained the happy path
Makefile carried real execution behavior
workflow YAML encoded what CI actually trusted
maintainer memory filled in the rest

That split matters because OpenHands has multiple valid ways to run:

a host-native contributor path with Python, Node, Poetry, and INSTALL_DOCKER=0
a documented Docker wrapper path through make docker-run
a packaged container path that is better modeled as a first-class container service
a separate openhands-ui package that uses Bun instead of the main app toolchain

That is exactly where readiness drift starts. Nothing is obviously broken, but the repo still makes people infer too much.

Where the repo actually fought back

Three things made this a real test instead of a cosmetic one.

First, "run the app" was not one thing. The host-native GUI loop and the Docker runtime loop are both legitimate, but they serve different needs. If they are collapsed into one vague app workflow, the contract becomes decorative instead of operational.

Second, openhands-ui is not just "more frontend." It has its own Bun-based build surface and lockfile. Treating it like part of the main frontend/ path would make the contract easier to write and less truthful.

Third, some of the repo's important assumptions live in environment toggles and context, not just task names. INSTALL_DOCKER=0, RUNTIME=local, host-only constraints, and container execution mode are all easy for a maintainer to remember and easy for an agent to miss.

That is the class of repo where Ota has to do more than list commands. It has to make execution intent legible.
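
One way to picture "legible execution intent" is to write the toggles down as data and compare them against the real environment. The sketch below is illustrative only: ExecutionContext, HOST_DEV, and toggle_mismatches are hypothetical names, not Ota's schema, and pairing RUNTIME=local with the host-dev context is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExecutionContext:
    """Hypothetical record of what a run path assumes (not Ota's schema)."""

    name: str
    required_env: dict[str, str] = field(default_factory=dict)
    host_only: bool = False


# Assumption: the host-native path needs both toggles named in this note.
HOST_DEV = ExecutionContext(
    name="host-dev",
    required_env={"INSTALL_DOCKER": "0", "RUNTIME": "local"},
    host_only=True,
)


def toggle_mismatches(
    context: ExecutionContext, actual_env: dict[str, str]
) -> dict[str, str]:
    """Return each required toggle that is missing or set to another value."""
    return {
        key: expected
        for key, expected in context.required_env.items()
        if actual_env.get(key) != expected
    }
```

Calling toggle_mismatches(HOST_DEV, dict(os.environ)) before a run would surface a forgotten toggle as data instead of as a confusing failure later on.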

What changed

The result was a narrower but much more explicit readiness contract:

contexts that separate host-dev, docker-host, host-ui, and the ephemeral app container path
workflows for app, backend, frontend, app:docker, app:container, and ui-package
explicit surfaces for backend, frontend, and packaged-web
a matrix that checks contract validity, dry-run behavior, real task execution, and runtime proof without pretending those are the same signal
an Ota install and contract minimum version pinned to 1.6.20, the exact feature level the branch depends on, so the matrix points directly at the released 1.6.20 line
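
The version pin is worth a small illustration, because a minimum-version check is easy to get wrong with plain string comparison. The sketch below is a generic numeric comparison, not Ota's own implementation. parse_version and meets_minimum are hypothetical helpers, and the code assumes purely numeric dotted versions with no pre-release suffixes.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Split a dotted version such as '1.6.20' into integers for comparison."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, minimum: str = "1.6.20") -> bool:
    """True when the installed version is at or above the pinned minimum."""
    return parse_version(installed) >= parse_version(minimum)


# String comparison would wrongly rank '1.6.9' above '1.6.20' because '9' > '2'.
# Comparing integer tuples avoids that trap.
```

With this approach meets_minimum("1.6.9") is false while meets_minimum("1.7.0") is true, which is the behavior a pinned feature level needs.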

The contract decisions that mattered

The important part was not "we added more workflows." The important part was drawing boundaries that match how the repo actually behaves.

app is the canonical local GUI workflow. It follows the documented self-development path and exposes backend and frontend readiness separately.
app:docker stays tied to the repo's own make docker-run wrapper. That matters because it preserves the documented runtime path instead of replacing it with an Ota-specific opinion.
app:container models the packaged image as a direct Ota-managed container service. That gives a cleaner surface when you want the packaged runtime itself, not the repo wrapper around it.
ui-package stands on its own because the Bun-based openhands-ui package is a distinct build surface, not just an implementation detail of the main frontend loop.

Those distinctions look fussy until you try to automate the repo. Then they are the difference between "agent can run something" and "agent can choose the right thing."
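
The difference between running something and choosing the right thing can be sketched as a lookup that refuses to guess. This is a hypothetical model, not how Ota resolves workflows. The workflow-to-context pairing is an assumption, "app-container" is a placeholder name for the ephemeral app container context, and the backend and frontend workflows are left out for brevity.

```python
# Assumed pairing of declared workflows to contexts, for illustration only.
WORKFLOW_CONTEXT = {
    "app": "host-dev",
    "app:docker": "docker-host",
    "app:container": "app-container",  # placeholder context name
    "ui-package": "host-ui",
}


def resolve_context(workflow: str) -> str:
    """Return the context for a declared workflow, or fail loudly."""
    try:
        return WORKFLOW_CONTEXT[workflow]
    except KeyError:
        declared = ", ".join(sorted(WORKFLOW_CONTEXT))
        raise KeyError(
            f"undeclared workflow {workflow!r}; choose one of: {declared}"
        ) from None
```

A vague request such as "run the app" raises an error that lists the declared options, so a human, a CI job, and an agent all have to name the path they mean.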

Before and after

Before:

follow docs and infer which path you actually mean
manually reconcile host-native, Docker-wrapper, and packaged-container behavior
treat CI workflow behavior as proof of intent after the fact
rely on maintainer memory for which tasks are safe to dry-run, safe to execute, or meaningful for agents

After:

choose a declared workflow that already encodes the intended path
dry-run that path before mutating anything
validate readiness through named surfaces instead of log inspection
keep agent execution bounded by declared tasks, modes, and safety labels
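
The "after" behavior can be approximated by a gate that decides, per task, whether an agent may execute, may only dry-run, or must refuse. The sketch below is a minimal model under stated assumptions. The safety labels "read-only" and "mutating", the Task shape, and the example:check task are hypothetical. Only make docker-run comes from the repo's documented path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical declared task with a safety label."""

    name: str
    command: tuple[str, ...]
    safety: str  # assumed labels: "read-only" or "mutating"


DECLARED = {
    # Placeholder task, invented for the example.
    "example:check": Task("example:check", ("echo", "checking"), "read-only"),
    # make docker-run is the documented wrapper; its label here is assumed.
    "app:docker": Task("app:docker", ("make", "docker-run"), "mutating"),
}


def decide(task_name: str, agent_allowed: frozenset = frozenset({"read-only"})) -> str:
    """Return 'execute', 'dry-run', or 'refuse' for an agent request."""
    task = DECLARED.get(task_name)
    if task is None:
        return "refuse"  # undeclared work is out of bounds
    return "execute" if task.safety in agent_allowed else "dry-run"


def describe(task_name: str) -> str:
    """Render what a dry run would do without running anything."""
    return "[dry-run] " + " ".join(DECLARED[task_name].command)
```

The point of the design is that the default answer for anything undeclared is refusal, and the default for anything mutating is a dry run.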

Why the matrix had to be opinionated

One of the most useful outcomes here was not in ota.yaml. It was in the matrix design.

The generic "execute non-internal non-runtime tasks" loop intentionally excludes category: test.

Why:

OpenHands has aggregate test surfaces that can fail for reasons unrelated to basic repo readiness
the generic execution lane should prove that the contract is runnable, not collapse into a catch-all CI job
deterministic signal is more valuable than noisy completeness in a pressure matrix
tests and lint checks still run in dedicated verification lanes such as lint:backend, lint:frontend, and test:backend

That split matters. If every lane means everything, the matrix stops teaching you anything. Here the lanes do different jobs on purpose: validation, dry-run topology, safe execution coverage, and runtime proof.
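
The generic execution lane described above is just a filter over declared tasks. The sketch below is a hypothetical model of it. The field names internal, runtime, and category are assumptions about how tasks might be tagged, and every task other than test:backend and app:docker is a placeholder.

```python
# Hypothetical task metadata; only the category "test" rule comes from this note.
TASKS = [
    {"name": "install:example", "category": "setup", "internal": False, "runtime": False},
    {"name": "_helper:example", "category": "setup", "internal": True, "runtime": False},
    {"name": "app:docker", "category": "run", "internal": False, "runtime": True},
    {"name": "test:backend", "category": "test", "internal": False, "runtime": False},
]


def generic_execution_lane(tasks: list[dict]) -> list[str]:
    """Select non-internal, non-runtime tasks, excluding category 'test'."""
    return [
        task["name"]
        for task in tasks
        if not task["internal"]
        and not task["runtime"]
        and task["category"] != "test"
    ]


def dedicated_test_lane(tasks: list[dict]) -> list[str]:
    """Tests are not dropped; they are routed to their own lane."""
    return [task["name"] for task in tasks if task["category"] == "test"]
```

Keeping the two selectors separate mirrors the argument in this section: the generic lane proves the contract is runnable, and the test lane carries the noisier signal on its own.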

Takeaway

OpenHands was useful because it forced Ota to be specific.

The repo already had commands. It already had docs. It already had CI. What it did not have was one machine-checkable contract that said:

this is the native contributor path
this is the documented Docker path
this is the packaged container path
this is the separate UI package path
these are the surfaces that prove each one is actually ready

That is the difference between command wrapping and execution governance.

OpenHands did not just give Ota a nice demo repo. It forced the contract to carry real operational distinctions that contributors and agents usually keep in their heads. That is exactly the kind of pressure test Ota needs.
