Pressure-testing Ota on OpenHands: from setup fragmentation to execution governance
How OpenHands pressure-testing hardened Ota’s contract and matrix behavior across native, container, and runtime-proof lanes.
Overview
OpenHands was a useful repo to pressure-test because it is not a toy. It has a Python backend, a frontend dev server, a standalone openhands-ui package, a documented Docker runtime path, and enough setup branching to create false confidence if you only read the docs.
You can usually get OpenHands working by reading Development.md carefully and filling the gaps from habit. That is not the bar. The bar is whether local development, CI, and an agent can choose the same path without guessing.
That was the real question behind this pressure test:
Can one contract make path selection explicit enough that a human, CI job, and coding agent all hit the same operating surface on purpose?
Why OpenHands was a strong test
OpenHands already has strong docs. The problem was not missing documentation. The problem was that the repo's real operating model was spread across too many places:
- Development.md explained the happy path
- Makefile carried real execution behavior
- workflow YAML encoded what CI actually trusted
- maintainer memory filled in the rest
That split matters because OpenHands has multiple valid ways to run:
- a host-native contributor path with Python, Node, Poetry, and INSTALL_DOCKER=0
- a documented Docker wrapper path through make docker-run
- a packaged container path that is better modeled as a first-class container service
- a separate openhands-ui package that uses Bun instead of the main app toolchain
That is exactly where readiness drift starts. Nothing is obviously broken, but the repo still makes people infer too much.
Where the repo actually fought back
Three things made this a real test instead of a cosmetic one.
First, "run the app" was not one thing. The host-native GUI loop and the Docker runtime loop are both legitimate, but they serve different needs. If they are collapsed into one vague app workflow, the contract becomes decorative instead of operational.
Second, openhands-ui is not just "more frontend." It has its own Bun-based build surface and lockfile. Treating it like part of the main frontend/ path would make the contract easier to write and less truthful.
Third, some of the repo's important assumptions live in environment toggles and context, not just task names. INSTALL_DOCKER=0, RUNTIME=local, host-only constraints, and container execution mode are all easy for a maintainer to remember and easy for an agent to miss.
That is the class of repo where Ota has to do more than list commands. It has to make execution intent legible.
What changed
The result was a narrower but much more explicit readiness contract:
- contexts that separate host-dev, docker-host, host-ui, and the ephemeral app container path
- workflows for app, backend, frontend, app:docker, app:container, and ui-package
- explicit surfaces for backend, frontend, and packaged-web
- a matrix that checks contract validity, dry-run behavior, real task execution, and runtime proof without pretending those are the same signal
- pinned Ota install and contract minimum version on v1.6.18
Links:
- PR: OpenHands/OpenHands#14604
- Contract: ota.yaml
- Matrix workflow: test-ota-contract-matrix.yml
- Latest full green matrix run: #26743766228
- Earlier baseline green run: #26725986085
The contract decisions that mattered
The important part was not "we added more workflows." The important part was drawing boundaries that match how the repo actually behaves.
- app is the canonical local GUI workflow. It follows the documented self-development path and exposes backend and frontend readiness separately.
- app:docker stays tied to the repo's own make docker-run wrapper. That matters because it preserves the documented runtime path instead of replacing it with Ota opinion.
- app:container models the packaged image as a direct Ota-managed container service. That gives a cleaner surface when you want the packaged runtime itself, not the repo wrapper around it.
- ui-package stands on its own because the Bun-based openhands-ui package is a distinct build surface, not just an implementation detail of the main frontend loop.
Those distinctions look fussy until you try to automate the repo. Then they are the difference between "agent can run something" and "agent can choose the right thing."
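To make "choose the right thing" concrete, here is a minimal Python sketch of explicit path selection. It is not Ota's implementation or the ota.yaml schema: the intent keys and the choose_workflow helper are invented for illustration, and only the workflow names come from the contract described above. The point is that an unknown intent fails loudly instead of falling back to a default path.

```python
# Hypothetical lookup table. The intent keys are invented for illustration;
# the workflow names are the ones declared in the contract.
WORKFLOW_FOR_INTENT = {
    "native-gui": "app",
    "docker-wrapper": "app:docker",
    "packaged-image": "app:container",
    "ui-package-build": "ui-package",
}


def choose_workflow(intent: str) -> str:
    """Return the declared workflow for an intent, or fail loudly."""
    try:
        return WORKFLOW_FOR_INTENT[intent]
    except KeyError:
        known = ", ".join(sorted(WORKFLOW_FOR_INTENT))
        raise ValueError(
            f"unknown intent {intent!r}; expected one of: {known}"
        ) from None
```

A caller that asks for "docker-wrapper" gets app:docker every time, and a caller that asks for something vague gets an error listing the valid choices rather than a guess.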
Before and after
Before:
- follow docs and infer which path you actually mean
- manually reconcile host-native, Docker-wrapper, and packaged-container behavior
- treat CI workflow behavior as proof of intent after the fact
- rely on maintainer memory for which tasks are safe to dry-run, safe to execute, or meaningful for agents
After:
- choose a declared workflow that already encodes the intended path
- dry-run that path before mutating anything
- validate readiness through named surfaces instead of log inspection
- keep agent execution bounded by declared tasks, modes, and safety labels
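The last point can be pictured as a small gate in front of agent execution. The following Python sketch is an assumption-heavy illustration, not how Ota enforces it: the Mode enum, the DECLARED_MODES table, and the placeholder task names are all hypothetical and do not come from the real contract.

```python
from enum import Enum


class Mode(Enum):
    DRY_RUN = "dry-run"
    EXECUTE = "execute"


# Hypothetical declarations: placeholder task names mapped to the modes
# an agent is allowed to use. None of these come from the real ota.yaml.
DECLARED_MODES = {
    "example:build": {Mode.DRY_RUN, Mode.EXECUTE},
    "example:publish": {Mode.DRY_RUN},
}


def agent_may_run(task: str, mode: Mode) -> bool:
    """Allow a run only if the task is declared and the mode is permitted."""
    return mode in DECLARED_MODES.get(task, set())
```

Under this sketch, an undeclared task is rejected in every mode, and a task declared as dry-run only can be inspected but never executed by the agent.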
Why the matrix had to be opinionated
One of the most useful outcomes here was not in ota.yaml. It was in the matrix design.
The generic "execute non-internal non-runtime tasks" loop intentionally excludes category: test.
Why:
- OpenHands has aggregate test surfaces that can fail for reasons unrelated to basic repo readiness
- the generic execution lane should prove that the contract is runnable, not collapse into a catch-all CI job
- deterministic signal is more valuable than noisy completeness in a pressure matrix
- tests still run in dedicated verification lanes like lint:backend, lint:frontend, and test:backend
That split matters. If every lane means everything, the matrix stops teaching you anything. Here the lanes do different jobs on purpose: validation, dry-run topology, safe execution coverage, and runtime proof.
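The selection rule for the generic execution lane can be sketched as a plain filter. This is a hedged Python illustration, not the matrix workflow itself: the Task fields, the sample task names other than test:backend, and the category values are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    category: str           # hypothetical values such as "build" or "test"
    internal: bool = False  # helper task, not meant to be run directly
    runtime: bool = False   # long-running task covered by the runtime-proof lane


def generic_execution_lane(tasks):
    """Keep tasks that are not internal, not runtime, and not category 'test'."""
    return [
        t for t in tasks
        if not t.internal and not t.runtime and t.category != "test"
    ]


# Sample data for illustration only; these are not the real contract tasks.
SAMPLE_TASKS = [
    Task("build:frontend", "build"),
    Task("test:backend", "test"),
    Task("app", "run", runtime=True),
    Task("_bootstrap", "setup", internal=True),
]

selected = [t.name for t in generic_execution_lane(SAMPLE_TASKS)]
```

With the sample data, only build:frontend survives the filter. The test task is left to its dedicated verification lane and the runtime task to the runtime-proof lane, which keeps each lane's signal distinct.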
Takeaway
OpenHands was useful because it forced Ota to be specific.
The repo already had commands. It already had docs. It already had CI. What it did not have was one machine-checkable contract that said:
- this is the native contributor path
- this is the documented Docker path
- this is the packaged container path
- this is the separate UI package path
- these are the surfaces that prove each one is actually ready
That is the difference between command wrapping and execution governance.
OpenHands did not just give Ota a nice demo repo. It forced the contract to carry real operational distinctions that contributors and agents usually keep in their heads. That is exactly the kind of pressure test Ota needs.