Stop Tuning the Model. Start Tuning the Harness.

“Harness” is the new buzzword. I explain why you’re already using one and why it’s worth understanding the other options available now.

May 18, 2026

This article was drafted by AI from curated sources I hand-picked, using personalised style guidance, then reviewed and extensively hand-edited. I find it useful for keeping in touch with the dozens of X.com bookmarks I come across every few days. Maybe you do too.

Four teams, four different ways to run agents.

A division of labour is emerging in agent engineering.

The model picks the next token.
The harness (Claude Code, Codex, Opencode, Gemini CLI, pi.dev, etc.) decides everything else: what context survives, what tools get exposed, when the loop terminates, where the code runs, what happens when memory overflows.

Billions have gone to the first job. Let’s look at the second.

Vtrivedy puts it cleanly: “The Harness is a Context Manager on Behalf of the Model… What happens when the context window fills up and who decides? This decision is external to the model”.

Once you internalise this, the design space expands.

Sandbox boundaries, memory compaction, trace introspection, and pause/resume protocols are all runtime decisions that ship before the model ever sees a token.
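To make the split concrete, here is a minimal sketch of a harness loop in Python. Everything in it is an assumption for illustration: the `Harness` class, the `done:` convention, the character budget standing in for a token budget, and the "keep the task plus the most recent steps" compaction rule. It is not how Claude Code, Codex, or any other named harness works. The point is only that tool exposure, loop termination, and compaction all live outside the model.

```python
from dataclasses import dataclass, field
from typing import Callable

MARKER = "[earlier steps compacted by harness]"


@dataclass
class Harness:
    # Hypothetical model interface: takes the context, returns the next action.
    model: Callable[[list[str]], str]
    # Only the tools listed here are exposed to the model.
    tools: dict[str, Callable[[str], str]]
    max_context_chars: int = 200   # stand-in for a token budget
    max_steps: int = 10            # the harness, not the model, ends the loop
    keep_recent: int = 2           # steps that survive compaction
    context: list[str] = field(default_factory=list)

    def compact(self) -> None:
        """Decide what survives when the context is over budget."""
        if sum(len(m) for m in self.context) <= self.max_context_chars:
            return
        task, recent = self.context[0], self.context[-self.keep_recent:]
        dropped = len(self.context) - 1 - len(recent)
        if dropped > 0:
            self.context = [task, MARKER, *recent]

    def run(self, task: str) -> str:
        self.context = [f"task: {task}"]
        for _ in range(self.max_steps):
            self.compact()
            action = self.model(self.context)
            if action.startswith("done:"):
                return action[len("done:"):].strip()
            name, _, arg = action.partition(" ")
            tool = self.tools.get(name)
            result = tool(arg) if tool else f"error: tool {name!r} is not exposed"
            self.context.append(f"{action} -> {result}")
        return "stopped: step budget exhausted"
```

Swap the compaction rule or the step budget and the same model behaves differently, which is the whole argument of this post in about forty lines.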

From isolated tools to isolated agents

Larsen Cundric at Browser Use wrote up their sandbox evolution, focusing on what gets isolated.

They started by

isolating the tool (agent on your infra, dangerous calls hit a sandbox) and shifted to
isolating the agent (whole agent runs in a sandbox with zero secrets, talks to a control plane that holds credentials).

That second pattern makes the agent disposable. A killed agent loses nothing, a restart costs nothing, scale-out is cheap, and there are no secrets sitting in its memory for a prompt injection to exfiltrate.

Credentials live in one place that the agent can only reach by URL.

The production runtime is Unikraft micro-VMs that boot in under a second and scale to zero when idle (remember “serverless”?).

The same Docker image runs locally for agent evaluations (a term mixing build and run time testing).

Boot times are important, but the architecture more so:

the agent owns nothing, the control plane owns everything, and
the protocol between them becomes the thing engineers optimise.
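A rough sketch of that ownership split, assuming a hypothetical `ControlPlane` and `SandboxedAgent` of my own invention (this is not Browser Use's code). In production the boundary would be a network call to a URL; here a plain function call stands in for it, and `_upstream` is a placeholder for the real credentialed request.

```python
import secrets
from typing import Callable


class ControlPlane:
    """Holds every credential; agents only ever hold a revocable session token."""

    def __init__(self, credentials: dict[str, str]):
        self._credentials = credentials
        self._sessions: dict[str, set[str]] = {}

    def open_session(self, allowed_services: list[str]) -> str:
        token = secrets.token_hex(8)
        self._sessions[token] = set(allowed_services)
        return token

    def revoke(self, session: str) -> None:
        self._sessions.pop(session, None)

    def call(self, session: str, service: str, payload: str) -> str:
        if service not in self._sessions.get(session, set()):
            raise PermissionError(f"session may not call {service!r}")
        api_key = self._credentials[service]  # never leaves the control plane
        return self._upstream(service, api_key, payload)

    def _upstream(self, service: str, api_key: str, payload: str) -> str:
        # Placeholder for the real authenticated request.
        return f"{service} handled {payload!r}"


class SandboxedAgent:
    """Owns nothing: no secrets, no state worth saving, just a session token."""

    def __init__(self, session: str, call: Callable[[str, str, str], str]):
        self.session = session
        self.call = call

    def act(self, payload: str) -> str:
        return self.call(self.session, "search", payload)
```

Killing an agent is then just `revoke(session)` plus throwing away the sandbox, and a prompt-injected agent has nothing to leak beyond a token that only works through the control plane.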

Figure: Browser Use's Pattern 2. The agent runs sandboxed with no credentials; all secrets are held by the control plane.

Memory as a layered system, not a vector pile

Tencent open-sourced TencentDB Agent Memory.

It integrates with OpenClaw, with big purported boosts on metrics I’ve never heard of but that sound important 🙂:

61% fewer tokens on WideSearch,
a 51% relative pass-rate lift on the same benchmark, and
accuracy lifted from 48% to 76%.

Most memory systems break conversation into chunks and dump them in a vector store, leaving recall to grope blindly across disconnected fragments. Classic RAG.

Tencent rejects that approach and builds two pillars instead:

symbolic short-term memory (raw tool outputs at the bottom, step summaries in the middle, a Mermaid canvas at the top), and
long-term memory layered into personas and scenes.

The Mermaid canvas makes a LOT of sense. I’ve personally had a lot of success with Mermaid diagrams because they’re extremely efficient at communicating architectural concepts, which is why Ceetrix uses them a lot. Using them to track state is genius.

@berryxia’s writeup calls out the same idea: structured task maps generated as Mermaid graphs, so the agent always knows which step it’s on in a 30-step workflow. Coherence across long workflows comes from a structured map of the plan, not from cramming more turns into the prompt.
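Here is how I picture the three short-term layers, as a minimal Python sketch. The `Step` and `ShortTermMemory` names, the status values, and the rule for what goes into the prompt are all my assumptions, not TencentDB Agent Memory's actual schema or API. Raw tool output is stored but kept out of the prompt, step summaries sit in the middle, and the Mermaid graph on top tells the agent where it is in the plan.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    status: str = "todo"   # todo | doing | done (hypothetical status values)
    summary: str = ""      # middle layer: short step summary
    raw_output: str = ""   # bottom layer: full tool output, not sent to the model


@dataclass
class ShortTermMemory:
    steps: list[Step] = field(default_factory=list)

    def record(self, index: int, raw_output: str, summary: str) -> None:
        """Finish a step and mark the next one as in progress."""
        step = self.steps[index]
        step.raw_output, step.summary, step.status = raw_output, summary, "done"
        if index + 1 < len(self.steps):
            self.steps[index + 1].status = "doing"

    def canvas(self) -> str:
        """Top layer: the plan as a Mermaid flowchart with per-step status."""
        lines = ["graph TD"]
        for i, step in enumerate(self.steps):
            lines.append(f'    s{i}["{step.name} ({step.status})"]')
            if i:
                lines.append(f"    s{i - 1} --> s{i}")
        return "\n".join(lines)

    def prompt_view(self) -> str:
        """What the model sees: canvas plus summaries, never the raw outputs."""
        summaries = [f"- {s.name}: {s.summary}" for s in self.steps if s.summary]
        return "\n".join([self.canvas(), *summaries])
```

A 30-step workflow costs roughly 60 short lines of canvas however much raw output the tools produced, which is where I'd expect most of the token savings to come from.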

The usual vendor benchmark bias applies, of course. Still, the reported gains of 9.93% on SWE-bench, a gold standard for code generation quality, and 7.95% on AA-LCR can be absolutely huge depending on what the model is already doing.

Separately, the token reduction is more valuable than you’d think: with current models, retrieval from context is inconsistent and degrades very quickly as the context gets bigger.

Let the harness optimise itself

Sam Hogan’s HALO is the most contrarian piece in this batch. The premise: humans are bad harness designers, so use a reasoning model to do it.

HALO takes execution traces from a deployed agent, identifies failure modes that only show up in aggregate, suggests harness changes, and a coding agent applies them.
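The "only show up in aggregate" part is the bit worth sketching. Below is a minimal Python version of the aggregation step alone, assuming a made-up trace format (each run is a list of spans with `tool` and `error` keys) and a hypothetical `aggregate_failures` helper. It is not HALO's code, and it leaves out the interesting parts: the reasoning model that proposes harness changes and the coding agent that applies them.

```python
from collections import Counter


def aggregate_failures(traces: list[list[dict]], min_share: float = 0.2):
    """Return (tool, error, share_of_runs) for failures seen in many runs.

    Each failure is counted once per run, so one noisy run cannot
    masquerade as a systemic problem.
    """
    runs_hit: Counter = Counter()
    for run in traces:
        seen = {(span["tool"], span["error"]) for span in run if span.get("error")}
        runs_hit.update(seen)
    total = len(traces)
    return [
        (tool, error, count / total)
        for (tool, error), count in runs_hit.most_common()
        if count / total >= min_share
    ]
```

A single trace with a shell timeout looks like bad luck. The same timeout in half of all runs is a harness problem, for example a tool timeout that is set too low, and that is the kind of finding you would hand to the optimiser.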

Reported gains include 10%+ improvements on AppWorld, Terminal-Bench, and Finance Bench.

The strongest result is on Terminal-Bench, where they pushed Gemini Flash toward frontier coding harness performance through harness changes alone. This is pretty remarkable and I’m keen to replicate it.

Hogan says: “the harness is becoming an optimizable service layer: comparable in importance to the model itself, and increasingly measurable as its own object of engineering.”

Same way Kubernetes split schedulers, networks, and storage from application logic, the agent harness is splitting out from the model wrapper.

In theory HALO works with any harness you have the code for, e.g. Opencode, pi.dev, Gemini CLI, or the OpenAI Agents SDK.

Figure: HALO's reported harness-only improvements across coding and agent benchmarks.

Pause for a week, resume without amnesia

Google’s ADK writeup, shared by Richard Seroter, takes a different cut: durability. The walkthrough builds a new-hire onboarding agent that runs for two weeks, pauses for days while a human signs documents, then resumes without making up stuff.

Stateless chat-loop agents fail over long timeframes in three ways:

context pollution after hundreds of turns,
token costs ballooning on every replay, and
the model inventing approvals that never happened when resumed cold.

Solution: explicit, durable state decoupled from raw chat history, a structural change to where the agent’s truth lives, rather than a larger window to stuff more conversation turns into.

What ties them together

Four teams, four pieces of the same machine.

Sandboxing (Browser Use),
memory compaction (Tencent),
trace-driven self-optimisation (HALO),
durable pause/resume (ADK).

They all assume the model is fixed, and the rest of the system is engineerable.

Not entirely true (models still improve), but the marginal hour of effort on a long-running agent is now better spent on runtime than on prompt tuning.

The teams above are making that bet explicitly.

🔮 Prediction: the next wave of agent benchmark gains comes from harness changes, not model upgrades. HALO’s frame (the harness as an optimisable service layer) becomes the dominant abstraction within twelve months. The runtime designers, not the prompt engineers, will be the ones quietly moving SOTA numbers on AppWorld and Terminal-Bench.

Making AI Agents with Julian Harris
