Background Coding Agent System
Let me walk you through how I'd design a background coding agent system. By "background" I mean the async pattern that Devin, Codex, Cursor Background Agents, and Claude Code in autonomous mode have all converged on: a developer files an issue or describes a change, the system disappears for somewhere between five and forty minutes, and the developer comes back to a pull request they can merge, send back with comments, or close. I'll cover business and ML objectives, the high-level architecture, the agent team itself, the sandbox layer that does most of the heavy lifting, repo onboarding, training, verification, evaluation, and robustness. The thing worth flagging up front is that this design point looks like a chatbot problem in the pitch deck and turns out to be a multi-tenant CI problem the moment you start building it. Most of the design surprises come from that mismatch.
Solution Walkthrough
Business Objective
The business case is that developers lose a remarkable amount of time, compounded across a team, to tasks that don't require deep design thinking: small bug fixes from triaged issues, test coverage for newly introduced functions, dependency bumps when a CVE drops, mechanical refactors after an API change, renames, and the error handling that should have been there originally. None of these are interesting work. All of them keep getting deprioritized. A background coding agent lets a team offload the boring tail of their backlog, walk away, and come back to a stream of reviewable PRs that they can either merge or close. The unit economics work out somewhere between one and five dollars per task today, which is well below the loaded cost of a developer-hour even after you add the ten minutes a developer spends reviewing each PR.
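To make the unit economics concrete, here is a back-of-the-envelope sketch. The one-to-five-dollar agent cost and the ten-minute review come from the paragraph above; the loaded hourly rate and the time a developer would spend doing the task by hand are illustrative assumptions, not measured figures, and `TaskEconomics` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEconomics:
    agent_cost_usd: float          # per-task agent spend (the text cites roughly 1 to 5 dollars)
    review_minutes: float          # reviewer time per PR (the text assumes ten minutes)
    loaded_hourly_rate_usd: float  # ASSUMPTION: illustrative loaded cost of a developer-hour
    manual_minutes: float          # ASSUMPTION: time to do the task by hand

    def agent_path_cost(self) -> float:
        """Agent spend plus the reviewer's time, in dollars."""
        return self.agent_cost_usd + self.review_minutes * self.loaded_hourly_rate_usd / 60

    def manual_path_cost(self) -> float:
        """Cost of a developer doing the whole task themselves, in dollars."""
        return self.manual_minutes * self.loaded_hourly_rate_usd / 60

    def savings(self) -> float:
        return self.manual_path_cost() - self.agent_path_cost()

    def break_even_manual_minutes(self) -> float:
        """Manual effort above which handing the task to the agent is cheaper."""
        return self.agent_path_cost() * 60 / self.loaded_hourly_rate_usd


# Pessimistic end of the agent cost range, with assumed rate and manual effort.
example = TaskEconomics(
    agent_cost_usd=5.0,
    review_minutes=10,
    loaded_hourly_rate_usd=120.0,
    manual_minutes=45,
)
print(example.agent_path_cost())            # 25.0
print(example.manual_path_cost())           # 90.0
print(example.break_even_manual_minutes())  # 12.5
```

Under these assumed numbers the agent path wins on any task that would take a developer more than about twelve minutes by hand. The calculation only counts PRs that are worth reviewing; it says nothing yet about the cost of a bad one.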
What gets lost when people sketch this on a whiteboard is that the constraint isn't volume. The constraint is asymmetric review cost. A bad PR is strictly worse than no PR, because once it lands in someone's review queue it's burning attention regardless of whether they merge it. If the agent opens a PR with a sloppy diff or a fix that misunderstands the actual bug, the reviewer spends ten minutes figuring out why before closing it, and they trust the system a little less the next time. Two or three of those in a row and the team turns the feature off. So the product has to optimize for precision over recall: it's better to silently skip thirty percent of tasks the agent doesn't think it can do well than to open thirty percent of PRs that get rejected. This shapes basically every downstream design decision.
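One way to picture "precision over recall" is as a confidence cutoff chosen from past outcomes. The sketch below assumes the system records a self-assessed confidence score for each task and whether the resulting PR was merged; neither the score nor the function names are part of any real product, they are placeholders for illustration.

```python
from typing import List, Optional, Tuple

# (agent confidence in [0, 1], whether the resulting PR was merged)
History = List[Tuple[float, bool]]


def pick_threshold(history: History, target_precision: float) -> Optional[float]:
    """Lowest confidence cutoff whose accepted set still meets the target merge rate.

    Returns None when no cutoff reaches the target, meaning the agent
    should not be opening PRs for this repository yet.
    """
    ranked = sorted(history, key=lambda pair: pair[0], reverse=True)
    best = None
    merged = 0
    for i, (confidence, was_merged) in enumerate(ranked):
        merged += was_merged
        # Only evaluate a cutoff once every task sharing this confidence is counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != confidence
        if last_of_tie and merged / (i + 1) >= target_precision:
            best = confidence
    return best


def coverage(history: History, threshold: float) -> float:
    """Fraction of tasks the agent would still attempt at this cutoff."""
    return sum(1 for confidence, _ in history if confidence >= threshold) / len(history)


# Hypothetical outcomes from seven past tasks.
past = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, False),
]
cutoff = pick_threshold(past, target_precision=0.9)
print(cutoff)  # 0.8
```

In this made-up history the agent attempts only three of seven tasks at the chosen cutoff, and that is the trade the section argues for: skipped tasks cost nothing, rejected PRs cost reviewer trust.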
The other piece of business context that matters is the async interaction model. Background means the user is not in front of the keyboard waiting. They've submitted the task and walked off to do something else. This sounds like it should make the system easier, since there is no latency budget, but it actually makes it harder, because the system has to make every consequential decision on its own. There's no "hey, should I refactor this differently?" mid-flight. The agent commits to its plan, opens a PR, and the user finds out what it did when they come back.
ML Objective
As an ML problem this is sequential decision-making over genuinely long horizons. The coder agent typically runs somewhere between twenty and a hundred inner-loop steps per task (read a file, edit a file, run a test, look at the failure, edit again), and one bad branch decision early on can sink the whole trajectory. The action space at each step is large: the agent can call any of a dozen tools with arbitrary arguments, the most powerful of which (Edit, Bash) can do almost anything to the repository. The verification signal exists but is partial. We have hard signals (do the tests pass, does the type checker accept the diff, do the linters complain) and we have soft signals (does the diff look like something a human would write, did we touch unrelated files, did we leave debug prints behind).
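The split between hard and soft signals can be sketched as a small report object. This is a minimal illustration, not the system's actual verifier: the hard checks are assumed to arrive as booleans from the repo's own test, type-check, and lint commands, and the two soft checks shown (unrelated files, leftover debug prints) are crude heuristics over a unified diff. All names here are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Heuristic: an added diff line (not a "+++" file header) containing a debug call.
DEBUG_CALL = re.compile(r"^\+(?!\+\+).*\b(?:print|console\.log|breakpoint)\(")


def touched_files(diff: str) -> List[str]:
    """Paths named in the '+++ b/<path>' headers of a unified diff."""
    prefix = "+++ b/"
    return [line[len(prefix):] for line in diff.splitlines() if line.startswith(prefix)]


def soft_warnings(diff: str, expected_prefixes: List[str]) -> List[str]:
    """Soft signals: files outside the expected scope and leftover debug output."""
    warnings = []
    for path in touched_files(diff):
        if not any(path.startswith(p) for p in expected_prefixes):
            warnings.append(f"unrelated file touched: {path}")
    for line in diff.splitlines():
        if DEBUG_CALL.search(line):
            warnings.append(f"debug output left behind: {line[1:].strip()}")
    return warnings


@dataclass
class VerificationReport:
    tests_pass: bool      # hard signal
    typecheck_pass: bool  # hard signal
    lint_clean: bool      # hard signal
    soft: List[str] = field(default_factory=list)

    @property
    def hard_ok(self) -> bool:
        return self.tests_pass and self.typecheck_pass and self.lint_clean


SAMPLE_DIFF = """\
--- a/src/parser.py
+++ b/src/parser.py
@@ -10,3 +10,4 @@
 def parse(text):
+    print("DEBUG", text)
     return text.strip()
--- a/README.md
+++ b/README.md
@@ -1 +1 @@
-Old title
+New title
"""

report = VerificationReport(
    tests_pass=True,
    typecheck_pass=True,
    lint_clean=True,
    soft=soft_warnings(SAMPLE_DIFF, expected_prefixes=["src/", "tests/"]),
)
print(report.hard_ok, report.soft)
```

Here every hard check passes and the diff is still not something you would want in a review queue, which is why the soft signals need their own place in the report.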
What makes this genuinely hard is not the average task. Most tasks sit roughly at the difficulty of the easier SWE-bench-Verified items, and modern coder models handle those at decent rates. What makes it hard is the long tail of repo idiosyncrasies. Each codebase has its own build system, its own conventions, its own internal frameworks the agent has never seen, its own ways of running tests, its own gotchas about which directories the IDE indexes versus the test runner. Generic coder models that perform well on public benchmarks often crater on a real internal repo because the surface area of "stuff specific to this codebase" is vast. So the design has to assume the agent will need substantial help to be useful on any given repo. We bake that assumption into the architecture.
And then there's the asymmetric cost I mentioned: a wrong action lands in a PR, and a wrong PR costs human attention. This rules out a lot of strategies that work fine in other agent contexts. We can't be aggressive on exploration. We can't ship "we tried, it didn't quite work, here's what we got" as a finished result. Either the agent produces something worth reviewing or it has to bow out cleanly with a structured explanation of why.
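"Bow out cleanly with a structured explanation" suggests that a run should end in exactly one of two typed outcomes, with no partial-result PR in between. The sketch below is one hypothetical shape for that contract; the reason codes, field names, and the ordering of checks in `finalize` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple, Union


class AbstainReason(Enum):
    VERIFICATION_FAILED = "verification_failed"
    LOW_CONFIDENCE = "low_confidence"


@dataclass(frozen=True)
class ReadyForReview:
    task_id: str
    branch: str


@dataclass(frozen=True)
class Abstention:
    task_id: str
    reason: AbstainReason
    explanation: str
    evidence: Tuple[str, ...] = ()

    def to_dict(self) -> dict:
        """Plain-dict form, suitable for logging or posting back on the issue."""
        return {
            "task_id": self.task_id,
            "reason": self.reason.value,
            "explanation": self.explanation,
            "evidence": list(self.evidence),
        }


Outcome = Union[ReadyForReview, Abstention]


def finalize(
    task_id: str,
    branch: str,
    failing_checks: Tuple[str, ...],
    confidence: float,
    threshold: float,
) -> Outcome:
    """Either hand over a branch worth reviewing or abstain with a reason."""
    if failing_checks:
        return Abstention(
            task_id,
            AbstainReason.VERIFICATION_FAILED,
            "Hard checks did not pass; no PR was opened.",
            evidence=failing_checks,
        )
    if confidence < threshold:
        return Abstention(
            task_id,
            AbstainReason.LOW_CONFIDENCE,
            f"Confidence {confidence:.2f} is below the cutoff {threshold:.2f}.",
        )
    return ReadyForReview(task_id, branch)


outcome = finalize("task-1", "agent/task-1", ("tests",), confidence=0.9, threshold=0.8)
print(outcome.to_dict() if isinstance(outcome, Abstention) else outcome)
```

Because there is no third variant, a "we tried, here's what we got" result has nowhere to go except an abstention with evidence attached.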