Building Effective Harnesses for Long-Running AI Agents

Article Highlights:
  • AI agents need persistent memory for long-running tasks
  • Two-fold approach uses an initializer agent and a coding agent
  • feature_list.json prevents premature project completion
  • Git and progress logs keep state clean between sessions
  • End-to-end testing with Puppeteer is crucial for quality
  • Agents must work on a single feature at a time
  • Proper initialization reduces cognitive load for future sessions

Introduction

As AI model capabilities increase, developers are increasingly asking agents to tackle complex tasks spanning hours or even days. However, ensuring that long-running AI agents make consistent progress across multiple context windows remains an open challenge.

The core problem lies in the fact that agents work in discrete sessions. Each new session begins with no memory of what happened before, similar to a software project staffed by engineers working in shifts without communicating. Because context windows are limited and complex projects cannot be completed in a single run, a system is needed to bridge the gap between coding sessions.

In this article, we explore the solution developed to make the Claude Agent SDK effective over long time horizons: a two-fold approach inspired by human engineering practices.

The Problem: Session Amnesia

Even with context management techniques like compaction, frontier models can fail to build production-quality applications if left to their own devices with high-level prompts. Failures manifest primarily in two ways:

  • Attempting to "One-Shot": The agent tries to do too much at once, exhausting context in the middle of implementation and leaving incomplete, undocumented work for the next session.
  • Premature Declaration of Victory: The agent, seeing some progress, erroneously concludes the job is done, leaving features untested or incomplete.

A long-running AI agent is a system designed to execute complex tasks over multiple sessions, using structured external memory to maintain project consistency and context.

The Solution: A Two-Fold Harness

To address these challenges, the solution relies on dividing roles between two types of agents:

1. Initializer Agent

The very first session uses a specialized prompt. This agent's task is not to write application code, but to set up the environment. This includes creating:

  • An init.sh script to start the development environment.
  • A claude-progress.txt file to log activity.
  • An initial git repository to track file changes.
  • A feature_list.json file detailing project requirements.
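The scaffolding step can be sketched as a short script. This is a hypothetical illustration, not the actual initializer prompt output: the file names come from the article, but the driver function is invented here, the init.sh contents are a placeholder, and running `git init` is left out of the sketch.

```python
# Hypothetical sketch of the files the initializer agent lays down.
import json
from pathlib import Path

def scaffold(project: Path) -> None:
    project.mkdir(parents=True, exist_ok=True)
    # Startup script that later sessions run to bring up the dev environment
    (project / "init.sh").write_text("#!/bin/sh\nnpm run dev &\n")
    # Empty progress log, appended to at the end of every session
    (project / "claude-progress.txt").write_text("")
    # Feature list with every item initially marked as not passing
    features = [{
        "category": "functional",
        "description": "New chat button creates a fresh conversation",
        "passes": False,
    }]
    (project / "feature_list.json").write_text(json.dumps(features, indent=2))

scaffold(Path("demo-project"))
```

The point of front-loading this work is that every later session inherits the same orientation files instead of reconstructing the project state from scratch.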

2. Coding Agent

Every subsequent session is handled by an agent tasked with making incremental progress. This agent works on a single feature at a time and must leave the environment in a "clean state" at the end of the session, ready for the next shift.

Environment Management and Feature List

To prevent the agent from considering the project finished prematurely, the initializer agent creates a detailed JSON file with all required features. For example, for a claude.ai clone, this file might contain over 200 items.

"It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality."

System prompt instruction

Here is an example of the structure used to track features:

{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": [
    "Navigate to main interface",
    "Click the 'New Chat' button",
    "Verify a new conversation is created",
    "Check that chat area shows welcome state",
    "Verify conversation appears in sidebar"
  ],
  "passes": false
}

Using JSON proved superior to Markdown, as models are less likely to inappropriately change or overwrite the data structure.
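Because the file is structured data rather than free-form text, the coding agent can flip a feature's `passes` flag without rewriting anything else. A minimal sketch, assuming the schema above (the `mark_passed` helper is hypothetical, not from the article):

```python
import json

# In-memory copy of feature_list.json, following the article's schema
features = [{
    "category": "functional",
    "description": "New chat button creates a fresh conversation",
    "passes": False,
}]

def mark_passed(features: list[dict], description: str) -> None:
    # Flip only the matching feature's flag; everything else stays untouched
    for f in features:
        if f["description"] == description:
            f["passes"] = True

mark_passed(features, "New chat button creates a fresh conversation")
remaining = [f for f in features if not f["passes"]]
# The project counts as done only when `remaining` is empty
```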

Incremental Progress and Testing

Once the initial scaffolding is in place, the coding agent is instructed to work on only one feature at a time. After each change, the agent must:

  1. Run tests to verify the feature.
  2. Update the progress file.
  3. Commit to git with a descriptive message.

This approach allows using git to revert bad changes and recover a working code state, eliminating the need for the next agent to "guess" what happened.
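The three-step ritual above can be expressed as a small wrapper. This is an illustrative sketch: `npm test` stands in for whatever test command the project uses, and the injectable `run` parameter (defaulting to `subprocess.run`) exists only so the flow can be demonstrated without npm or git installed.

```python
import subprocess

def finish_feature(message: str, run=subprocess.run) -> None:
    """End-of-change ritual: test, log, commit (hypothetical helper)."""
    run(["npm", "test"], check=True)               # 1. verify the feature
    with open("claude-progress.txt", "a") as log:  # 2. human-readable trail
        log.write(message + "\n")
    run(["git", "add", "-A"], check=True)
    run(["git", "commit", "-m", message], check=True)  # 3. revertable checkpoint

# Demonstration with a stub runner that just records the commands
calls = []
finish_feature("feat: new chat button", run=lambda cmd, check: calls.append(cmd))
```

Each commit is a known-good checkpoint, which is what makes `git revert`-style recovery possible when a session goes wrong.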

Browser Automation with Puppeteer

A critical failure mode is the agents' tendency to mark a feature as complete without genuine end-to-end testing. To mitigate this, agents are provided with browser automation tools (like Puppeteer MCP) to test the application as a human user would. This dramatically improved the ability to identify and fix bugs not obvious from code alone.
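The shape of such an end-to-end check is worth seeing. The article uses Puppeteer via MCP (a JavaScript tool); the sketch below substitutes a fake browser object purely so the navigate/click/verify flow is visible, and the URL and selectors are invented:

```python
class FakeBrowser:
    """Stand-in for a real Puppeteer page; records actions for inspection."""
    def __init__(self):
        self.log = []
    def goto(self, url):
        self.log.append(("goto", url))
    def click(self, selector):
        self.log.append(("click", selector))
    def text(self, selector):
        return "New conversation"  # canned response for the sketch

def run_feature(browser) -> bool:
    # Mirrors the "steps" of the feature_list.json example:
    browser.goto("http://localhost:3000")        # navigate to main interface
    browser.click("button.new-chat")             # click the 'New Chat' button
    # The UI, not the code, is what proves the feature works
    return "New conversation" in browser.text("main")

b = FakeBrowser()
passed = run_feature(b)
```

Only after a check like this succeeds should the feature's `passes` flag be set to true.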

Typical Session Workflow

With this system in place, a typical agent session starts with a series of standard steps to get its bearings:

  • Run pwd to confirm the working directory.
  • Read git logs and progress files to understand current state.
  • Read the feature_list.json file and choose the highest-priority unfinished feature.
  • Run init.sh to start the dev server and verify basic functionality.
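The orientation pass can be condensed into one function. A minimal sketch, assuming the file names from the article; the `start_session` helper and the sample data are hypothetical, and the real agent would run these steps as shell commands:

```python
import json
import tempfile
from pathlib import Path

# Fabricated project state so the sketch is self-contained
project = Path(tempfile.mkdtemp())
(project / "claude-progress.txt").write_text("session 1: scaffolded app\n")
(project / "feature_list.json").write_text(json.dumps(
    [{"description": "New chat button", "passes": False}]))

def start_session(project: Path) -> dict:
    """Gather the context a fresh session needs before writing any code."""
    # Recent progress log entries stand in for memory of prior sessions
    tail = (project / "claude-progress.txt").read_text().splitlines()[-3:]
    # Highest-priority unfinished feature = first one not yet passing
    features = json.loads((project / "feature_list.json").read_text())
    todo = next((f for f in features if not f["passes"]), None)
    return {"recent_progress": tail, "next_feature": todo}

state = start_session(project)
```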

Conclusion

Research demonstrates that structuring the work of long-running AI agents through rigorous initialization and incremental progress is key to success. While open challenges remain, such as the effectiveness of specialized multi-agent architectures (e.g., dedicated QA or cleanup agents), this approach provides a solid foundation for autonomous software development.

For technical details and code examples, see the original article: Effective harnesses for long-running agents.

FAQ

Frequently asked questions about managing long-running AI agents.

What is a long-running AI agent?

A long-running AI agent is a system capable of working on complex tasks requiring multiple sessions, maintaining context and memory of work done through log files and version control.

Why is a harness necessary for AI agents?

Without a harness (support structure), agents tend to forget context between sessions or try to complete all work at once, leading to errors and incomplete code.

What is the role of the feature_list.json file?

The feature_list.json file serves to clearly define all project requirements and track the status of each feature (pass/fail), preventing the agent from declaring work finished prematurely.

How is testing handled in autonomous agents?

Agents use browser automation tools like Puppeteer to run end-to-end tests, simulating real user actions to verify that features are actually operational.

What is the purpose of the initializer agent?

The initializer agent prepares the work environment by creating startup scripts, git repositories, and log files, laying the foundation for subsequent agents to work incrementally.
