Archon: building a practical self-driving computer

Article Highlights:
  • Archon pairs a GPT‑5 planner with an archon‑mini executor
  • Hierarchical split: what to do vs where to click
  • Saliency patches cut visual token and compute costs
  • Patch caching reduces latency and GPU usage
  • Adaptive router: favor fast path, escalate on ambiguity
  • GRPO and synthetic rollouts improve grounding training
  • Future: streaming frames for smoother actions
  • Goal: distill planner into executor for simplicity

Introduction

Archon is a desktop copilot that turns natural‑language instructions into UI actions. It combines GPT‑5 for planning with a small grounding model that produces precise click coordinates.

Context

The design splits responsibilities between two models. A powerful reasoner (GPT‑5) outputs semantic actions, and a lightweight executor (archon‑mini) returns exact screen coordinates. The goal is more autonomous, natural desktop control while keeping latency and cost under control.
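The split between "what to do" and "where to click" can be pictured as a small interface. The sketch below is illustrative only: `SemanticAction`, `Grounder`, `execute`, and `fake_grounder` are hypothetical names, not Archon's actual API, and the stub grounder stands in for archon‑mini.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class SemanticAction:
    """What to do, as a planner might phrase it (hypothetical schema)."""
    verb: str    # e.g. "click"
    target: str  # e.g. "the blue Submit button"


# A grounder maps (screenshot bytes, target description) to screen (x, y).
Grounder = Callable[[bytes, str], Tuple[int, int]]


def execute(action: SemanticAction, screenshot: bytes, ground: Grounder) -> dict:
    """Resolve a semantic action into a concrete UI event."""
    x, y = ground(screenshot, action.target)
    return {"verb": action.verb, "x": x, "y": y}


def fake_grounder(screenshot: bytes, target: str) -> Tuple[int, int]:
    # Stand-in for the grounding model; always returns a fixed point.
    return (640, 360)


event = execute(SemanticAction("click", "the blue Submit button"), b"", fake_grounder)
print(event)
```

Keeping the planner's output free of coordinates is what lets the two models be swapped or retrained independently.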

How it works

Core flow: Archon captures a screenshot, runs a saliency scorer to extract high‑relevance patches, and reuses cached results for regions that have not changed. It then sends the semantic description of the target to the grounding model, which outputs precise (x, y) coordinates. To save latency, a routing policy escalates to the planner only when the result is ambiguous.
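As a rough illustration of the patch extraction step, the sketch below tiles a saliency map and keeps the K highest‑scoring tiles. The article does not describe the saliency scorer itself, so the map here is simply given as input; the function name `top_k_patches`, the fixed square tiling, and scoring by summed saliency are all assumptions.

```python
import heapq
from typing import List, Tuple

Patch = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def top_k_patches(saliency: List[List[float]], patch: int, k: int) -> List[Patch]:
    """Split a 2-D saliency map into patch x patch tiles and keep the k best."""
    height, width = len(saliency), len(saliency[0])
    scored = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = saliency[top:top + patch]
            # Score a tile by its summed saliency (an assumed scoring rule).
            score = sum(sum(row[left:left + patch]) for row in rows)
            w = min(patch, width - left)   # edge tiles may be narrower
            h = min(patch, height - top)   # or shorter
            scored.append((score, (left, top, w, h)))
    best = heapq.nlargest(k, scored, key=lambda item: item[0])
    return [box for _, box in best]


saliency_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.8],
    [0.0, 0.2, 0.7, 0.6],
]
print(top_k_patches(saliency_map, patch=2, k=2))
```

Only the selected tiles would be encoded, which is where the reduction in visual tokens comes from.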

Technical highlights

  • Hierarchical split: GPT‑5 for reasoning, archon‑mini for grounding
  • Patching: top‑K patch extraction reduces visual tokens and raises precision
  • Cache: reusing invariant patches (70%+ hit rate) lowers latency and GPU cost
  • Training: archon‑mini (7B, Qwen‑2.5‑VL) trained with GRPO and synthetic rollouts
  • Adaptive routing: fast path (~50 ms), with escalation when signals indicate uncertainty
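The caching idea in the list above can be sketched as a lookup keyed by patch content, so an unchanged region is encoded only once. This is a minimal sketch under assumptions: the article does not say how Archon detects invariant patches, so exact SHA‑256 matching is used here, and `PatchCache` and its `encode` callback are hypothetical names.

```python
import hashlib
from typing import Callable, Dict


class PatchCache:
    """Reuse encodings for patches whose pixels are byte-for-byte unchanged."""

    def __init__(self, encode: Callable[[bytes], list]):
        self._encode = encode              # expensive call, e.g. a vision encoder
        self._store: Dict[str, list] = {}
        self.hits = 0
        self.misses = 0

    def get(self, pixels: bytes) -> list:
        key = hashlib.sha256(pixels).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode(pixels)
        return self._store[key]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Toy encoder: a real system would run the vision encoder here.
cache = PatchCache(encode=lambda pixels: [len(pixels)])
for pixels in [b"toolbar", b"canvas-1", b"toolbar", b"canvas-2", b"toolbar"]:
    cache.get(pixels)
print(cache.hits, cache.misses, cache.hit_rate)
```

Exact hashing misses patches that differ by a single pixel (a blinking cursor, for example); a tolerance‑based comparison would trade some correctness risk for a higher hit rate.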

Main trade‑offs

  • Accuracy vs latency: deeper reasoning improves robustness but increases delay
  • Vision token cost: mitigated via patching, downsampling and caching
  • UI robustness: some element types need more data to handle reliably

Conclusion

Archon shows that separating planner from executor, combined with patch‑based grounding, caching, and adaptive routing, is a practical path toward a self‑driving computer. Source: Surya Dantuluri.

FAQ

How does Archon build a self‑driving computer on the desktop?

It splits planning (GPT‑5) and grounding (archon‑mini): the planner outputs semantic actions and the executor returns precise pixel coordinates from salient patches.

What latency and cost limits affect Archon?

Visual tokens and deep reasoning raise latency and cost; Archon mitigates this with patching, caching, and an adaptive routing policy.
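One way to picture the adaptive routing policy is a confidence threshold: accept the fast grounding result when it is confident, otherwise escalate. The sketch below assumes the fast path reports a confidence score, which the article does not specify; `route`, `fast_ground`, `escalate`, and the 0.8 threshold are hypothetical.

```python
from typing import Callable, Tuple

Click = Tuple[int, int]


def route(
    target: str,
    fast_ground: Callable[[str], Tuple[Click, float]],
    escalate: Callable[[str], Click],
    min_confidence: float = 0.8,  # assumed threshold, not from the article
) -> Tuple[Click, str]:
    """Prefer the fast path; fall back to the slower planner on low confidence."""
    click, confidence = fast_ground(target)
    if confidence >= min_confidence:
        return click, "fast"
    return escalate(target), "escalated"


def stub_fast(target: str) -> Tuple[Click, float]:
    # Pretend the executor is unsure whenever two similar buttons exist.
    return ((100, 200), 0.4 if "Save" in target else 0.95)


def stub_planner(target: str) -> Click:
    return (300, 400)


print(route("the Close button", stub_fast, stub_planner))
print(route("the Save button", stub_fast, stub_planner))
```

Raising the threshold shifts more actions to the planner, improving robustness at the cost of latency.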

How does archon‑mini perform GUI grounding?

It extracts top patches via a saliency scorer, encodes them and outputs (x,y) clicks; training uses GRPO and synthetic rollouts for robustness.
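For readers unfamiliar with GRPO, its core step is to score each sampled output relative to the other samples drawn for the same prompt, normalizing rewards by the group's mean and standard deviation. The sketch below shows only that step. The reward used here (1 if a click lands inside the target's bounding box, else 0) is an assumption for illustration; the article does not describe Archon's reward design.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def click_reward(x: int, y: int, box: Box) -> float:
    """Hypothetical grounding reward: 1.0 if the click is inside the target box."""
    left, top, right, bottom = box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards the all-equal case
    return [(r - mu) / (sigma + eps) for r in rewards]


target_box = (100, 100, 200, 150)
sampled_clicks = [(120, 120), (300, 40), (150, 140), (90, 100)]
rewards = [click_reward(x, y, target_box) for x, y in sampled_clicks]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Clicks that hit the target receive a positive advantage and misses a negative one, so no separate value model is needed to provide a baseline.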

What safety risks should be considered for a self‑driving computer?

Risks include incorrect actions on sensitive UIs and interface drift; the planner acts as a safety guard for ambiguous cases.

How do you measure Archon's efficiency in practice?

Useful metrics: per‑action latency (ms), patch cache hit‑rate, escalation frequency to the planner, and end‑to‑end success rate.
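These metrics can be aggregated from per‑action logs. The sketch below assumes a hypothetical log record (`ActionLog`) and, for simplicity, treats success per action, whereas an end‑to‑end success rate would normally be computed per task.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class ActionLog:
    """One executed action (hypothetical log schema)."""
    latency_ms: float
    cache_hits: int
    cache_lookups: int
    escalated: bool
    succeeded: bool


def summarize(logs: List[ActionLog]) -> Dict[str, float]:
    """Aggregate the four metrics over a non-empty list of action logs."""
    if not logs:
        raise ValueError("need at least one log entry")
    lookups = sum(log.cache_lookups for log in logs)
    hits = sum(log.cache_hits for log in logs)
    return {
        "median_latency_ms": median(log.latency_ms for log in logs),
        "cache_hit_rate": hits / lookups if lookups else 0.0,
        "escalation_rate": sum(log.escalated for log in logs) / len(logs),
        "success_rate": sum(log.succeeded for log in logs) / len(logs),
    }


logs = [
    ActionLog(50.0, 7, 10, False, True),
    ActionLog(60.0, 8, 10, False, True),
    ActionLog(900.0, 2, 10, True, True),   # an escalated, slow action
    ActionLog(55.0, 9, 10, False, False),
]
summary = summarize(logs)
print(summary)
```

The median is used rather than the mean because a few escalated actions can dominate average latency.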

Evol Magazine