Introduction
Archon is a desktop copilot that turns natural language instructions into UI actions by combining GPT‑5 for planning and a small grounding model for precise click coordinates.
Context
The design splits responsibilities: a powerful reasoner (GPT‑5) decides what to do and emits semantic actions, while a lightweight executor (archon‑mini) grounds each action into exact screen coordinates. The goal is more autonomous, natural desktop control while keeping latency and cost under control.
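A minimal sketch of what that boundary could look like as a data contract; the class and field names here are illustrative assumptions, not Archon's actual API:

```python
from dataclasses import dataclass

@dataclass
class SemanticAction:
    """One step emitted by the planner, e.g. click the 'Save' button."""
    verb: str          # "click", "type", "scroll", ...
    target: str        # natural-language description of the UI element
    payload: str = ""  # text to type, scroll distance, etc.

@dataclass
class GroundedAction:
    """The executor's resolution of a SemanticAction into screen space."""
    x: int
    y: int
    confidence: float  # consumed later by the routing policy
```

Keeping the interface this narrow is what lets the heavy reasoner and the light grounder be swapped, cached, and scaled independently.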
How it works
Core flow: Archon captures screenshots, runs a saliency scorer to extract the most relevant patches, reuses cached results for unchanged regions, then sends semantic descriptions of the target to the grounding model, which returns precise (x, y) coordinates. A routing policy escalates to the planner only when the action is ambiguous, keeping the common path fast.
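To make the patch-extraction step concrete, here is a minimal sketch assuming the screenshot arrives as a numpy array; the contrast-based scorer is a crude stand-in, since the actual saliency model is not described:

```python
import numpy as np

def top_k_patches(screenshot: np.ndarray, k: int = 8, patch: int = 224):
    """Tile the screenshot, score each tile, and keep the k most salient ones."""
    h, w = screenshot.shape[:2]
    scored = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tile = screenshot[y:y + patch, x:x + patch]
            score = float(tile.std())  # crude stand-in for a learned saliency score
            scored.append((score, x, y, tile))
    scored.sort(key=lambda t: t[0], reverse=True)
    # Only these tiles, plus their offsets so patch-local clicks can be mapped
    # back to full-screen coordinates, are sent to the grounding model.
    return scored[:k]
```

Sending only the top‑K tiles instead of the full frame is what keeps the visual token count, and therefore cost and latency, low.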
Technical highlights
- Hierarchical split: GPT‑5 for reasoning, archon‑mini for grounding
- Patching: top‑K patch extraction reduces visual tokens and raises precision
- Cache: reusing invariant patches (70%+ hit rate) lowers latency and GPU cost; see the cache sketch after this list
- Training: archon‑mini, a 7B model based on Qwen‑2.5‑VL, trained with GRPO on synthetic rollouts
- Adaptive routing: a fast path (~50 ms) with escalation to the planner when signals indicate uncertainty; see the routing sketch after this list
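One plausible implementation of the invariant‑patch cache keys entries on a content hash of the raw pixels, so an unchanged region never hits the GPU twice. The exact cache policy isn't documented, so treat this as a sketch; the 70%+ hit rate is the figure reported for Archon, not something this code guarantees.

```python
import hashlib
from typing import Callable, Dict

class PatchCache:
    """Reuse grounding-model encodings for screen regions that did not change."""

    def __init__(self) -> None:
        self._store: Dict[str, object] = {}
        self.hits = 0
        self.misses = 0

    def get_or_encode(self, patch_bytes: bytes, encode: Callable[[bytes], object]) -> object:
        key = hashlib.sha256(patch_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = encode(patch_bytes)  # pay the GPU cost only on a miss
        return self._store[key]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```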
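The adaptive routing policy can be read as a cheap boolean decision sitting in front of the planner. The signals and thresholds below are assumptions for illustration; the write‑up only says escalation happens when signals indicate uncertainty.

```python
from dataclasses import dataclass

@dataclass
class RoutingSignals:
    grounding_confidence: float  # reported by archon-mini with each click
    screen_changed: bool         # large diff against the previous frame
    retries: int                 # times this step has already failed

def should_escalate(s: RoutingSignals,
                    min_confidence: float = 0.8,
                    max_retries: int = 1) -> bool:
    """Stay on the ~50 ms fast path by default; call the GPT-5 planner otherwise."""
    if s.retries > max_retries:
        return True
    if s.grounding_confidence < min_confidence:
        return True
    if s.screen_changed and s.retries > 0:
        return True
    return False
```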
Main trade‑offs
- Accuracy vs latency: deeper reasoning improves robustness but increases delay
- Vision token cost: mitigated via patching, downsampling and caching
- UI robustness: some element types need more data to handle reliably
Conclusion
Archon shows that separating the planner from the executor, and pairing patch‑based grounding with adaptive caching, is a practical path toward a self‑driving computer. Source: Surya Dantuluri.
FAQ
How does Archon build a self‑driving computer on the desktop?
It splits planning (GPT‑5) and grounding (archon‑mini): the planner outputs semantic actions and the executor returns precise pixel coordinates from salient patches.
What latency and cost limits affect Archon?
Visual tokens and deep reasoning raise latency and cost; Archon mitigates this with patching, caching, and an adaptive routing policy.
How does archon‑mini perform GUI grounding?
It extracts the top‑K patches with a saliency scorer, encodes them, and outputs (x, y) clicks; training uses GRPO on synthetic rollouts for robustness.
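The core of GRPO is scoring each sampled rollout against its own group rather than against a learned value function. Below is a minimal sketch of that group‑relative advantage, paired with a hypothetical click‑distance reward; the actual reward shaping used for archon‑mini is not public.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantage: normalize each reward within its own group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def click_reward(pred_xy, target_xy, tol: float = 12.0) -> float:
    """Hypothetical reward: 1.0 for a click within `tol` px of the element center,
    decaying toward 0 as the click drifts away."""
    dist = float(np.linalg.norm(np.asarray(pred_xy, float) - np.asarray(target_xy, float)))
    return 1.0 if dist <= tol else max(0.0, 1.0 - dist / 100.0)

# Four sampled clicks for the same instruction and screenshot (one synthetic group).
preds = [(410, 225), (415, 230), (600, 480), (412, 228)]
target = (412, 227)
rewards = np.array([click_reward(p, target) for p in preds])
print(grpo_advantages(rewards))  # near-hits get positive advantage, the outlier negative
```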
What safety risks should be considered for a self‑driving computer?
Risks include incorrect actions on sensitive UIs and interface drift; the planner acts as a safety guard for ambiguous cases.
How do you measure efficiency of Archon in practice?
Useful metrics: per‑action latency (ms), patch cache hit‑rate, escalation frequency to the planner, and end‑to‑end success rate.
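A small sketch of how those metrics could be aggregated from per‑action logs; the log schema is an assumption, not something Archon is documented to expose.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Dict, List

@dataclass
class ActionLog:
    latency_ms: float
    cache_hit: bool
    escalated: bool   # was the GPT-5 planner consulted for this action?
    succeeded: bool   # did the step achieve its intended effect?

def summarize(logs: List[ActionLog]) -> Dict[str, float]:
    """Aggregate per-action logs into the efficiency metrics listed above."""
    n = len(logs)
    return {
        "median_latency_ms": median(l.latency_ms for l in logs),
        "mean_latency_ms": mean(l.latency_ms for l in logs),
        "cache_hit_rate": sum(l.cache_hit for l in logs) / n,
        "escalation_rate": sum(l.escalated for l in logs) / n,
        "success_rate": sum(l.succeeded for l in logs) / n,
    }
```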