Introduction
Google has announced the release of Gemini 2.5 Computer Use, a specialized AI model that represents a significant step forward in the field of intelligent agents capable of directly interacting with user interfaces. Built on Gemini 2.5 Pro's visual understanding and reasoning capabilities, this new model enables developers to create agents that can navigate web pages and applications just as a human would: by clicking, typing, and scrolling. The model is available in preview via the Gemini API on Google AI Studio and Vertex AI. According to Google, it outperforms leading alternatives on multiple web and mobile control benchmarks while running at lower latency.
The Need for User Interface Control
Gemini 2.5 Computer Use addresses a concrete need in today's digital landscape. While AI models can interface with software through structured APIs, many digital tasks still require direct interaction with graphical interfaces. Filling out and submitting forms, manipulating interactive elements like dropdowns and filters, or operating behind login systems are operations that require a more natural and flexible approach.
The ability to natively fill out forms, handle interactive elements, and access protected areas represents a crucial step in developing general-purpose, powerful agents capable of performing complex tasks without requiring specific API integrations for each platform.
How the Model Works
Gemini 2.5 Computer Use's core capabilities are exposed through the new computer_use tool in the Gemini API and must be operated within an iterative loop. The tool's inputs include the user request, a screenshot of the environment, and a history of recent actions. It's also possible to specify functions to exclude from the full list of supported UI actions or include additional custom functions.
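To make the exclude/include idea concrete, here is a minimal sketch of how a client might compute the set of actions an agent is allowed to use. The action names and the effective_actions helper are placeholders invented for illustration; they are not the Gemini API's real function names or configuration fields.

```python
# Hypothetical action names for illustration only; the real tool defines its own list.
SUPPORTED_ACTIONS = {"click", "type_text", "scroll", "navigate", "drag_and_drop"}


def effective_actions(excluded=(), custom=()):
    """Return supported actions minus the excluded ones, plus any custom functions."""
    unknown = set(excluded) - SUPPORTED_ACTIONS
    if unknown:
        # Excluding something that is not supported is probably a typo.
        raise ValueError(f"cannot exclude unsupported actions: {sorted(unknown)}")
    return (SUPPORTED_ACTIONS - set(excluded)) | set(custom)


# Drop drag-and-drop and add one custom function (also a made-up name).
allowed = effective_actions(excluded=["drag_and_drop"], custom=["open_app"])
print(sorted(allowed))
```

Rejecting unknown exclusions early is a design choice of this sketch, so that a misspelled action name fails loudly instead of being silently ignored.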
The model analyzes these inputs and generates a response, typically a function call representing a UI action such as clicking or typing. The response may also contain a request for end user confirmation, which is required for certain actions like making purchases. The client-side code then executes the received action.
After the action is executed, a new screenshot of the GUI and the current URL are sent back to the Computer Use model as a function response, restarting the loop. This iterative process continues until the task is complete, an error occurs, or the interaction is terminated by a safety response or user decision.
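The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Gemini API: Observation, ModelReply, run_agent_loop, and the propose/execute/observe callbacks are hypothetical names, the model is replaced by a scripted stand-in, and the step limit is an added safeguard that the source does not mention. Termination by a safety response or a user decision is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    screenshot: bytes  # image of the current GUI state
    url: str           # current URL, sent back with the screenshot


@dataclass
class ModelReply:
    action: Optional[dict] = None  # a UI action, e.g. {"name": "click", "x": 1, "y": 2}
    done: bool = False             # True when the model considers the task complete


def run_agent_loop(request, propose, execute, observe, max_steps=20):
    """Ask for an action, execute it, observe the new state, and repeat."""
    history = []
    obs = observe()
    for _ in range(max_steps):
        reply = propose(request, obs, history)
        if reply.done or reply.action is None:
            return "complete", history
        try:
            execute(reply.action)  # client-side code performs the UI action
        except Exception as exc:
            return f"error: {exc}", history
        history.append(reply.action)
        obs = observe()  # fresh screenshot and URL restart the loop
    return "step limit reached", history


def demo():
    """Drive the loop with a scripted stand-in for the model."""
    script = iter([
        ModelReply(action={"name": "click", "x": 100, "y": 200}),
        ModelReply(action={"name": "type_text", "text": "hello"}),
        ModelReply(done=True),
    ])
    executed = []
    return run_agent_loop(
        "Fill in the greeting field",
        propose=lambda req, obs, hist: next(script),
        execute=executed.append,
        observe=lambda: Observation(screenshot=b"", url="https://example.com"),
    )


status, history = demo()
print(status, [a["name"] for a in history])
```

Passing the model call, the executor, and the observer in as callbacks keeps the loop itself independent of any particular SDK or browser driver.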
The model is primarily optimized for web browsers but also demonstrates strong promise for mobile UI control tasks. It is not yet optimized for desktop OS-level control.
Performance and Benchmarks
Gemini 2.5 Computer Use demonstrates excellent performance across multiple web and mobile control benchmarks. Results include self-reported data, evaluations conducted by Browserbase, and Google's internal evaluations. The assessments highlighted how the model outperforms leading market alternatives while offering lower response times.
Tests covered various task categories, from complex web navigation to mobile interface management, demonstrating the model's versatility and reliability in real-world scenarios. Complete evaluation details are available in the Gemini 2.5 Computer Use evaluation information and in Browserbase's blog post.
Safety Approach
Google has adopted a responsible approach from the start, recognizing that AI agents controlling computers introduce unique risks. These risks include intentional misuse by users, unexpected model behavior, and prompt injections or scams present in the web environment. For this reason, implementing appropriate safety measures has been critical.
Safety features have been trained directly into the model to address these three key risks, as described in the Gemini 2.5 Computer Use System Card. Additionally, Google provides developers with safety controls that allow them to prevent the model from auto-completing potentially risky or harmful actions.
Controls include a per-step safety service that assesses each action proposed by the model before execution, and system instructions that enable developers to specify that the agent should refuse or request confirmation before taking high-stakes actions. Examples of these actions include compromising system integrity, violating security, bypassing CAPTCHAs, or controlling medical devices.
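On the client side, a developer might pair these controls with a simple gate that runs before each action is executed. The sketch below is an assumption-laden illustration: the REFUSE and CONFIRM sets, the requires_confirmation flag, and the decide function are all hypothetical, and it stands in for neither Google's per-step safety service nor the real system-instruction mechanism.

```python
# Hypothetical policy labels chosen for this example.
REFUSE = {"bypass_captcha", "control_medical_device"}  # never executed
CONFIRM = {"purchase"}                                  # executed only after user approval


def decide(action, ask_user):
    """Return 'execute', 'declined', or 'refuse' for one proposed action.

    `action` is a dict with a "name" key and an optional
    "requires_confirmation" flag (an assumed field, set when the model's
    response asks for end user confirmation).
    `ask_user` is a callback that returns True if the user approves.
    """
    name = action["name"]
    if name in REFUSE:
        return "refuse"
    if name in CONFIRM or action.get("requires_confirmation", False):
        return "execute" if ask_user(action) else "declined"
    return "execute"


# A purchase proceeds only when the user says yes.
print(decide({"name": "purchase"}, ask_user=lambda a: False))
print(decide({"name": "click"}, ask_user=lambda a: False))
```

Checking the refuse list before asking the user means a prohibited action cannot be approved by accident.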
Practical Applications and Use Cases
Google teams have already deployed the model to production for use cases such as UI testing, which can make software development significantly faster. Versions of this model also power Project Mariner, the Firebase Testing Agent, and some agentic capabilities in AI Mode in Search.
Users from the early access program have tested the model in personal assistants, workflow automation, and UI testing. Some testimonials highlight significant improvements:
"A lot of our workflows require interacting with interfaces meant for humans where speed is especially important. Gemini 2.5 Computer Use is far ahead of the competition, often being 50% faster and better than the next best solutions we've considered."
Poke.com, proactive AI assistant in iMessage, WhatsApp, and SMS
"Our agents run fully autonomously, performing work where small mistakes in collecting and parsing data are unacceptable. Gemini 2.5 Computer Use outperformed other models at reliably parsing context in complex cases, increasing performance by up to 18% on our hardest evals."
Autotab, drop-in AI agent
An interesting case involves Google's payments platform team, which implemented the Computer Use model as a contingency mechanism for fragile end-to-end UI tests, which contributed to 25% of all test failures. With this in place, the model now successfully recovers over 60% of those failed executions, which previously took multiple days to fix.
Getting Started with Gemini 2.5 Computer Use
The model has been available in public preview since the announcement, accessible via the Gemini API on Google AI Studio and Vertex AI. Developers can try the model in a demo environment hosted by Browserbase, or start building their own agent loop locally with Playwright or in a cloud VM with Browserbase.
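For the local Playwright route, the client-side half of the loop could look roughly like the sketch below. The action dictionary format ("name", "x", "y", "text", "delta_y", "url") is invented for this example and assumes pixel coordinates; the actual function-call schema is defined in the official documentation. The page methods used (mouse.click, mouse.wheel, keyboard.type, goto, screenshot, url) belong to Playwright's Python sync API.

```python
def execute_action(page, action):
    """Apply one hypothetical action dict to a Playwright page.

    Returns the screenshot and current URL, which the agent loop would
    send back to the model as the function response.
    """
    name = action["name"]
    if name == "click":
        page.mouse.click(action["x"], action["y"])
    elif name == "type_text":
        page.keyboard.type(action["text"])
    elif name == "scroll":
        page.mouse.wheel(0, action["delta_y"])  # vertical scroll only
    elif name == "navigate":
        page.goto(action["url"])
    else:
        raise ValueError(f"unsupported action: {name}")
    return {"screenshot": page.screenshot(), "url": page.url}


def run_with_playwright():
    """Example wiring; requires `pip install playwright` and `playwright install`."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")
        result = execute_action(page, {"name": "click", "x": 50, "y": 50})
        print(result["url"], len(result["screenshot"]))
        browser.close()
```

Because execute_action only relies on the page object's methods, it can be exercised in tests with a stand-in object before pointing it at a real browser.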
For those wishing to dive deeper, comprehensive documentation is available both for general use via Google AI Studio and for enterprise use via Vertex AI. Google also encourages developers to share feedback and contribute to the future roadmap through the dedicated Developer Forum.
Conclusion
Gemini 2.5 Computer Use represents a significant advancement in AI agents, offering developers powerful tools to create applications that interact with user interfaces naturally and efficiently. Superior benchmark performance, combined with a careful approach to safety and robust controls for developers, positions this model as a promising solution for automating complex tasks requiring GUI interaction. Testimonials from early adopters confirm the model's potential in real-world scenarios, from reducing development time to improving reliability in automated workflows. With availability in public preview, the developer community will have the opportunity to explore new use cases and contribute to the evolution of this technology.
FAQ
What is Gemini 2.5 Computer Use?
Gemini 2.5 Computer Use is a specialized AI model from Google that enables agents to interact with graphical user interfaces, performing actions such as clicking, typing, and scrolling to complete complex tasks on web and mobile platforms.
How does the Gemini 2.5 Computer Use model work?
The model operates in an iterative loop: it receives the user request, a screenshot, and the recent action history, then responds with a function call representing a UI action. Client-side code executes that action and returns a new screenshot and the current URL, and the cycle repeats until the task is complete.
What are the main use cases for Gemini 2.5 Computer Use?
Use cases include automated UI testing, AI personal assistants, workflow automation, form filling, and complex navigation of web and mobile applications.
Is Gemini 2.5 Computer Use safe to use?
Google has trained safety features directly into the model and also provides developer controls: a per-step safety service that assesses each proposed action before execution, and system instructions that make the agent refuse or request user confirmation before high-stakes actions such as purchases.
How can I access Gemini 2.5 Computer Use?
The model is available in public preview via the Gemini API on Google AI Studio and Vertex AI. Developers can try it in a Browserbase demo environment or integrate it into their own projects.
Which platforms does Gemini 2.5 Computer Use support?
The model is primarily optimized for web browsers and demonstrates strong potential for mobile interface control as well. It is not yet optimized for desktop operating system-level control.
Does Gemini 2.5 Computer Use outperform other similar models?
Yes, according to benchmarks conducted by Google and partners like Browserbase, Gemini 2.5 Computer Use outperforms leading alternatives across multiple web and mobile control tests, with lower latency.
What limitations does Gemini 2.5 Computer Use have?
The model is not yet optimized for full desktop operating system control and requires careful implementation of safety controls by developers to prevent unauthorized or risky actions.