Introduction
SWE-Bench Pro is Scale AI’s new benchmark designed to test AI agents on complex, real-world software engineering tasks. It aims to assess how well artificial intelligence can tackle professional-level problems requiring extensive code changes and deep expertise.
Context
The field of AI-driven software development is rapidly evolving. Current models struggle with long-horizon, enterprise-grade tasks typical of business applications and B2B services.
Definition
SWE-Bench Pro is a benchmark that evaluates AI agents on realistic, complex, long-horizon software engineering problems drawn from enterprise repositories.
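As in the original SWE-Bench, evaluation is test-driven: a candidate patch counts as a solution only if the issue's previously failing tests now pass and the pre-existing tests still do. Below is a minimal sketch of that check, assuming a pytest-based repository; the helper names and the fail-to-pass / pass-to-pass terminology follow SWE-Bench conventions and are assumptions here, not SWE-Bench Pro's confirmed harness.

```python
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    # Run the named tests with pytest; True iff they all pass.
    # Assumes a pytest-based repo; real harnesses vary per repository.
    result = subprocess.run(["pytest", "-q", *test_ids],
                            cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def is_resolved(repo_dir: str, fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    # A candidate patch counts as a solution only if it fixes the
    # issue's failing tests without breaking the existing suite.
    return run_tests(repo_dir, fail_to_pass) and run_tests(repo_dir, pass_to_pass)
```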
SWE-Bench Pro Features
- 1,865 problems from 41 active repositories, including business apps, B2B services, and developer tools
- Three splits: a public set, a held-out set, and a commercial set sourced through startup partnerships (see the loading sketch after this list)
- Tasks requiring hours or days of human work, often involving multi-file changes
- All problems are human-verified and supplied with enough context to be solvable
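For orientation, here is a minimal sketch of loading the public split, assuming it is distributed via the Hugging Face `datasets` library. The dataset ID, split name, and field names below are illustrative assumptions, not confirmed identifiers; check Scale AI's release page for the real ones.

```python
from datasets import load_dataset

# Hypothetical dataset ID and split name for the public set.
ds = load_dataset("ScaleAI/SWE-bench_Pro", split="test")

# Each instance pairs a repository snapshot with an issue statement
# and the tests a candidate patch must pass. The field names below
# follow SWE-Bench conventions and are assumptions here.
for task in ds.select(range(3)):
    print(task["instance_id"], task["repo"])
```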
The Challenge
Leading AI models, including GPT-5, achieve less than 25% Pass@1 on SWE-Bench Pro, meaning they solve fewer than a quarter of the tasks on the first attempt. This highlights the gap between current AI capabilities and the demands of professional software engineering.
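Pass@1 is simply the fraction of tasks an agent solves on its first (and only) attempt. A worked sketch of the arithmetic, with hypothetical per-task results:

```python
# Pass@1 with one attempt per task: a task counts as solved only if
# the candidate patch makes the held-out tests pass.
# The result records below are hypothetical.
results = {
    "task-001": True,   # patch applied and all tests passed
    "task-002": False,  # patch failed to apply
    "task-003": False,  # tests still failing
}

pass_at_1 = sum(results.values()) / len(results)
print(f"Pass@1 = {pass_at_1:.1%}")  # -> Pass@1 = 33.3%
```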
Solution / Approach
Rather than stopping at scores, the benchmark clusters agent failure modes to pinpoint where models break down. It is also built to resist contamination: the public set favors repositories with strong copyleft licenses, which are typically excluded from training corpora, while the held-out and commercial sets are never published. Together these make SWE-Bench Pro a durable platform for testing and improving enterprise-grade autonomous agents.
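The paper's actual failure taxonomy is not reproduced here; the sketch below shows one plausible way to bucket failed runs by scanning agent transcripts for telltale strings. The categories and keywords are illustrative assumptions.

```python
# Group agent failures into coarse buckets by keyword-matching run
# transcripts. Categories and needles are illustrative assumptions,
# not SWE-Bench Pro's published taxonomy.
from collections import Counter

FAILURE_PATTERNS = {
    "syntax_error": ["SyntaxError", "IndentationError"],
    "failed_patch": ["patch does not apply", "hunk FAILED"],
    "test_failure": ["FAILED", "AssertionError"],
    "timeout": ["timed out", "TimeoutError"],
}

def classify(transcript: str) -> str:
    # Return the first category whose keywords appear in the transcript.
    for label, needles in FAILURE_PATTERNS.items():
        if any(needle in transcript for needle in needles):
            return label
    return "other"

def cluster(transcripts: list[str]) -> Counter:
    return Counter(classify(t) for t in transcripts)

# Hypothetical transcripts from two failed runs:
runs = ["... SyntaxError: invalid syntax ...", "... patch does not apply ..."]
print(cluster(runs))  # Counter({'syntax_error': 1, 'failed_patch': 1})
```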
Conclusion
SWE-Bench Pro marks a significant step toward truly autonomous and reliable AI agents for software engineering. Current results leave substantial room for improvement and open clear research opportunities.
FAQ
What is SWE-Bench Pro and why is it important for AI research?
SWE-Bench Pro is a benchmark that tests AI agents on real, complex software engineering problems; it matters because it measures practical progress toward autonomous development.
What types of problems does SWE-Bench Pro include?
The benchmark features long-horizon, multi-file tasks from enterprise and startup repositories.
How do AI agents perform on SWE-Bench Pro?
Top models perform poorly: GPT-5, the strongest model evaluated, reaches only 23.3% Pass@1, revealing significant limitations.
How does SWE-Bench Pro help improve AI models?
By clustering failure modes across many agent runs, it shows researchers where models break down, which guides improvements to training, scaffolding, and tooling.
Which companies and technologies are involved in SWE-Bench Pro?
Scale AI leads the project; the commercial set is sourced from partner startups, and tasks come from repositories spanning business applications, B2B services, and developer tools.
Why is SWE-Bench Pro contamination-resistant?
Its held-out and commercial problem sets are never released publicly, and the public set draws on copyleft-licensed repositories that are typically excluded from training corpora, so models cannot inflate their scores by memorizing solutions seen during training.
What are the current limitations of AI agents in software engineering?
AI agents struggle with complex, long-duration tasks, making frequent errors across a range of failure modes and showing limited autonomy.
How can you access SWE-Bench Pro results?
Results on the public and commercial sets are published, and the public problem set itself is released for research and development; the held-out set remains private to preserve contamination resistance.