Introduction
OpenAI has unveiled groundbreaking results with the launch of GDPval, a new benchmark that compares its artificial intelligence models' performance against human professional experts. According to the initial results, GPT-5 is reaching quality levels comparable to industry specialists across numerous professions.
The New GDPval Benchmark
GDPval represents an innovative attempt to measure how close AI models are to matching human experts at economically valuable work. The benchmark is built around the nine industries that contribute most to America's gross domestic product, including crucial sectors like healthcare, finance, manufacturing, and government.
The test evaluates AI performance across 44 different occupations, ranging from software engineers to nurses to journalists. For the first version, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports with those produced by other professionals and pick the better of the two.
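As a rough illustration of how such a pairwise protocol yields a headline number (the data and field names below are hypothetical, not OpenAI's actual grading pipeline), the "better than or on par with" statistic reduces to the share of comparisons in which graders rated the model's deliverable a win or a tie against the expert's:

```python
from collections import Counter

# Hypothetical grader verdicts from blinded pairwise comparisons:
# each entry records whether the grader preferred the model's deliverable,
# the human professional's, or judged them a tie.
judgments = ["model", "human", "tie", "model", "human", "human", "tie", "model"]

counts = Counter(judgments)
total = len(judgments)

# "Better than or on par with" = wins plus ties, as a share of all comparisons.
wins_or_ties = (counts["model"] + counts["tie"]) / total
print(f"Model rated better than or on par with experts in {wins_or_ties:.1%} of tasks")
```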
GPT-5's Surprising Results
GPT-5-high, a configuration of GPT-5 that applies additional reasoning compute, was ranked as better than or on par with industry experts 40.6% of the time. That is a significant jump over GPT-4o, released roughly 15 months earlier, which scored only 13.7%.
"[Because] the model is getting good at some of these things, people in those jobs can now use the model, increasingly as capabilities get better, to offload some of their work and do potentially higher value things."
Dr. Aaron Chatterji, Chief Economist at OpenAI
Comparison with Other Models
OpenAI also tested Anthropic's Claude Opus 4.1, which fared even better, ranking as better than or on par with experts in 49% of tasks. However, OpenAI attributes much of that score to Claude's knack for producing visually pleasing graphics rather than to stronger underlying performance.
Current Limitations and Future Developments
It's important to note that most working professionals do much more than submit research reports, which is all that GDPval-v0 currently tests. OpenAI acknowledges this limitation and plans to create more robust tests that can account for more industries and interactive workflows.
Despite these limitations, the company considers the progress on GDPval notable. Tejal Patwardhan, OpenAI's evaluations lead, is encouraged by the rate of progress and expects the trend to continue.
Implications for the Future of Work
The results suggest that professionals in these roles can increasingly offload parts of their work to AI models and focus on higher-value tasks. This doesn't mean OpenAI's models will immediately start replacing humans in their jobs, but it represents a significant step toward artificial general intelligence (AGI).
Conclusion
GDPval represents an important measurement tool for evaluating AI progress toward human-level capabilities. As traditional benchmarks approach saturation, tests like GDPval could become increasingly important for assessing AI proficiency on real-world tasks.
FAQ
What is OpenAI's GDPval benchmark?
GDPval is a new test that compares OpenAI's AI models' performance against human professional experts across 44 different occupations in nine key industries.
How does GPT-5 perform compared to human experts?
GPT-5-high was ranked as better than or on par with industry experts in 40.6% of tested cases, representing significant improvement over previous models.
Which professions does the GDPval test include?
The benchmark covers 44 occupations across sectors like healthcare, finance, manufacturing, and government, including roles such as software engineers, nurses, and journalists.
Will GPT-5 replace human workers?
Not currently. The test covers only a limited set of tasks, and OpenAI suggests AI can help professionals focus on higher-value activities rather than replace them.
How does Claude Opus 4.1 compare to GPT-5?
Claude Opus 4.1 achieved 49% in tests, but OpenAI attributes this result primarily to its ability to create appealing graphics.
What are the limitations of the GDPval benchmark?
GDPval-v0 currently only tests report production, while professionals perform many other activities. OpenAI plans more comprehensive versions of the test.