
FrontierScience: OpenAI Resets the Bar for AI Scientific Research with GPT-5.2

Article Highlights:
  • OpenAI launches FrontierScience (Dec 2025) to replace the saturated GPQA benchmark, on which top models now score 92%.
  • The benchmark features two tracks: 'Olympiad' (constrained calculation) and 'Research' (open-ended PhD-level reasoning).
  • GPT-5.2 is the top performer but only scores 25.2% in the Research track, showing significant room for improvement.
  • Gemini 3 Pro and Claude Opus 4.5 trail behind, with a significant gap in open-ended research tasks.
  • The system introduces rubric-based AI grading to evaluate the thought process, not just the final answer.

Introduction

December 16, 2025, marks a turning point in evaluating artificial intelligence applied to hard sciences. OpenAI has unveiled FrontierScience, a new benchmark designed to make previous tests obsolete, as they have been saturated by frontier model capabilities. While the well-known GPQA (Google-Proof Q&A) benchmark saw models jump from 39% to 92% accuracy in just two years, FrontierScience introduces such complexity that even the new GPT-5.2 stalls at 25% in the pure research track.

This move isn't just academic: it is a direct response to the need to measure how AI can act as an autonomous lab partner, not just a glorified search engine. With competitors like Google's Gemini 3 Pro and Anthropic's Claude Opus 4.5 closing in, FrontierScience becomes the new battleground for supremacy in scientific AI.

Analysis and Technical Details

FrontierScience is not a simple multiple-choice quiz. It is divided into two distinct tracks that evaluate complementary skills; a minimal code sketch of this structure follows the list:

  • Olympiad Track: 100 problems designed by gold medalists from international olympiads (IPhO, IChO, IBO). This measures constrained mathematical and theoretical reasoning. GPT-5.2 leads with 77.1%, narrowly ahead of Gemini 3 Pro (76.1%).
  • Research Track: The real innovation. 60 original tasks created by PhDs, simulating open-ended research problems (e.g., optimizing molecular cloning protocols or analyzing electronic structures in inorganic chemistry). In this domain, scores drop drastically: GPT-5.2 leads with a modest 25.2%, highlighting how far AI still is from complete scientific autonomy.
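
To make the two-track split concrete, here is a minimal Python sketch of how an evaluation harness might represent it. The class names and fields are illustrative assumptions based on the description above, not OpenAI's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Track(Enum):
    OLYMPIAD = "olympiad"  # 100 constrained-calculation problems
    RESEARCH = "research"  # 60 open-ended, PhD-authored tasks

@dataclass
class Task:
    task_id: str
    track: Track
    prompt: str
    # Olympiad tasks carry a single verifiable answer; Research tasks
    # instead carry a point-weighted rubric (see the grading sketch below).
    reference_answer: str | None = None
    rubric: list[tuple[str, int]] | None = None

def needs_rubric_grading(task: Task) -> bool:
    """Research answers cannot be checked by string matching."""
    return task.track is Track.RESEARCH
```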

The Rubric Grading System

Unlike past benchmarks based on string matching (exact answers), the Research track uses a 10-point rubric that assesses the intermediate thought process, not just the final result. To scale evaluation, OpenAI uses a model-based grader: an instance of GPT-5 scores each response against rigorous criteria defined by human experts. A solution is considered valid only if it earns at least 7 of the 10 points.
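
As a rough illustration of how such rubric-based grading can work, the sketch below scores a response against a point-weighted rubric and applies the 7/10 pass threshold. The sample criteria and the `grade_criterion` stub are assumptions for illustration; in a real system the stub would call the grader model rather than match keywords.

```python
PASS_THRESHOLD = 7  # a solution is valid only at >= 7 of 10 points

# Illustrative rubric for a molecular-cloning protocol task; the real
# criteria are written by human experts and are not public.
RUBRIC: list[tuple[str, int]] = [
    ("identifies appropriate restriction sites", 2),
    ("justifies the vector-to-insert ratio", 2),
    ("plans a correct ligation and transformation sequence", 3),
    ("proposes validation such as sequencing or colony PCR", 3),
]

def grade_criterion(response: str, criterion: str) -> bool:
    """Stand-in for the model-based grader. A real implementation would
    prompt the grader model with the task, the criterion, and the
    response, then parse a yes/no judgment; this trivial keyword check
    only keeps the sketch runnable."""
    return criterion.split()[-1].lower() in response.lower()

def grade_response(response: str) -> tuple[int, bool]:
    """Sum the points of satisfied criteria and apply the threshold."""
    score = sum(points for criterion, points in RUBRIC
                if grade_criterion(response, criterion))
    return score, score >= PASS_THRESHOLD
```

Grading intermediate steps this way rewards a sound approach even when the final answer is imperfect, which is exactly what the thought-process evaluation described above is meant to capture.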

Market Impact and Competitors

The introduction of FrontierScience comes at a critical time. According to recent data, the GPQA benchmark is now considered "solved" by top-tier models (GPT-5.2 at 92%), rendering it useless for discriminating advanced reasoning capabilities. FrontierScience reopens the gap.

Updated Leaderboard (December 2025)

In the "Research" track, the most relevant to enterprise and academic use, OpenAI's lead is clear:

  • GPT-5.2: 25.2%
  • Claude Opus 4.5: 17.5%
  • Grok 4: 15.9%
  • Gemini 3 Pro: 12.4%

These data suggest that while Google and Anthropic models are excellent at pure calculation (Olympiad), they still struggle to handle the ambiguity and multi-step planning typical of real research. Real-world use cases are already emerging: partners like Red Queen Bio have used preliminary versions of these systems to optimize lab protocols, reducing iteration cycles from weeks to hours. However, the 25% success rate indicates that "Human-in-the-loop" supervision remains not just recommended, but mandatory.

Conclusion

FrontierScience serves as the "North Star" for 2026. If FrontierMath tested pure logic, this benchmark tests the capacity for discovery. The message for enterprises and R&D labs is clear: current models like GPT-5.2 are ready to accelerate structured tasks (literature review, complex calculations), but they cannot yet replace expert judgment in defining new hypotheses. The race to 100% on FrontierScience has officially begun.

FAQ

What is FrontierScience?

FrontierScience is OpenAI's new benchmark released in December 2025 to evaluate AI scientific reasoning. It includes Olympiad-level problems and open-ended research tasks created by PhD scientists.

Why does FrontierScience replace GPQA?

GPQA is considered saturated: GPT-5.2 scores 92% on that test. FrontierScience is much harder, with top models reaching only 25% in the research track, offering a more useful metric for future progress.

Which model is the most powerful on FrontierScience?

Currently, GPT-5.2 is the undisputed leader, scoring 77.1% on the Olympiad track and 25.2% on the Research track, outperforming Gemini 3 Pro and Claude Opus 4.5.

How is the Research track graded?

It is graded via a 10-point rubric that analyzes intermediate reasoning steps. A response is considered correct if it scores at least 7 out of 10 points, evaluated by a model-based grading system.
