Introduction
Perplexity's mission is to make the best AI models accessible to everyone who needs reliable answers and agentic actions. With OpenAI's recent release of the open-weight models GPT-OSS-20B and GPT-OSS-120B, Perplexity was among the first to test and optimize them for large-scale inference.
Infrastructure and Hardware Choices
Initial evaluation of the GPT-OSS models was performed on NVIDIA H200 (Hopper) clusters, leveraging the models' MXFP4 quantization to fit them into available memory. Because Hopper lacks FP4 tensor cores, Perplexity opted to run compute in FP8 precision, maximizing performance with minimal kernel changes.
Transformer Model Architecture
- Input embedding
- Sequence of transformer layers with attention and MLP/MoE blocks
- Final logit projection
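
A minimal, self-contained PyTorch sketch of this structure follows. It is illustrative only: module names are hypothetical, the MLP is dense rather than GPT-OSS's sparse MoE, and causal masking and KV caching are omitted.

```python
import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    """One transformer block: self-attention followed by an MLP (dense here; GPT-OSS uses a sparse MoE)."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.attn_norm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Causal masking omitted for brevity.
        x = self.attn_norm(hidden)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        hidden = hidden + attn_out
        return hidden + self.mlp(self.mlp_norm(hidden))


class TransformerLM(nn.Module):
    """Decoder-only skeleton: input embedding -> transformer layers -> final logit projection."""

    def __init__(self, vocab_size: int, hidden_size: int, num_layers: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)             # input embedding
        self.layers = nn.ModuleList(DecoderLayer(hidden_size) for _ in range(num_layers))
        self.final_norm = nn.LayerNorm(hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # final logit projection

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(token_ids)
        for layer in self.layers:
            hidden = layer(hidden)
        return self.lm_head(self.final_norm(hidden))
```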
Thanks to collaboration with OpenAI, Perplexity quickly adapted its infrastructure, implementing targeted changes such as:
- GQA attention with sink parameters
- YaRN positional encoding
- QKV projections with bias
- MLP with SwiGLU activation and sparse MoE
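
Among these, the sparse MoE block with SwiGLU experts is where most of the FLOPs live. Below is a reference-style sketch of the routed expert computation, looped per token for clarity; it is not the fused, batched path used in production, and the exact routing details (top-k, normalization) are assumptions.

```python
import torch
import torch.nn.functional as F


def swiglu_expert(x, w_gate, b_gate, w_up, b_up, w_down, b_down):
    """SwiGLU MLP expert with bias terms: down( silu(gate(x)) * up(x) )."""
    gate = F.silu(x @ w_gate.T + b_gate)
    up = x @ w_up.T + b_up
    return (gate * up) @ w_down.T + b_down


def sparse_moe(x, router_weight, experts, top_k: int = 4):
    """Route each token to its top-k experts and mix outputs by softmaxed router scores.

    `experts` is a list of per-expert parameter tuples for `swiglu_expert`.
    Reference implementation only; production kernels batch tokens per expert.
    """
    scores = x @ router_weight.T                       # [tokens, num_experts]
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)
    weights = F.softmax(topk_scores, dim=-1)           # normalize over the selected experts
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for tok in range(x.shape[0]):
            e = topk_idx[tok, slot].item()
            out[tok] += weights[tok, slot] * swiglu_expert(x[tok], *experts[e])
    return out
```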
ROSE Inference Engine
The core of Perplexity’s inference is ROSE, a flexible engine enabling rapid integration of new models and performance optimization. ROSE, mainly written in Python with critical components migrated to Rust, manages token generation via multiple decoders and supports advanced quantization and parallelism configurations.
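
At its core, token generation is an iterative decode loop around the model's forward pass. A minimal greedy-decoding sketch is shown below; ROSE itself batches requests, maintains KV caches, and supports multiple decoding strategies, so this is only the conceptual skeleton.

```python
import torch


@torch.no_grad()
def greedy_decode(model, prompt_ids: torch.Tensor, max_new_tokens: int, eos_id: int) -> torch.Tensor:
    """Append one argmax token per step until EOS or the token budget is reached.

    `model` maps token ids [1, seq] to logits [1, seq, vocab]. A production engine
    would reuse a KV cache instead of re-running the full prefix at every step.
    """
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                      # [1, seq, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1)   # greedy pick at the last position
        tokens = torch.cat([tokens, next_id[:, None]], dim=-1)
        if next_id.item() == eos_id:
            break
    return tokens
```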
New Model Integration Process
- Define model hierarchy
- Convert weights
- Implement forward pass
- Test and optimize parallelism
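
The weight-conversion step is typically a mechanical renaming of checkpoint tensors into the engine's module hierarchy. The sketch below uses hypothetical naming conventions on both sides; neither reflects ROSE's nor the released checkpoint's exact layout.

```python
import re

# Hypothetical mapping from checkpoint tensor names to engine parameter names.
NAME_MAP = [
    (r"^model\.embed_tokens\.weight$", "embed.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.(q|k|v)_proj\.(weight|bias)$",
     r"layers.\1.attn.\2_proj.\3"),
    (r"^model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.(\w+)$",
     r"layers.\1.moe.experts.\2.\3"),
    (r"^lm_head\.weight$", "lm_head.weight"),
]


def convert_checkpoint(source_state: dict) -> dict:
    """Rename checkpoint tensors into the engine's layout; unmapped keys raise an error."""
    converted, unmatched = {}, []
    for name, tensor in source_state.items():
        for pattern, target in NAME_MAP:
            new_name, n = re.subn(pattern, target, name)
            if n:
                converted[new_name] = tensor
                break
        else:
            unmatched.append(name)
    if unmatched:
        raise KeyError(f"unmapped tensors: {unmatched[:5]}...")
    return converted
```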
Custom Kernels for GPT-OSS
Perplexity adapted FlashInfer and DeepGEMM kernels to support GPT-OSS specifics such as sink attention and bias handling in the MoE layers. These changes kept the kernels efficient while preserving numerical stability and accuracy.
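
Sink attention adds a learned per-head logit to the softmax normalizer, letting each query assign some probability mass to "nothing" rather than to real tokens, which standard attention kernels do not account for. Below is a reference-style sketch of the idea; the production path runs inside the adapted FlashInfer kernels, not an eager implementation like this, and causal masking is omitted.

```python
import torch


def attention_with_sink(q, k, v, sink_logit, scale):
    """Scaled dot-product attention with a per-head sink term in the softmax denominator.

    q: [heads, q_len, dim], k/v: [heads, kv_len, dim], sink_logit: [heads].
    The sink contributes to normalization but has no value vector, so the weights
    on real tokens sum to less than one when the sink absorbs probability mass.
    """
    scores = torch.einsum("hqd,hkd->hqk", q, k) * scale            # [heads, q_len, kv_len]
    sink = sink_logit[:, None, None].expand(-1, scores.shape[1], 1)
    logits = torch.cat([scores, sink], dim=-1)                     # append the sink logit
    probs = torch.softmax(logits, dim=-1)[..., :-1]                # drop the sink column
    return torch.einsum("hqk,hkd->hqd", probs, v)
```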
Tokenization and Chat Formats
OpenAI’s Harmony format introduces channels and roles in messages, making responses more transparent and clearly segmented. Perplexity integrated Harmony via a dedicated frontend formatter, keeping the backend unchanged; this modularity enables rapid adoption of new formats.
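
Because formatting lives entirely in the frontend, the backend only ever sees flat token sequences, and adopting a new chat format means implementing one more formatter. The interface sketch below uses hypothetical names and does not emit real Harmony tokens (OpenAI provides a reference implementation of Harmony rendering).

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Message:
    role: str               # e.g. "system", "user", "assistant"
    channel: str | None     # Harmony adds channels, e.g. reasoning vs. final answer
    content: str


class ChatFormatter(Protocol):
    """Frontend interface: turn structured messages into a flat prompt string."""

    def render(self, messages: list[Message]) -> str: ...


class HarmonyLikeFormatter:
    """Placeholder rendering for illustration; real Harmony output uses OpenAI's special tokens."""

    def render(self, messages: list[Message]) -> str:
        parts = []
        for m in messages:
            channel = f" [{m.channel}]" if m.channel else ""
            parts.append(f"<{m.role}{channel}> {m.content}")
        return "\n".join(parts)
```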
Cost and Performance Analysis
The compact size of the GPT-OSS models allows each replica to fit on a single node, reducing communication costs. Perplexity tested various parallelism strategies (EP, DP, TP) to find the best balance between latency and cost, profiling across batch sizes and sequence lengths (a sketch of such a sweep follows the highlights below).
- Batch size 1 with DP=1 optimizes the prefill phase
- An EP4 DP1 TP4 configuration is ideal for the decode phase
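
The trade-off study can be framed as a sweep over candidate parallelism layouts, measuring latency and per-token cost at each operating point. The harness below is a simplified sketch: the candidate list, batch/sequence grid, and benchmark hook are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ParallelConfig:
    ep: int   # expert parallelism (MoE experts split across GPUs)
    dp: int   # data parallelism (independent model replicas)
    tp: int   # tensor parallelism (weights sharded within a replica)


# Illustrative candidates; which combinations fit on a single 8-GPU node depends
# on how the engine maps EP/TP groups onto devices.
CANDIDATES = [
    ParallelConfig(ep=1, dp=1, tp=8),
    ParallelConfig(ep=4, dp=1, tp=4),   # the decode-optimal point cited above
    ParallelConfig(ep=8, dp=1, tp=1),
]


def sweep(benchmark, batch_sizes=(1, 8, 32), seq_lens=(1024, 8192)):
    """Call a user-supplied benchmark(config, batch, seq) -> (latency_ms, cost_per_mtok)
    for every combination and return the raw measurements for offline analysis."""
    return {
        (cfg, batch, seq): benchmark(cfg, batch, seq)
        for cfg, batch, seq in product(CANDIDATES, batch_sizes, seq_lens)
    }
```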
Conclusions
Thanks to targeted infrastructure choices and kernel optimizations, Perplexity enabled Day-0 inference for GPT-OSS models, ensuring high performance and low costs. ROSE’s modular and flexible approach allows rapid adaptation to generative AI advancements.