
How Perplexity Optimized GPT-OSS: Infrastructure, Performance, and Cost of OpenAI’s New Models

Article Highlights:
  • Perplexity optimized OpenAI’s GPT-OSS models for inference on NVIDIA H200 GPUs
  • Targeted infrastructure choices enabled immediate support for new models
  • ROSE, the in-house inference engine, ensures flexibility and performance
  • Custom kernels for sink attention and MoE with bias
  • Detailed cost and performance analysis for various parallelism configurations
  • Rapid integration of Harmony format for chat and tool calls
  • Modular approach facilitates adoption of new models and formats
  • Tuned trade-offs between latency, throughput, and operational costs
  • Collaboration with OpenAI for timely adaptations
  • ROSE migrates critical components from Python to Rust for maximum efficiency

Introduction

Perplexity stands out for its mission: making the best AI models accessible to those seeking reliable answers and agentic actions. With OpenAI’s recent release of open-weight models, GPT-OSS-20B and GPT-OSS-120B, Perplexity was among the first to test and optimize them for large-scale inference.

Infrastructure and Hardware Choices

Initial evaluation of the GPT-OSS models was performed on NVIDIA H200 Hopper clusters, using MXFP4 quantization to fit the models into available memory. Because Hopper lacks FP4 tensor cores, Perplexity opted for FP8 precision, maximizing performance with minimal kernel changes.
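MXFP4 stores weights as 4-bit floating-point values that share one power-of-two scale per small block, so on hardware without FP4 tensor cores they are upcast before the matmul runs in a higher precision. A minimal sketch of such a dequantization step, assuming an already-unpacked, illustrative layout (not the exact GPT-OSS checkpoint format):

```python
import torch

# The 16 e2m1 (FP4) code points; the sign bit is the high bit of the nibble.
FP4_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_mxfp4(codes: torch.Tensor, scale_exponents: torch.Tensor) -> torch.Tensor:
    """codes: [n_blocks, block] int tensor of 4-bit code points (already unpacked).
    scale_exponents: [n_blocks] int tensor, one shared power-of-two exponent per block."""
    values = FP4_VALUES[codes.long()]                              # look up FP4 values
    scales = torch.pow(2.0, scale_exponents.float()).unsqueeze(-1) # per-block scale
    return (values * scales).to(torch.bfloat16)                    # upcast for compute
```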

Transformer Model Architecture

  • Input embedding
  • Sequence of transformer layers with attention and MLP/MoE blocks
  • Final logit projection
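A minimal sketch of that overall structure, with generic dense blocks standing in for the GQA-with-sinks attention and sparse MoE layers covered below (module names are illustrative, not ROSE's):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm attention followed by a pre-norm MLP, both with residual connections."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        seq = x.shape[1]
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

class DecoderLM(nn.Module):
    """Input embedding -> stack of transformer blocks -> final logit projection."""
    def __init__(self, vocab_size: int, d_model: int, n_layers: int, n_heads: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads) for _ in range(n_layers)])
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(ids)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)
```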

Thanks to collaboration with OpenAI, Perplexity quickly adapted its infrastructure, implementing targeted changes such as:

  • GQA attention with sink parameters
  • YaRN positional encoding
  • QKV projections with bias
  • MLP with SwiGLU activation and sparse MoE
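As one concrete illustration of the last item, a minimal SwiGLU feed-forward block with bias terms; in GPT-OSS this kind of block sits inside each MoE expert, and the layer names here are illustrative rather than ROSE's actual ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=True)
        self.up = nn.Linear(d_model, d_ff, bias=True)
        self.down = nn.Linear(d_ff, d_model, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```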

ROSE Inference Engine

The core of Perplexity’s inference is ROSE, a flexible engine enabling rapid integration of new models and performance optimization. ROSE, mainly written in Python with critical components migrated to Rust, manages token generation via multiple decoders and supports advanced quantization and parallelism configurations.
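ROSE's real decoders batch many requests, reuse KV caches, and sample; stripped of all of that, the core loop each one implements looks roughly like this deliberately simplified greedy sketch:

```python
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids: torch.Tensor, max_new_tokens: int, eos_id: int) -> torch.Tensor:
    """Generate tokens one at a time (greedy, no KV cache, no batching)."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids.unsqueeze(0))[0, -1]   # logits for the last position
        next_id = logits.argmax().view(1)         # pick the most likely token
        ids = torch.cat([ids, next_id])
        if next_id.item() == eos_id:
            break
    return ids
```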

New Model Integration Process

  1. Define model hierarchy
  2. Convert weights
  3. Implement forward pass
  4. Test and optimize parallelism
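A hedged sketch of step 2, weight conversion: renaming the released checkpoint's tensors to match the module hierarchy defined in step 1 (the mapping below is hypothetical, not the actual GPT-OSS tensor names):

```python
import torch

# Hypothetical mapping from checkpoint tensor names to the engine's module tree.
NAME_MAP = {
    "model.embed_tokens.weight": "embed.weight",
    "lm_head.weight": "lm_head.weight",
}

def convert_checkpoint(src_path: str, dst_path: str) -> None:
    """Load a checkpoint, rename its tensors, and save it for the inference engine."""
    src = torch.load(src_path, map_location="cpu")
    dst = {NAME_MAP.get(name, name): tensor for name, tensor in src.items()}
    torch.save(dst, dst_path)
```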

Custom Kernels for GPT-OSS

Perplexity adapted FlashInfer and DeepGEMM kernels to support GPT-OSS specifics, such as sink attention and bias handling in the MoE layers. These changes preserved numerical stability while keeping the kernels both accurate and fast.
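For reference, what "sink attention" means numerically: each head carries a learned sink logit that joins the softmax normalization but contributes no value, so it only absorbs attention mass. A plain PyTorch reference of the idea (not the adapted FlashInfer kernel; causal masking omitted):

```python
import torch

def attention_with_sinks(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                         sinks: torch.Tensor, scale: float) -> torch.Tensor:
    """q: [heads, q_len, d]; k, v: [heads, kv_len, d]; sinks: [heads]."""
    scores = torch.einsum("hqd,hkd->hqk", q, k) * scale
    sink_col = sinks.view(-1, 1, 1).expand(-1, scores.shape[1], 1)  # one extra logit per row
    probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
    return torch.einsum("hqk,hkd->hqd", probs[..., :-1], v)         # drop the sink column
```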

Tokenization and Chat Formats

OpenAI’s Harmony format introduces channels and roles in messages, enhancing transparency and segmentation of responses. Perplexity integrated Harmony via a dedicated frontend formatter, keeping the backend unchanged and enabling modularity and rapid adoption of new formats.
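The gist of channels and roles can be shown with a small formatter sketch; the special-token markers below are illustrative of the structure only, since the exact tokens and rules are defined by the Harmony renderer itself:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Message:
    role: str                      # e.g. "system", "user", "assistant"
    content: str
    channel: Optional[str] = None  # e.g. "analysis" or "final" on assistant turns

def render(messages: List[Message]) -> str:
    """Flatten role/channel-tagged messages into a single prompt string."""
    parts = []
    for m in messages:
        header = m.role if m.channel is None else f"{m.role}<|channel|>{m.channel}"
        parts.append(f"<|start|>{header}<|message|>{m.content}<|end|>")
    return "".join(parts)

print(render([Message("user", "Hi"), Message("assistant", "Hello!", channel="final")]))
```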

Cost and Performance Analysis

The compact size of the GPT-OSS models makes it possible to keep each replica on a single node, reducing communication costs. Perplexity tested various parallelism strategies (EP, DP, TP) to find the best balance between latency and cost, with results broken down by batch size and sequence length.

  • Batch size 1 with DP=1 works best for the prefill phase
  • An EP4 DP1 TP4 configuration is ideal for the decode phase
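The arithmetic behind such comparisons is simple: divide the replica's dollar cost per second by its measured token throughput. A small sketch with placeholder numbers (not Perplexity's measurements):

```python
def cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                            tokens_per_second: float) -> float:
    """Dollars per million generated tokens for one replica."""
    dollars_per_second = gpu_hourly_usd * num_gpus / 3600.0
    return dollars_per_second / tokens_per_second * 1_000_000

# Example with placeholder figures: 8 GPUs at $3/hour sustaining 5,000 tokens/s.
print(f"${cost_per_million_tokens(3.0, 8, 5000.0):.2f} per 1M tokens")
```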

Conclusions

Thanks to targeted infrastructure choices and kernel optimizations, Perplexity enabled Day-0 inference for GPT-OSS models, ensuring high performance and low costs. ROSE’s modular and flexible approach allows rapid adaptation to generative AI advancements.
