Introduction
Perplexity's mission is to make the best AI models accessible to everyone who needs reliable answers and agentic actions. With OpenAI's recent release of the open-weight models GPT-OSS-20B and GPT-OSS-120B, Perplexity was among the first to test and optimize them for large-scale inference.
Infrastructure and Hardware Choices
Initial evaluation of the GPT-OSS models was performed on NVIDIA H200 (Hopper) clusters, leveraging the models' MXFP4 quantization to fit them into available memory. Because Hopper lacks FP4 tensor cores, Perplexity opted to run compute in FP8 precision, maximizing performance with minimal kernel changes.
Transformer Model Architecture
- Input embedding
- Sequence of transformer layers with attention and MLP/MoE blocks
- Final logit projection
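
A minimal, self-contained PyTorch sketch of this structure follows. It is illustrative only: module names are hypothetical, the MLP is dense rather than GPT-OSS's sparse MoE, and causal masking and KV caching are omitted.

```python
import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    """One transformer block: self-attention followed by an MLP (dense here; GPT-OSS uses a sparse MoE)."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.attn_norm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Causal masking omitted for brevity.
        x = self.attn_norm(hidden)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        hidden = hidden + attn_out
        return hidden + self.mlp(self.mlp_norm(hidden))


class TransformerLM(nn.Module):
    """Decoder-only skeleton: input embedding -> transformer layers -> final logit projection."""

    def __init__(self, vocab_size: int, hidden_size: int, num_layers: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)             # input embedding
        self.layers = nn.ModuleList(DecoderLayer(hidden_size) for _ in range(num_layers))
        self.final_norm = nn.LayerNorm(hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # final logit projection

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(token_ids)
        for layer in self.layers:
            hidden = layer(hidden)
        return self.lm_head(self.final_norm(hidden))
```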
Thanks to collaboration with OpenAI, Perplexity quickly adapted its infrastructure, implementing targeted changes such as:
- GQA attention with sink parameters
- YaRN positional encoding
- QKV projections with bias
- MLP with SwiGLU activation and sparse MoE
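
Among these, the sparse MoE block with SwiGLU experts is where most of the FLOPs live. Below is a reference-style sketch of the routed expert computation, looped per token for clarity; it is not the fused, batched path used in production, and the exact routing details (top-k, normalization) are assumptions.

```python
import torch
import torch.nn.functional as F


def swiglu_expert(x, w_gate, b_gate, w_up, b_up, w_down, b_down):
    """SwiGLU MLP expert with bias terms: down( silu(gate(x)) * up(x) )."""
    gate = F.silu(x @ w_gate.T + b_gate)
    up = x @ w_up.T + b_up
    return (gate * up) @ w_down.T + b_down


def sparse_moe(x, router_weight, experts, top_k: int = 4):
    """Route each token to its top-k experts and mix outputs by softmaxed router scores.

    `experts` is a list of per-expert parameter tuples for `swiglu_expert`.
    Reference implementation only; production kernels batch tokens per expert.
    """
    scores = x @ router_weight.T                       # [tokens, num_experts]
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)
    weights = F.softmax(topk_scores, dim=-1)           # normalize over the selected experts
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for tok in range(x.shape[0]):
            e = topk_idx[tok, slot].item()
            out[tok] += weights[tok, slot] * swiglu_expert(x[tok], *experts[e])
    return out
```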
ROSE Inference Engine
The core of Perplexity’s inference is ROSE, a flexible engine enabling rapid integration of new models and performance optimization. ROSE, mainly written in Python with critical components migrated to Rust, manages token generation via multiple decoders and supports advanced quantization and parallelism configurations.
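
At its core, token generation is an iterative decode loop around the model's forward pass. A minimal greedy-decoding sketch is shown below; ROSE itself batches requests, maintains KV caches, and supports multiple decoding strategies, so this is only the conceptual skeleton.

```python
import torch


@torch.no_grad()
def greedy_decode(model, prompt_ids: torch.Tensor, max_new_tokens: int, eos_id: int) -> torch.Tensor:
    """Append one argmax token per step until EOS or the token budget is reached.

    `model` maps token ids [1, seq] to logits [1, seq, vocab]. A production engine
    would reuse a KV cache instead of re-running the full prefix at every step.
    """
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                      # [1, seq, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1)   # greedy pick at the last position
        tokens = torch.cat([tokens, next_id[:, None]], dim=-1)
        if next_id.item() == eos_id:
            break
    return tokens
```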
New Model Integration Process
- Define model hierarchy
- Convert weights
- Implement forward pass
- Test and optimize parallelism
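
The weight-conversion step is typically a mechanical renaming of checkpoint tensors into the engine's module hierarchy. The sketch below uses hypothetical naming conventions on both sides; neither reflects ROSE's nor the released checkpoint's exact layout.

```python
import re

# Hypothetical mapping from checkpoint tensor names to engine parameter names.
NAME_MAP = [
    (r"^model\.embed_tokens\.weight$", "embed.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.(q|k|v)_proj\.(weight|bias)$",
     r"layers.\1.attn.\2_proj.\3"),
    (r"^model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.(\w+)$",
     r"layers.\1.moe.experts.\2.\3"),
    (r"^lm_head\.weight$", "lm_head.weight"),
]


def convert_checkpoint(source_state: dict) -> dict:
    """Rename checkpoint tensors into the engine's layout; unmapped keys raise an error."""
    converted, unmatched = {}, []
    for name, tensor in source_state.items():
        for pattern, target in NAME_MAP:
            new_name, n = re.subn(pattern, target, name)
            if n:
                converted[new_name] = tensor
                break
        else:
            unmatched.append(name)
    if unmatched:
        raise KeyError(f"unmapped tensors: {unmatched[:5]}...")
    return converted
```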
Custom Kernels for GPT-OSS
Perplexity adapted FlashInfer and DeepGEMM kernels to support GPT-OSS specifics such as sink attention and bias handling in the MoE layers. These changes kept the kernels efficient while preserving numerical stability and accuracy.
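
Sink attention adds a learned per-head logit to the softmax normalizer, letting each query assign some probability mass to "nothing" rather than to real tokens, which standard attention kernels do not account for. Below is a reference-style sketch of the idea; the production path runs inside the adapted FlashInfer kernels, not an eager implementation like this, and causal masking is omitted.

```python
import torch


def attention_with_sink(q, k, v, sink_logit, scale):
    """Scaled dot-product attention with a per-head sink term in the softmax denominator.

    q: [heads, q_len, dim], k/v: [heads, kv_len, dim], sink_logit: [heads].
    The sink contributes to normalization but has no value vector, so the weights
    on real tokens sum to less than one when the sink absorbs probability mass.
    """
    scores = torch.einsum("hqd,hkd->hqk", q, k) * scale            # [heads, q_len, kv_len]
    sink = sink_logit[:, None, None].expand(-1, scores.shape[1], 1)
    logits = torch.cat([scores, sink], dim=-1)                     # append the sink logit
    probs = torch.softmax(logits, dim=-1)[..., :-1]                # drop the sink column
    return torch.einsum("hqk,hkd->hqd", probs, v)
```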
Tokenization and Chat Formats
OpenAI’s Harmony format introduces channels and roles in messages, making responses more transparent and clearly segmented. Perplexity integrated Harmony via a dedicated frontend formatter, keeping the backend unchanged; this modularity enables rapid adoption of new formats.
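
Because formatting lives entirely in the frontend, the backend only ever sees flat token sequences, and adopting a new chat format means implementing one more formatter. The interface sketch below uses hypothetical names and does not emit real Harmony tokens (OpenAI provides a reference implementation of Harmony rendering).

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Message:
    role: str               # e.g. "system", "user", "assistant"
    channel: str | None     # Harmony adds channels, e.g. reasoning vs. final answer
    content: str


class ChatFormatter(Protocol):
    """Frontend interface: turn structured messages into a flat prompt string."""

    def render(self, messages: list[Message]) -> str: ...


class HarmonyLikeFormatter:
    """Placeholder rendering for illustration; real Harmony output uses OpenAI's special tokens."""

    def render(self, messages: list[Message]) -> str:
        parts = []
        for m in messages:
            channel = f" [{m.channel}]" if m.channel else ""
            parts.append(f"<{m.role}{channel}> {m.content}")
        return "\n".join(parts)
```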
Cost and Performance Analysis
The compact size of the GPT-OSS models allows each replica to fit on a single node, reducing communication costs. Perplexity tested various parallelism strategies (EP, DP, TP) to find the best balance between latency and cost, profiling across batch sizes and sequence lengths (a sketch of such a sweep follows the highlights below).
- Batch size 1 with DP=1 optimizes the prefill phase
- An EP4 DP1 TP4 configuration is ideal for the decode phase
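
The trade-off study can be framed as a sweep over candidate parallelism layouts, measuring latency and per-token cost at each operating point. The harness below is a simplified sketch: the candidate list, batch/sequence grid, and benchmark hook are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ParallelConfig:
    ep: int   # expert parallelism (MoE experts split across GPUs)
    dp: int   # data parallelism (independent model replicas)
    tp: int   # tensor parallelism (weights sharded within a replica)


# Illustrative candidates; which combinations fit on a single 8-GPU node depends
# on how the engine maps EP/TP groups onto devices.
CANDIDATES = [
    ParallelConfig(ep=1, dp=1, tp=8),
    ParallelConfig(ep=4, dp=1, tp=4),   # the decode-optimal point cited above
    ParallelConfig(ep=8, dp=1, tp=1),
]


def sweep(benchmark, batch_sizes=(1, 8, 32), seq_lens=(1024, 8192)):
    """Call a user-supplied benchmark(config, batch, seq) -> (latency_ms, cost_per_mtok)
    for every combination and return the raw measurements for offline analysis."""
    return {
        (cfg, batch, seq): benchmark(cfg, batch, seq)
        for cfg, batch, seq in product(CANDIDATES, batch_sizes, seq_lens)
    }
```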
Conclusions
Thanks to targeted infrastructure choices and kernel optimizations, Perplexity enabled Day-0 inference for GPT-OSS models, ensuring high performance and low costs. ROSE’s modular and flexible approach allows rapid adaptation to generative AI advancements.