CrossModel
Back to model catalog

OpenAI · Model guide

GPT-5.5

openai/gpt-5.5
Modalities
TextImageText
Context
1M
Max output
128K

GPT-5.5

Overview

GPT-5.5 is OpenAI's flagship model released on April 23, 2026. It is aimed at real computer work rather than single-turn question answering: code, research, data analysis, documents, spreadsheets, browsers, and file systems can all live in the same workflow while the model keeps a long-running goal in view.

OpenAI's release material emphasizes better task completion at latency close to GPT-5.4, and more efficient token use on Codex tasks. In the API, gpt-5.5 is the default flagship; gpt-5.5-pro is the slower, higher-compute sibling for the hardest work.

Key capabilities

DimensionDetail
Context window1,050,000 tokens (about 1.05M)
Max output128,000 tokens
Input modalitiesText, image
Output modalitiesText
Toolsfunction calling, structured outputs, streaming, web search, file search, image generation, code interpreter, hosted shell, apply patch, computer use, MCP

Inputs above 272K tokens enter a higher long-context tier (2x input and 1.5x output), covering standard, batch, and flex modes. See live pricing in the model catalog.

Benchmarks

GPT-5.5's evaluation story spans long-context retrieval, agentic coding, computer and knowledge work, security, and factual reliability. The theme is not winning every isolated benchmark; it is finishing long tasks more reliably than GPT-5.4.

Long-context retrieval

Long Context

Recovers key facts near the 1M-token range

MRCR v2 8-needle (512K-1M)
74.0%
GPT-5.4: 36.6%
Graphwalks BFS 1M
45.4%
F1 · GPT-5.4: 9.4%
Context window
1.05M
tokens

Compared with GPT-5.4 on the same axis, GPT-5.5 is not just larger-context; it is much more reliable from 512K to 1M.

On OpenAI MRCR v2 8-needle in the 512K-1M range, GPT-5.5 rises from GPT-5.4's 36.6% to 74.0%. Graphwalks BFS 1M F1 improves from 9.4% to 45.4%. This matters for large repos, long contracts, research packets, and multi-file tasks: the model does not just fit the context, it retrieves from it more reliably.

Coding and terminal work

Coding & Terminal

Agentic coding scores closer to real engineering

Terminal-Bench 2.0
82.7%
GPT-5.4: 75.1%
SWE-Bench Pro
58.6%
Public
Expert-SWE
73.1%

These tasks require reading the environment, running commands, editing code, observing failures, and iterating.

OpenAI called GPT-5.5 its strongest agentic coding model at release. It reaches 82.7% on Terminal-Bench 2.0, ahead of GPT-5.4 at 75.1%; 58.6% on SWE-Bench Pro Public; and 73.1% on Expert-SWE. These tests resemble real engineering loops: read the environment, run commands, edit, observe failure, and fix again.

Computer use, knowledge work, security, and reliability

Computer Use & Knowledge Work

Strength beyond code: documents, spreadsheets, and financial models

OSWorld-Verified
78.7%
GPT-5.4: 75.0%
GDPval
84.9%
wins or ties
IB Modeling
88.5%
investment-banking modeling
OfficeQA Pro
54.1%

From browser and desktop tasks to investment-banking modeling and office QA, GPT-5.5 is tuned for longer deliverable workflows.

OSWorld-Verified reaches 78.7%, above GPT-5.4's 75.0%. GDPval wins or ties is 84.9%, investment-banking modeling is 88.5%, and OfficeQA Pro is 54.1%.

Security

High-capability tier for vulnerability discovery and defensive work

CVE-Bench
93.1%
pass@1
Capture-the-Flags
88.1%
challenge tasks
CyberGym
81.8%

OpenAI treats cybersecurity capability as High capability, so production use should pair it with trusted access, logs, and permission boundaries.

Security evaluations include CVE-Bench 93.1% pass@1, Capture-the-Flags 88.1%, and CyberGym 81.8%.

Factual Reliability

More claims per answer, fewer factual errors

Response hallucination rate
9.2%
GPT-5.4: 9.5% (lower is better)
Claim hallucination rate
2.0%
GPT-5.4: 2.6% (lower is better)
Claims per response
13.5
GPT-5.4: 11.2

Lower hallucination while making more claims is especially valuable in long multi-step workflows.

OpenAI also reports lower hallucination rates despite more claims per response: response-level hallucination falls from 9.5% to 9.2%, and claim-level hallucination from 2.6% to 2.0%.

When to use it

  • Long-context engineering: codebases, test output, design docs, and issues in one window.
  • Agentic workflows: browser, terminal, file search, code interpreter, and MCP together.
  • Professional knowledge work: financial models, research, contracts, reports, and decks.
  • Security and defensive work: vulnerability analysis, code review, and configuration checks with proper permissions and audit.

CrossModel exposes GPT-5.5 through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.