
Overview
GPT-5.5 is OpenAI's flagship model released on April 23, 2026. It is aimed at real computer work rather than single-turn question answering: code, research, data analysis, documents, spreadsheets, browsers, and file systems can all live in the same workflow while the model keeps a long-running goal in view.
OpenAI's release material emphasizes better task completion at latency close to GPT-5.4, and more efficient token use on Codex tasks. In the API, gpt-5.5 is the default flagship; gpt-5.5-pro is the slower, higher-compute sibling for the hardest work.
Key capabilities
| Dimension | Detail |
|---|---|
| Context window | 1,050,000 tokens (about 1.05M) |
| Max output | 128,000 tokens |
| Input modalities | Text, image |
| Output modalities | Text |
| Tools | function calling, structured outputs, streaming, web search, file search, image generation, code interpreter, hosted shell, apply patch, computer use, MCP |
Inputs above 272K tokens enter a higher long-context tier (2x input and 1.5x output), covering standard, batch, and flex modes. See live pricing in the model catalog.
Benchmarks
GPT-5.5's evaluation story spans long-context retrieval, agentic coding, computer and knowledge work, security, and factual reliability. The theme is not winning every isolated benchmark; it is finishing long tasks more reliably than GPT-5.4.
Long-context retrieval
Recovers key facts near the 1M-token range
Compared with GPT-5.4 on the same axis, GPT-5.5 is not just larger-context; it is much more reliable from 512K to 1M.
On OpenAI MRCR v2 8-needle in the 512K-1M range, GPT-5.5 rises from GPT-5.4's 36.6% to 74.0%. Graphwalks BFS 1M F1 improves from 9.4% to 45.4%. This matters for large repos, long contracts, research packets, and multi-file tasks: the model does not just fit the context, it retrieves from it more reliably.
Coding and terminal work
Agentic coding scores closer to real engineering
These tasks require reading the environment, running commands, editing code, observing failures, and iterating.
OpenAI called GPT-5.5 its strongest agentic coding model at release. It reaches 82.7% on Terminal-Bench 2.0, ahead of GPT-5.4 at 75.1%; 58.6% on SWE-Bench Pro Public; and 73.1% on Expert-SWE. These tests resemble real engineering loops: read the environment, run commands, edit, observe failure, and fix again.
Computer use, knowledge work, security, and reliability
Strength beyond code: documents, spreadsheets, and financial models
From browser and desktop tasks to investment-banking modeling and office QA, GPT-5.5 is tuned for longer deliverable workflows.
OSWorld-Verified reaches 78.7%, above GPT-5.4's 75.0%. GDPval wins or ties is 84.9%, investment-banking modeling is 88.5%, and OfficeQA Pro is 54.1%.
High-capability tier for vulnerability discovery and defensive work
OpenAI treats cybersecurity capability as High capability, so production use should pair it with trusted access, logs, and permission boundaries.
Security evaluations include CVE-Bench 93.1% pass@1, Capture-the-Flags 88.1%, and CyberGym 81.8%.
More claims per answer, fewer factual errors
Lower hallucination while making more claims is especially valuable in long multi-step workflows.
OpenAI also reports lower hallucination rates despite more claims per response: response-level hallucination falls from 9.5% to 9.2%, and claim-level hallucination from 2.6% to 2.0%.
When to use it
- Long-context engineering: codebases, test output, design docs, and issues in one window.
- Agentic workflows: browser, terminal, file search, code interpreter, and MCP together.
- Professional knowledge work: financial models, research, contracts, reports, and decks.
- Security and defensive work: vulnerability analysis, code review, and configuration checks with proper permissions and audit.
CrossModel exposes GPT-5.5 through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.