The Gemini 3 flagship with higher hard-problem density
The key shift is not one score but simultaneous gains across GPQA, ARC-AGI-2, HLE, and SWE-bench.
Overview
Gemini 3.1 Pro is Google DeepMind's flagship reasoning model, released on February 19, 2026, and the latest Pro-tier iteration in the Gemini 3 family. It keeps Gemini's native multimodal foundation while pushing academic reasoning, scientific knowledge, and long-horizon agentic coding to a higher level. For the hardest multi-step problems it also offers Deep Think.
CrossModel exposes the preview SKU as gemini/gemini-3.1-pro-preview. Compared with Gemini 2.5 Pro, the important shift is not one isolated score; it is the density of hard problems it handles at once across GPQA Diamond, ARC-AGI-2, Humanity's Last Exam, and SWE-bench Verified.
Key capabilities
| Dimension | Detail |
|---|---|
| Context window | 1,048,576 tokens (about 1M) |
| Max output | 65,536 tokens (about 64K) |
| Input modalities | Text, image (Google's native model also supports audio and video) |
| Output modalities | Text |
| Tools | function calling, structured outputs, streaming, thinking / Deep Think, multi-step tool use (MCP) |
Gemini Pro-family requests enter a higher tier when single-request input exceeds 200K tokens (roughly 2× input and 1.5× output multipliers). This is a product pricing structure, not a per-unit price; see live rates in the model catalog.
Benchmarks
Gemini 3.1 Pro is strongest on hard reasoning and agentic engineering rather than routine Q&A — it scores on graduate-level science, abstract reasoning, and real software-engineering tasks at the same time.
Scoring on the hardest science, abstraction, and long context at once
HLE is with tools, ARC-AGI-2 is built to separate top models; MRCR v2 measures long-document retrieval.
Among the core figures, GPQA Diamond 94.3% approaches the ceiling of this science-knowledge suite and MMMLU 92.6% shows solid multilingual coverage. The "genuinely hard" signals are Humanity's Last Exam (with tools) 51.4% and ARC-AGI-2 77.1%, both built to separate top models. On multimodal and long-context tasks it posts MMMU-Pro 80.5% and MRCR v2 @128K 84.9%, so retrieval accuracy holds up under long documents and mixed text-image input.
Coding and agents
Software engineering is 3.1 Pro’s biggest upgrade
These metrics measure reliable tool calls, file edits, and self-correction rather than one-shot answers.
Software engineering is the biggest upgrade: SWE-bench Verified 80.6% (single submission), Terminal-Bench 2.0 68.5% for real repo fixes and terminal workflows, and LiveCodeBench Pro 2887 Elo for competition-grade algorithmic coding. Closer to agent deployment are MCP Atlas 69.2% for multi-step tool workflows and SWE-bench Pro 54.2% for diverse tasks — these measure reliable, repeated tool calls, file edits, and self-correction rather than one-shot answers.
When to use it
- Hard reasoning and research: graduate-level science, math proofs, and cross-paper hypothesis checks; Deep Think can trade extra latency for accuracy on the hardest steps.
- Long software tasks: cross-file refactors, debugging, and terminal automation, with the 1M window holding an entire codebase in one request.
- Agentic workflows: programming, research, and operations agents that need reliable multi-step tool use and self-correction.
- Multimodal analysis: documents, charts, and long video-derived material understood and structured.
CrossModel exposes Gemini 3.1 Pro through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.