Gemini 3.1 Pro Preview

The Gemini 3 flagship with higher hard-problem density

Context window

1,048,576 tokens

Max output

64K

65,536 tokens

Reasoning mode

Deep Think

Hardest multi-step problems

Hard reasoning

Graduate-level science, math proofs, cross-paper hypotheses

Long-horizon engineering

Cross-file refactors, debugging, terminal automation

Agentic

Programming / research / ops agents with reliable tool use

The key shift is not one score but simultaneous gains across GPQA, ARC-AGI-2, HLE, and SWE-bench.

Overview

Gemini 3.1 Pro is Google DeepMind's flagship reasoning model, released on February 19, 2026, and the latest Pro-tier iteration in the Gemini 3 family. It keeps Gemini's native multimodal foundation while pushing academic reasoning, scientific knowledge, and long-horizon agentic coding to a higher level. For the hardest multi-step problems it also offers Deep Think.

CrossModel exposes the preview SKU as gemini/gemini-3.1-pro-preview. Compared with Gemini 2.5 Pro, the important shift is not one isolated score; it is the density of hard problems it handles at once across GPQA Diamond, ARC-AGI-2, Humanity's Last Exam, and SWE-bench Verified.

Key capabilities

Dimension	Detail
Context window	1,048,576 tokens (about 1M)
Max output	65,536 tokens (about 64K)
Input modalities	Text, image (Google's native model also supports audio and video)
Output modalities	Text
Tools	function calling, structured outputs, streaming, thinking / Deep Think, multi-step tool use (MCP)

Gemini Pro-family requests enter a higher tier when single-request input exceeds 200K tokens (roughly 2× input and 1.5× output multipliers). This is a product pricing structure, not a per-unit price; see live rates in the model catalog.

Benchmarks

Gemini 3.1 Pro is strongest on hard reasoning and agentic engineering rather than routine Q&A — it scores on graduate-level science, abstract reasoning, and real software-engineering tasks at the same time.

Reasoning & Multimodal

Scoring on the hardest science, abstraction, and long context at once

GPQA Diamond

94.3%

ARC-AGI-2

77.1%

HLE (with tools)

51.4%

MMMLU

92.6%

MMMU-Pro

80.5%

MRCR v2 @128K

84.9%

HLE is with tools, ARC-AGI-2 is built to separate top models; MRCR v2 measures long-document retrieval.

Among the core figures, GPQA Diamond 94.3% approaches the ceiling of this science-knowledge suite and MMMLU 92.6% shows solid multilingual coverage. The "genuinely hard" signals are Humanity's Last Exam (with tools) 51.4% and ARC-AGI-2 77.1%, both built to separate top models. On multimodal and long-context tasks it posts MMMU-Pro 80.5% and MRCR v2 @128K 84.9%, so retrieval accuracy holds up under long documents and mixed text-image input.

Coding and agents

Coding & Agents

Software engineering is 3.1 Pro’s biggest upgrade

SWE-bench Verified

80.6%

Single submission

Terminal-Bench 2.0

68.5%

LiveCodeBench Pro

2887 Elo

MCP Atlas

69.2%

SWE-bench Pro

54.2%

Read

Repo state, logs, and failing tests

Edit

Cross-file changes organized into a patch

Call tools

Run tests, terminal commands, and MCP tools

Self-correct

Iterate on feedback until green

These metrics measure reliable tool calls, file edits, and self-correction rather than one-shot answers.

Software engineering is the biggest upgrade: SWE-bench Verified 80.6% (single submission), Terminal-Bench 2.0 68.5% for real repo fixes and terminal workflows, and LiveCodeBench Pro 2887 Elo for competition-grade algorithmic coding. Closer to agent deployment are MCP Atlas 69.2% for multi-step tool workflows and SWE-bench Pro 54.2% for diverse tasks — these measure reliable, repeated tool calls, file edits, and self-correction rather than one-shot answers.

When to use it

Hard reasoning and research: graduate-level science, math proofs, and cross-paper hypothesis checks; Deep Think can trade extra latency for accuracy on the hardest steps.
Long software tasks: cross-file refactors, debugging, and terminal automation, with the 1M window holding an entire codebase in one request.
Agentic workflows: programming, research, and operations agents that need reliable multi-step tool use and self-correction.
Multimodal analysis: documents, charts, and long video-derived material understood and structured.

CrossModel exposes Gemini 3.1 Pro through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.