CrossModel
Back to model catalog

Gemini · Model guide

Gemini 3.1 Pro Preview

gemini/gemini-3.1-pro-preview
Modalities
TextImageAudioVideoText
Context
1M
Max output
66K
Gemini 3.1 Pro Preview

The Gemini 3 flagship with higher hard-problem density

Context window
1M
1,048,576 tokens
Max output
64K
65,536 tokens
Reasoning mode
Deep Think
Hardest multi-step problems
Hard reasoning
Graduate-level science, math proofs, cross-paper hypotheses
Long-horizon engineering
Cross-file refactors, debugging, terminal automation
Agentic
Programming / research / ops agents with reliable tool use

The key shift is not one score but simultaneous gains across GPQA, ARC-AGI-2, HLE, and SWE-bench.

Overview

Gemini 3.1 Pro is Google DeepMind's flagship reasoning model, released on February 19, 2026, and the latest Pro-tier iteration in the Gemini 3 family. It keeps Gemini's native multimodal foundation while pushing academic reasoning, scientific knowledge, and long-horizon agentic coding to a higher level. For the hardest multi-step problems it also offers Deep Think.

CrossModel exposes the preview SKU as gemini/gemini-3.1-pro-preview. Compared with Gemini 2.5 Pro, the important shift is not one isolated score; it is the density of hard problems it handles at once across GPQA Diamond, ARC-AGI-2, Humanity's Last Exam, and SWE-bench Verified.

Key capabilities

DimensionDetail
Context window1,048,576 tokens (about 1M)
Max output65,536 tokens (about 64K)
Input modalitiesText, image (Google's native model also supports audio and video)
Output modalitiesText
Toolsfunction calling, structured outputs, streaming, thinking / Deep Think, multi-step tool use (MCP)

Gemini Pro-family requests enter a higher tier when single-request input exceeds 200K tokens (roughly input and 1.5× output multipliers). This is a product pricing structure, not a per-unit price; see live rates in the model catalog.

Benchmarks

Gemini 3.1 Pro is strongest on hard reasoning and agentic engineering rather than routine Q&A — it scores on graduate-level science, abstract reasoning, and real software-engineering tasks at the same time.

Reasoning & Multimodal

Scoring on the hardest science, abstraction, and long context at once

GPQA Diamond
94.3%
ARC-AGI-2
77.1%
HLE (with tools)
51.4%
MMMLU
92.6%
MMMU-Pro
80.5%
MRCR v2 @128K
84.9%

HLE is with tools, ARC-AGI-2 is built to separate top models; MRCR v2 measures long-document retrieval.

Among the core figures, GPQA Diamond 94.3% approaches the ceiling of this science-knowledge suite and MMMLU 92.6% shows solid multilingual coverage. The "genuinely hard" signals are Humanity's Last Exam (with tools) 51.4% and ARC-AGI-2 77.1%, both built to separate top models. On multimodal and long-context tasks it posts MMMU-Pro 80.5% and MRCR v2 @128K 84.9%, so retrieval accuracy holds up under long documents and mixed text-image input.

Coding and agents

Coding & Agents

Software engineering is 3.1 Pro’s biggest upgrade

SWE-bench Verified
80.6%
Single submission
Terminal-Bench 2.0
68.5%
LiveCodeBench Pro
2887 Elo
MCP Atlas
69.2%
SWE-bench Pro
54.2%
01
Read
Repo state, logs, and failing tests
02
Edit
Cross-file changes organized into a patch
03
Call tools
Run tests, terminal commands, and MCP tools
04
Self-correct
Iterate on feedback until green

These metrics measure reliable tool calls, file edits, and self-correction rather than one-shot answers.

Software engineering is the biggest upgrade: SWE-bench Verified 80.6% (single submission), Terminal-Bench 2.0 68.5% for real repo fixes and terminal workflows, and LiveCodeBench Pro 2887 Elo for competition-grade algorithmic coding. Closer to agent deployment are MCP Atlas 69.2% for multi-step tool workflows and SWE-bench Pro 54.2% for diverse tasks — these measure reliable, repeated tool calls, file edits, and self-correction rather than one-shot answers.

When to use it

  • Hard reasoning and research: graduate-level science, math proofs, and cross-paper hypothesis checks; Deep Think can trade extra latency for accuracy on the hardest steps.
  • Long software tasks: cross-file refactors, debugging, and terminal automation, with the 1M window holding an entire codebase in one request.
  • Agentic workflows: programming, research, and operations agents that need reliable multi-step tool use and self-correction.
  • Multimodal analysis: documents, charts, and long video-derived material understood and structured.

CrossModel exposes Gemini 3.1 Pro through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.