Gemini 2.5 Flash

A fast thinking model with a controllable thinking budget

Context window

1,048,576 tokens

Non-thinking throughput

225

tokens/s

Thinking budget

Tunable

0 → high, auto-calibrated

High-frequency low-latency

Intent classification, realtime summaries, RAG, support routing

Mixed difficulty

Thinking off for easy requests, raised for hard cases

High-volume multimodal

Fast screenshot, chart, document understanding with structured output

One model serves as a very fast lightweight model and a deeper reasoner, with no need to maintain multiple SKUs.

Overview

Gemini 2.5 Flash is Google DeepMind's fast thinking model, announced at Google I/O on May 20, 2025. Its defining feature is a controllable thinking budget: developers can trade quality, cost, and latency within the same model instead of routing every hard request to a different SKU.

CrossModel exposes it as gemini/gemini-2.5-flash. It is designed for frequent, scaled, latency-sensitive production use while retaining 1M context and native multimodal input.

Key capabilities

Dimension	Detail
Context window	1,048,576 tokens (about 1M)
Max output	65,536 tokens (about 64K)
Input modalities	Text, image (Google's native model also supports audio and video)
Output modalities	Text
Tools	function calling, structured outputs, streaming, controllable thinking budget

Flash uses a single pricing tier without a long-context surcharge. Higher thinking budgets produce more reasoning/output tokens, so they raise per-request cost as a product tradeoff rather than a separate rate rule. See live rates in the model catalog.

Thinking budget: one model, several operating points

Thinking Budget

One model, several operating points

Non-thinking throughput

225 t/s

Among the fastest of its gen

Thinking budget

Tunable

Quality / cost / latency

Model IDs

No SKU switching

Budget = 0

Fast and lightweight: classification, extraction, routing

Raise budget

Think longer before answering: math, code, reasoning

Auto-calibrate

No budget set → effort scales with task complexity

Mixed traffic

Easy requests stay cheap and fast, hard ones get more reasoning

With no explicit budget set, the model calibrates effort based on task complexity.

The core value of 2.5 Flash is not one benchmark — it is being tunable. Set the thinking budget to 0 and it behaves like a very fast, efficient lightweight model for classification, extraction, and routing. Raise the budget and it spends more effort before answering, improving math, code, and reasoning tasks. If no budget is set explicitly, the model calibrates effort based on task complexity.

In non-thinking mode, throughput is roughly 225 tokens/s, making it one of the faster models of its generation. That "one model covers easy and hard requests" flexibility is especially useful in mixed production traffic: there is no need to maintain several model IDs for different difficulty levels — easy requests stay cheap and fast, while harder ones get more reasoning.

When to use it

High-frequency low-latency services: intent classification, realtime summaries, RAG, and support routing.
Mixed-difficulty traffic: turn thinking off for easy requests and raise it for hard cases.
Frequent multimodal tasks: screenshot, chart, and document understanding with structured output.
Cost-sensitive scaled inference: use the budget control to keep unit economics predictable.

CrossModel exposes Gemini 2.5 Flash through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.