A fast thinking model with a controllable thinking budget
One model serves as a very fast lightweight model and a deeper reasoner, with no need to maintain multiple SKUs.
Overview
Gemini 2.5 Flash is Google DeepMind's fast thinking model, announced at Google I/O on May 20, 2025. Its defining feature is a controllable thinking budget: developers can trade quality, cost, and latency within the same model instead of routing every hard request to a different SKU.
CrossModel exposes it as gemini/gemini-2.5-flash. It is designed for frequent, scaled, latency-sensitive production use while retaining 1M context and native multimodal input.
Key capabilities
| Dimension | Detail |
|---|---|
| Context window | 1,048,576 tokens (about 1M) |
| Max output | 65,536 tokens (about 64K) |
| Input modalities | Text, image (Google's native model also supports audio and video) |
| Output modalities | Text |
| Tools | function calling, structured outputs, streaming, controllable thinking budget |
Flash uses a single pricing tier without a long-context surcharge. Higher thinking budgets produce more reasoning/output tokens, so they raise per-request cost as a product tradeoff rather than a separate rate rule. See live rates in the model catalog.
Thinking budget: one model, several operating points
One model, several operating points
With no explicit budget set, the model calibrates effort based on task complexity.
The core value of 2.5 Flash is not one benchmark — it is being tunable. Set the thinking budget to 0 and it behaves like a very fast, efficient lightweight model for classification, extraction, and routing. Raise the budget and it spends more effort before answering, improving math, code, and reasoning tasks. If no budget is set explicitly, the model calibrates effort based on task complexity.
In non-thinking mode, throughput is roughly 225 tokens/s, making it one of the faster models of its generation. That "one model covers easy and hard requests" flexibility is especially useful in mixed production traffic: there is no need to maintain several model IDs for different difficulty levels — easy requests stay cheap and fast, while harder ones get more reasoning.
When to use it
- High-frequency low-latency services: intent classification, realtime summaries, RAG, and support routing.
- Mixed-difficulty traffic: turn thinking off for easy requests and raise it for hard cases.
- Frequent multimodal tasks: screenshot, chart, and document understanding with structured output.
- Cost-sensitive scaled inference: use the budget control to keep unit economics predictable.
CrossModel exposes Gemini 2.5 Flash through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.