CrossModel
Back to model catalog

Gemini · Model guide

Gemini 2.5 Flash

gemini/gemini-2.5-flash
Modalities
TextImageAudioVideoText
Context
1M
Max output
66K
Gemini 2.5 Flash

A fast thinking model with a controllable thinking budget

Context window
1M
1,048,576 tokens
Non-thinking throughput
225
tokens/s
Thinking budget
Tunable
0 → high, auto-calibrated
High-frequency low-latency
Intent classification, realtime summaries, RAG, support routing
Mixed difficulty
Thinking off for easy requests, raised for hard cases
High-volume multimodal
Fast screenshot, chart, document understanding with structured output

One model serves as a very fast lightweight model and a deeper reasoner, with no need to maintain multiple SKUs.

Overview

Gemini 2.5 Flash is Google DeepMind's fast thinking model, announced at Google I/O on May 20, 2025. Its defining feature is a controllable thinking budget: developers can trade quality, cost, and latency within the same model instead of routing every hard request to a different SKU.

CrossModel exposes it as gemini/gemini-2.5-flash. It is designed for frequent, scaled, latency-sensitive production use while retaining 1M context and native multimodal input.

Key capabilities

DimensionDetail
Context window1,048,576 tokens (about 1M)
Max output65,536 tokens (about 64K)
Input modalitiesText, image (Google's native model also supports audio and video)
Output modalitiesText
Toolsfunction calling, structured outputs, streaming, controllable thinking budget

Flash uses a single pricing tier without a long-context surcharge. Higher thinking budgets produce more reasoning/output tokens, so they raise per-request cost as a product tradeoff rather than a separate rate rule. See live rates in the model catalog.

Thinking budget: one model, several operating points

Thinking Budget

One model, several operating points

Non-thinking throughput
225 t/s
Among the fastest of its gen
Thinking budget
Tunable
Quality / cost / latency
Model IDs
1
No SKU switching
01
Budget = 0
Fast and lightweight: classification, extraction, routing
02
Raise budget
Think longer before answering: math, code, reasoning
03
Auto-calibrate
No budget set → effort scales with task complexity
04
Mixed traffic
Easy requests stay cheap and fast, hard ones get more reasoning

With no explicit budget set, the model calibrates effort based on task complexity.

The core value of 2.5 Flash is not one benchmark — it is being tunable. Set the thinking budget to 0 and it behaves like a very fast, efficient lightweight model for classification, extraction, and routing. Raise the budget and it spends more effort before answering, improving math, code, and reasoning tasks. If no budget is set explicitly, the model calibrates effort based on task complexity.

In non-thinking mode, throughput is roughly 225 tokens/s, making it one of the faster models of its generation. That "one model covers easy and hard requests" flexibility is especially useful in mixed production traffic: there is no need to maintain several model IDs for different difficulty levels — easy requests stay cheap and fast, while harder ones get more reasoning.

When to use it

  • High-frequency low-latency services: intent classification, realtime summaries, RAG, and support routing.
  • Mixed-difficulty traffic: turn thinking off for easy requests and raise it for hard cases.
  • Frequent multimodal tasks: screenshot, chart, and document understanding with structured output.
  • Cost-sensitive scaled inference: use the budget control to keep unit economics predictable.

CrossModel exposes Gemini 2.5 Flash through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.