CrossModel
Back to model catalog

Gemini · Model guide

Gemini 2.5 Flash Lite

gemini/gemini-2.5-flash-lite
Modalities
TextImageAudioVideoText
Context
1M
Max output
66K
Gemini 2.5 Flash-Lite

The fastest, cheapest, lowest-latency tier in the 2.5 family

Context window
1M
1,048,576 tokens
Non-thinking throughput
215
tokens/s
Thinking
Optional
optional thinking
High-volume narrow tasks
Classification, extraction, translation, tagging, routing
Scaled online services
RAG preprocessing, realtime summaries, lightweight Q&A
First layer of a router
Escalate to Flash / Pro only on failure or ambiguity

Shares the 1M context and native multimodal foundation with Flash and Pro, but pushes the tradeoff toward speed and cost.

Overview

Gemini 2.5 Flash-Lite is Google DeepMind's lightweight reasoning model, released on June 17, 2025. It is the fastest, lowest-latency, lowest-cost tier in the Gemini 2.5 family, designed for large-scale calls where throughput and response time matter as much as quality.

CrossModel exposes it as gemini/gemini-2.5-flash-lite. It shares the 2.5 family's 1M context and native multimodal foundation with Flash and Pro, but pushes the tradeoff toward speed and cost.

Key capabilities

DimensionDetail
Context window1,048,576 tokens (about 1M)
Max output65,536 tokens (about 64K)
Input modalitiesText, image (Google's native model also supports audio and video)
Output modalitiesText
Toolsfunction calling, structured outputs, streaming, optional thinking

Flash-Lite uses a single low pricing tier in the 2.5 family, which suits frequent production calls. See live rates in the model catalog.

Position in the 2.5 family

2.5 Family Ladder

Save the expensive compute budget for genuinely hard steps

Flash-Lite
215 t/s
Fastest, cheapest
Flash
Tunable thinking
Middle tier
Pro
Top reasoning
Backstop for the hardest
01
First layer
Flash-Lite handles classification / extraction / translation
02
Escalate
Ambiguous or failed samples go to Flash (tunable thinking budget)
03
Backstop
The hardest reasoning goes to Pro

Handle high-volume, low-risk, verifiable narrow tasks on Flash-Lite first, then escalate step by step.

Flash-Lite works best on high-volume, low-risk, clearly verifiable steps. In non-thinking mode it can reach roughly 215 tokens/s, which makes it a good fit for intent classification, keyword extraction, translation, tagging, and lightweight vision. Treat it as the first layer of a router: handle the easy, high-frequency traffic here, then escalate ambiguous or failed samples to Flash (with a larger thinking budget) or to Pro (for the hardest reasoning). This keeps response time low while saving the more expensive compute budget for genuinely complex steps.

When to use it

  • High-volume narrow tasks: classification, extraction, translation, tagging, and routing.
  • Scaled online services: RAG preprocessing, realtime summaries, and lightweight Q&A.
  • Light multimodal processing: screenshots, receipts, product images, and form photos.
  • First layer of a router: handle easy traffic before escalating to Flash or Pro.

CrossModel exposes Gemini 2.5 Flash-Lite through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.