Gemini 2.5 Flash Lite · Model guide

Gemini 2.5 Flash-Lite

The fastest, cheapest, lowest-latency tier in the 2.5 family

Context window

1,048,576 tokens

Non-thinking throughput

215

tokens/s

Thinking

Optional

optional thinking

High-volume narrow tasks

Classification, extraction, translation, tagging, routing

Scaled online services

RAG preprocessing, realtime summaries, lightweight Q&A

First layer of a router

Escalate to Flash / Pro only on failure or ambiguity

Shares the 1M context and native multimodal foundation with Flash and Pro, but pushes the tradeoff toward speed and cost.

Overview

Gemini 2.5 Flash-Lite is Google DeepMind's lightweight reasoning model, released on June 17, 2025. It is the fastest, lowest-latency, lowest-cost tier in the Gemini 2.5 family, designed for large-scale calls where throughput and response time matter as much as quality.

CrossModel exposes it as gemini/gemini-2.5-flash-lite. It shares the 2.5 family's 1M context and native multimodal foundation with Flash and Pro, but pushes the tradeoff toward speed and cost.

Key capabilities

Dimension	Detail
Context window	1,048,576 tokens (about 1M)
Max output	65,536 tokens (about 64K)
Input modalities	Text, image (Google's native model also supports audio and video)
Output modalities	Text
Tools	function calling, structured outputs, streaming, optional thinking

Flash-Lite uses a single low pricing tier in the 2.5 family, which suits frequent production calls. See live rates in the model catalog.

Position in the 2.5 family

2.5 Family Ladder

Save the expensive compute budget for genuinely hard steps

Flash-Lite

215 t/s

Fastest, cheapest

Flash

Tunable thinking

Middle tier

Pro

Top reasoning

Backstop for the hardest

First layer

Flash-Lite handles classification / extraction / translation

Escalate

Ambiguous or failed samples go to Flash (tunable thinking budget)

Backstop

The hardest reasoning goes to Pro

Handle high-volume, low-risk, verifiable narrow tasks on Flash-Lite first, then escalate step by step.

Flash-Lite works best on high-volume, low-risk, clearly verifiable steps. In non-thinking mode it can reach roughly 215 tokens/s, which makes it a good fit for intent classification, keyword extraction, translation, tagging, and lightweight vision. Treat it as the first layer of a router: handle the easy, high-frequency traffic here, then escalate ambiguous or failed samples to Flash (with a larger thinking budget) or to Pro (for the hardest reasoning). This keeps response time low while saving the more expensive compute budget for genuinely complex steps.

When to use it

High-volume narrow tasks: classification, extraction, translation, tagging, and routing.
Scaled online services: RAG preprocessing, realtime summaries, and lightweight Q&A.
Light multimodal processing: screenshots, receipts, product images, and form photos.
First layer of a router: handle easy traffic before escalating to Flash or Pro.

CrossModel exposes Gemini 2.5 Flash-Lite through an OpenAI-compatible /v1/chat/completions API. Current pricing is available in the model catalog.