CrossModel
Back to model catalog

Xiaomi · Model guide

MiMo V2.5

xiaomi/mimo-v2.5
Modalities
TextImageText
Context
1M
Max output
128K
MiMo V2.5

A natively omni-modal open-source agent model

Architecture
310B
15B active sparse MoE
Context window
1M
tokens
Max output
128K
tokens
License
MIT
Commercial · retrainable
Omni-modal understanding
Mixed text-image docs, chart QA, long-video understanding
Multimodal agents
Agent tasks in screenshot / UI environments
Long context
1M context for retrieval over large source sets

The model natively supports text / image / audio / video; the CrossModel gateway currently exposes text and image input with text output.

Overview

MiMo-V2.5 is a natively omni-modal agent model open-sourced (MIT license) by Xiaomi's MiMo team in April 2026, alongside the flagship MiMo-V2.5-Pro. It unifies understanding of text, image, audio, and video in one model — a "single model for multimodality + agents" approach — with a 310B total / 15B active sparse MoE and a native 1M token context window.

Unlike approaches that bolt vision onto a text model, MiMo-V2.5's multimodality is trained natively, so its image and video scores do not come at the cost of its coding and agent abilities.

Key capabilities

DimensionDetail
Context window1,000,000 tokens
Max output128,000 tokens
Input modalitiesText, image (model also natively supports audio / video)
Output modalitiesText
Architecture310B total / 15B active sparse MoE
Toolsfunction calling, JSON output, streaming, Thinking, vision

The open-source model natively supports text / image / audio / video; the CrossModel gateway currently exposes text and image input with text output. See current pricing in the model catalog.

Benchmarks

MiMo-V2.5's evaluation axis is omni-modal coverage: lead on image and video understanding, hold coding and agent ability at a usable level for its size, and stay stable in million-token context.

Multimodal understanding

MiMo-V2.5 multimodal benchmarks

On image understanding, MiMo-V2.5 reaches HR-Bench 4k 88.5, OmniDocBench 87.2, MMMU-Pro 77.9, and CharXiv RQ 81.0, leading Claude Opus 4.6 / Sonnet 4.6 and Gemini 3 Pro on several metrics; on video it posts Video-MME 87.7, DailyOmni 83.5, and VideoHolmes 64.0. Omni-modal coverage is its biggest differentiator — one model handles text-image documents, charts, and long video.

Coding and agents

MiMo-V2.5 coding and agent benchmarks

For a multimodal model, MiMo-V2.5's coding and agent ability is far from weak: MiMo Coding Bench 71.8, Claw-Eval Text 62.3, Terminal-Bench 2.0 65.8, and SWE-Bench Pro 56.1 — well-balanced for its size and enough to drive agent tasks in UI- and screenshot-bearing environments.

Long context

MiMo-V2.5 GraphWalks long-context performance

On GraphWalks, extending input to 1M tokens still holds BFS 0.54 / Parents 0.87 F1, while the previous V2-Omni degraded sharply at much shorter lengths — giving it a genuinely usable million-token multimodal context for retrieval over large source sets.

When to use it

  • Multimodal understanding: mixed text-image documents, chart QA, long-video understanding and summarization.
  • Multimodal agents: agent tasks executed in screenshot / UI environments.
  • Long-context knowledge work: 1M context for retrieval and synthesis over large document and asset sets.
  • Local deployment: MIT license, commercial use and retraining allowed, with official SGLang / vLLM support.

CrossModel exposes MiMo-V2.5 through both the OpenAI-compatible /v1/chat/completions and Anthropic-compatible /v1/messages APIs (text + image). Current pricing is available in the model catalog.