CrossModel

Multimodal input

Models with multimodal support can take image, audio, or video input. The content-block format differs by protocol, and support for each media type varies by model.

Chat Completions

In Chat Completions, multimodal content is sent as an array of content blocks in messages[].content.

Image input

{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's in this image?" },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.png",
            "detail": "low"
          }
        }
      ]
    }
  ]
}

image_url.url can be a public image URL or a data:image/...;base64,... data URL.
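
For a local image, the data URL can be built before the request is assembled. The following is a minimal Python sketch, not part of any SDK: the helper names image_data_url and chat_image_block are hypothetical, and the MIME type is guessed from the file name, which is an assumption that may not hold for every file.

```python
import base64
import mimetypes


def image_data_url(image_bytes: bytes, filename: str) -> str:
    """Return a data:image/...;base64,... URL for raw image bytes.

    Hypothetical helper: the MIME type is guessed from the file name.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file name: {filename}")
    # b64encode returns a single line with no line breaks.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def chat_image_block(url: str, detail: str = "low") -> dict:
    """Build an image_url content block shaped like the example above."""
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
```

The same chat_image_block helper accepts either a public URL or the data URL returned by image_data_url, since both go in image_url.url.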

Audio input

Audio uses the input_audio content block. Per the OpenAI Chat Completions format, data is the base64-encoded audio and format is the audio format, e.g. wav or mp3.

{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Transcribe this audio and summarize the key points." },
        {
          "type": "input_audio",
          "input_audio": {
            "data": "BASE64_AUDIO_DATA",
            "format": "wav"
          }
        }
      ]
    }
  ]
}
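
The input_audio block above can be produced from raw audio bytes roughly as follows. This is an illustrative Python sketch; the function name input_audio_block is hypothetical, and it assumes the caller already knows the audio format.

```python
import base64


def input_audio_block(audio_bytes: bytes, audio_format: str = "wav") -> dict:
    """Build an input_audio content block from raw audio bytes.

    Hypothetical helper: audio_format is passed through as-is
    (for example "wav" or "mp3") and is not checked against the bytes.
    """
    # b64encode emits one continuous string, with no line breaks or prefix.
    data = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "type": "input_audio",
        "input_audio": {"data": data, "format": audio_format},
    }
```

If the audio lives at a remote URL, the bytes would be downloaded server-side first and then passed to this helper.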

Video input

Pass a public video URL with the video_url content block:

{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Summarize what happens in this video." },
        {
          "type": "video_url",
          "video_url": {
            "url": "https://example.com/video.mp4"
          }
        }
      ]
    }
  ]
}

Some models also accept a video split into frames:

{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe what happens, based on these key frames." },
        {
          "type": "video",
          "video": [
            "https://example.com/frame-1.jpg",
            "https://example.com/frame-2.jpg",
            "https://example.com/frame-3.jpg"
          ]
        }
      ]
    }
  ]
}

Responses image input

{
  "model": "openai/gpt-5-mini",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [
        { "type": "input_text", "text": "Describe this image" },
        {
          "type": "input_image",
          "image_url": "https://example.com/image.png",
          "detail": "low"
        }
      ]
    }
  ]
}

Messages image input

{
  "model": "anthropic/claude-sonnet-4.6",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image" },
        {
          "type": "image",
          "source": {
            "type": "url",
            "url": "https://example.com/image.png"
          }
        }
      ]
    }
  ]
}
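
The three image examples differ only in the shape of the content block. A small Python sketch can make the difference explicit; the function name image_block and the protocol labels "chat_completions", "responses", and "messages" are hypothetical names chosen for this illustration, and the block shapes are copied from the examples above.

```python
def image_block(protocol: str, url: str, detail: str = "low") -> dict:
    """Return an image content block in the shape each protocol example uses.

    The protocol labels are illustrative, not API values.
    """
    if protocol == "chat_completions":
        # Nested object under image_url, with detail inside it.
        return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
    if protocol == "responses":
        # Flat block: image_url is a plain string, detail is a sibling field.
        return {"type": "input_image", "image_url": url, "detail": detail}
    if protocol == "messages":
        # URL wrapped in a source object; the example above has no detail field.
        return {"type": "image", "source": {"type": "url", "url": url}}
    raise ValueError(f"unknown protocol: {protocol}")
```

Keeping the protocol-specific shapes in one place means the rest of the request-building code only has to deal with a URL.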

Tips

  • Prefer public HTTPS image URLs, or a standard data:image/...;base64,... data URL.
  • Chat Completions audio uses base64 data; if your audio lives at a remote URL, download and encode it server-side before passing input_audio.data.
  • Make sure the audio base64 string contains no line breaks and no extra prefix or suffix.
  • The video URL must be reachable by the provider; for longer videos, extract key frames or clips first.
  • More images, audio, or video generally means more tokens or processing time.
  • If the model supports detail, pick low, auto, or high to fit the task.
  • Validate image format, size, and recognition quality on your target model before shipping.
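
The tip about clean base64 audio can be checked before a request is sent. Below is a minimal Python sketch under stated assumptions: the function name clean_base64 is hypothetical, and it assumes the only unwanted prefix is a data:...;base64, header and the only unwanted characters are whitespace.

```python
import base64
import binascii


def clean_base64(value: str) -> str:
    """Strip a data-URL header and whitespace, then verify the base64.

    Hypothetical helper for preparing input_audio.data.
    """
    if value.startswith("data:"):
        # Keep only what follows the first comma of the data-URL header.
        _, _, value = value.partition(",")
    # Remove line breaks and any other whitespace.
    cleaned = "".join(value.split())
    if not cleaned:
        raise ValueError("audio data is empty")
    try:
        base64.b64decode(cleaned, validate=True)
    except binascii.Error as exc:
        raise ValueError("audio data is not valid base64") from exc
    return cleaned
```

For example, clean_base64("data:audio/wav;base64,UklG\nRg==") returns "UklGRg==", which can then be used as input_audio.data.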