Multimodal input

Models with multimodal support can take image, audio, or video input. The content-block format differs by protocol, and support for each media type varies by model.

Chat Completions

In Chat Completions, multimodal content goes in messages[].content as an array of content blocks, each identified by its type field.

Image input

{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's in this image?" },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.png",
            "detail": "low"
          }
        }
      ]
    }
  ]
}

image_url.url can be a public image URL or a data:image/...;base64,... data URL.
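When the image is a local file rather than a public URL, the data URL form can be built along these lines. This is a minimal sketch: the helper names image_data_url and image_url_block are hypothetical, and the MIME type is guessed from the file name, which is an assumption that may not hold for every file.

```python
import base64
import mimetypes


def image_data_url(data: bytes, filename: str) -> str:
    """Build a data:image/...;base64,... URL from raw image bytes.

    Hypothetical helper: the MIME type is inferred from the file name.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    # b64encode returns a single line with no embedded newlines.
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_url_block(url: str, detail: str = "low") -> dict:
    """Wrap a public URL or data URL in an image_url content block."""
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
```

The resulting block can be placed in messages[].content next to a text block, in the same position as the image_url block in the request above.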

Audio input

Audio uses the input_audio content block. Per the OpenAI Chat Completions format, data is the base64-encoded audio and format is the audio format, e.g. wav or mp3.

{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Transcribe this audio and summarize the key points." },
        {
          "type": "input_audio",
          "input_audio": {
            "data": "BASE64_AUDIO_DATA",
            "format": "wav"
          }
        }
      ]
    }
  ]
}
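The request body above can be assembled from raw audio bytes roughly as follows. This is a sketch under stated assumptions: audio_block and audio_message are hypothetical helper names, and the format check only covers the two formats named above, although a given model may accept others.

```python
import base64

# Only the formats mentioned in this section; an assumption, not a full list.
KNOWN_FORMATS = {"wav", "mp3"}


def audio_block(data: bytes, fmt: str) -> dict:
    """Build an input_audio content block from raw audio bytes."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unexpected audio format: {fmt!r}")
    # Plain base64 on a single line, with no data: prefix.
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": encoded, "format": fmt}}


def audio_message(prompt: str, data: bytes, fmt: str) -> dict:
    """Build a user message holding a text prompt followed by the audio."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, audio_block(data, fmt)],
    }
```

Audio fetched from a remote URL can be passed through the same helper once its bytes have been downloaded.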

Video input

Pass a public video URL with the video_url content block:

{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Summarize what happens in this video." },
        {
          "type": "video_url",
          "video_url": {
            "url": "https://example.com/video.mp4"
          }
        }
      ]
    }
  ]
}

Some models also accept a video as a list of frame image URLs, passed in the video content block:

{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe what happens, based on these key frames." },
        {
          "type": "video",
          "video": [
            "https://example.com/frame-1.jpg",
            "https://example.com/frame-2.jpg",
            "https://example.com/frame-3.jpg"
          ]
        }
      ]
    }
  ]
}
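For a long video with many extracted frames, one way to keep the request small is to send an evenly spaced subset. The sketch below assumes the frames already exist as image URLs. The names sample_frames and video_frames_block are hypothetical, and a suitable max_frames value depends on the target model.

```python
from typing import List


def sample_frames(frame_urls: List[str], max_frames: int) -> List[str]:
    """Pick up to max_frames evenly spaced frames, keeping the first and last."""
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    n = len(frame_urls)
    if n <= max_frames:
        return list(frame_urls)
    if max_frames == 1:
        return [frame_urls[0]]
    # Spread the indices across the whole range 0 .. n-1.
    step = (n - 1) / (max_frames - 1)
    return [frame_urls[round(i * step)] for i in range(max_frames)]


def video_frames_block(frame_urls: List[str]) -> dict:
    """Wrap frame URLs in the frame-list video content block shown above."""
    return {"type": "video", "video": list(frame_urls)}
```

For example, video_frames_block(sample_frames(all_frames, 8)) would send at most eight frames, where all_frames is an assumed list of frame URLs in playback order.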

Responses image input

{
  "model": "openai/gpt-5-mini",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [
        { "type": "input_text", "text": "Describe this image" },
        {
          "type": "input_image",
          "image_url": "https://example.com/image.png",
          "detail": "low"
        }
      ]
    }
  ]
}

Messages image input

{
  "model": "anthropic/claude-sonnet-4.6",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image" },
        {
          "type": "image",
          "source": {
            "type": "url",
            "url": "https://example.com/image.png"
          }
        }
      ]
    }
  ]
}
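The three image block shapes shown in this section differ only in field names and nesting, so a client that targets several protocols could build them from one function. The following is an illustrative sketch, not part of any SDK: image_block and its protocol labels are hypothetical names.

```python
from typing import Optional


def image_block(protocol: str, url: str, detail: Optional[str] = None) -> dict:
    """Build an image content block for one of the three protocols above."""
    if protocol == "chat_completions":
        image = {"url": url}
        if detail is not None:
            image["detail"] = detail
        return {"type": "image_url", "image_url": image}
    if protocol == "responses":
        block = {"type": "input_image", "image_url": url}
        if detail is not None:
            block["detail"] = detail
        return block
    if protocol == "messages":
        # The Messages example above has no detail field, so it is omitted here.
        return {"type": "image", "source": {"type": "url", "url": url}}
    raise ValueError(f"unknown protocol: {protocol!r}")
```

Note that the surrounding request also differs: Responses puts messages under input with input_text blocks, and the Messages example sets max_tokens.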

Tips

Prefer public HTTPS image URLs, or a standard data:image/...;base64,... data URL.
Chat Completions audio uses base64 data; if your audio lives at a remote URL, download and encode it server-side before passing input_audio.data.
Make sure the audio base64 string contains no line breaks and no extra prefix or suffix, such as a leading data:...;base64, header.
The video URL must be reachable by the provider; for longer videos, extract key frames or clips first.
More images, audio, or video generally means more tokens or processing time.
If the model supports the detail field on image blocks, set it to low, auto, or high to fit the task.
Validate image format, size, and recognition quality on your target model before shipping.
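The base64 hygiene tip can be enforced before sending a request. The sketch below strips a data URL prefix and any whitespace, then checks that what remains is valid base64. The function name clean_audio_base64 is hypothetical, and the check says nothing about whether the decoded bytes are valid audio.

```python
import base64
import binascii
import re


def clean_audio_base64(value: str) -> str:
    """Return a single-line base64 string suitable for input_audio.data."""
    # Drop a data URL header such as "data:audio/wav;base64," if present.
    if value.startswith("data:"):
        _, sep, payload = value.partition(",")
        if not sep:
            raise ValueError("data URL has no comma separator")
        value = payload
    # Remove line breaks and any other whitespace.
    value = re.sub(r"\s+", "", value)
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        raise ValueError("not valid base64") from exc
    return value
```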