---
title: Multimodal input
description: Send images, audio, and video in Chat Completions, Responses, and Messages.
---

# Multimodal input

Multimodal models can accept image, audio, or video input. The content-block format depends on the protocol, and support for each media type depends on the model.

## Chat Completions

In Chat Completions, multimodal content goes in `messages[].content` as an array of content blocks.

### Image input

```json
{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's in this image?" },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.png",
            "detail": "low"
          }
        }
      ]
    }
  ]
}
```

`image_url.url` can be a public image URL or a `data:image/...;base64,...` data URL.
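
If the image is only available as local bytes, the data URL form can be assembled by hand. The following is a minimal Python sketch, not an official helper: the function names `image_data_url` and `image_url_block` are hypothetical, and the MIME type is assumed to be known by the caller.

```python
import base64


def image_data_url(data: bytes, mime_type: str = "image/png") -> str:
    """Return a data:image/...;base64,... URL for raw image bytes."""
    if not mime_type.startswith("image/"):
        raise ValueError(f"expected an image MIME type, got {mime_type!r}")
    # b64encode emits a single line with no line breaks.
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_url_block(url: str, detail: str = "low") -> dict:
    """Build a Chat Completions image_url content block (hypothetical helper)."""
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


# Usage: read the bytes yourself, e.g. Path("image.png").read_bytes(),
# then pass the resulting block in messages[].content.
block = image_url_block(image_data_url(b"abc", "image/png"))
```

The resulting block has the same shape as the JSON example above, with the data URL in place of the public URL.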

### Audio input

Audio uses the `input_audio` content block. Per the OpenAI Chat Completions format, `data` is the base64-encoded audio and `format` is the audio format, e.g. `wav` or `mp3`.

```json
{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Transcribe this audio and summarize the key points." },
        {
          "type": "input_audio",
          "input_audio": {
            "data": "BASE64_AUDIO_DATA",
            "format": "wav"
          }
        }
      ]
    }
  ]
}
```
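
As an illustration of producing the `data` field, here is a minimal Python sketch that base64-encodes raw audio bytes and wraps them in an `input_audio` block. The helper name `input_audio_block` is hypothetical, and the caller is assumed to know the audio format.

```python
import base64


def input_audio_block(audio: bytes, audio_format: str = "wav") -> dict:
    """Build a Chat Completions input_audio content block (hypothetical helper)."""
    # Plain base64 only: no data: prefix and no line breaks.
    data = base64.b64encode(audio).decode("ascii")
    return {
        "type": "input_audio",
        "input_audio": {"data": data, "format": audio_format},
    }


# Usage: in practice the bytes would come from a file or a download,
# e.g. Path("clip.wav").read_bytes(); b"RIFF" is a stand-in here.
audio_block = input_audio_block(b"RIFF", "wav")
```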

### Video input

Pass a public video URL with the `video_url` content block:

```json
{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Summarize what happens in this video." },
        {
          "type": "video_url",
          "video_url": {
            "url": "https://example.com/video.mp4"
          }
        }
      ]
    }
  ]
}
```

Some models also accept a video as a list of frame image URLs in a `video` content block:

```json
{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe what happens, based on these key frames." },
        {
          "type": "video",
          "video": [
            "https://example.com/frame-1.jpg",
            "https://example.com/frame-2.jpg",
            "https://example.com/frame-3.jpg"
          ]
        }
      ]
    }
  ]
}
```

## Responses image input

```json
{
  "model": "openai/gpt-5-mini",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [
        { "type": "input_text", "text": "Describe this image" },
        {
          "type": "input_image",
          "image_url": "https://example.com/image.png",
          "detail": "low"
        }
      ]
    }
  ]
}
```

## Messages image input

```json
{
  "model": "anthropic/claude-sonnet-4.6",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image" },
        {
          "type": "image",
          "source": {
            "type": "url",
            "url": "https://example.com/image.png"
          }
        }
      ]
    }
  ]
}
```
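
To compare the three image block shapes shown on this page, the sketch below builds each one from the same URL. It is only an illustration: the function name `image_block` and the protocol labels `chat_completions`, `responses`, and `messages` are hypothetical, and the fields mirror the JSON examples above rather than a complete schema.

```python
def image_block(protocol: str, url: str, detail: str = "low") -> dict:
    """Return the image content block for the given protocol label."""
    if protocol == "chat_completions":
        # Nested object under image_url
        return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
    if protocol == "responses":
        # image_url is a plain string; detail sits beside it
        return {"type": "input_image", "image_url": url, "detail": detail}
    if protocol == "messages":
        # URL wrapped in a source object; the example above has no detail field
        return {"type": "image", "source": {"type": "url", "url": url}}
    raise ValueError(f"unknown protocol: {protocol!r}")


url = "https://example.com/image.png"
blocks = {p: image_block(p, url) for p in ("chat_completions", "responses", "messages")}
```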

## Tips

- Prefer public HTTPS image URLs, or a standard `data:image/...;base64,...` data URL.
- Chat Completions audio uses base64 data; if your audio lives at a remote URL, download and encode it server-side before passing `input_audio.data`.
- Make sure the base64 audio string contains no line breaks and no extra prefix or suffix.
- The video URL must be reachable by the provider; for longer videos, extract key frames or clips first.
- More images, audio, or video generally means more tokens or processing time.
- If the model supports `detail`, pick `low`, `auto`, or `high` to fit the task.
- Validate image format, size, and recognition quality on your target model before shipping.
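
The base64 tips above can be checked before sending a request. The following Python sketch is one possible approach, not a required step: the function name `clean_base64` is hypothetical, and it assumes the most likely stray prefix is a `data:...;base64,` header left over from a data URL.

```python
import base64
import binascii
import re


def clean_base64(value: str) -> str:
    """Strip a data URL header and whitespace, then verify the result is base64."""
    # Assumption: an unwanted prefix looks like "data:audio/wav;base64,".
    if value.startswith("data:") and "," in value:
        value = value.split(",", 1)[1]
    # Remove line breaks and any other whitespace.
    value = re.sub(r"\s+", "", value)
    try:
        base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        raise ValueError("audio data is not valid base64") from exc
    return value


cleaned = clean_base64("data:audio/wav;base64,UklG\nRg==")
```

The cleaned string can then be passed as `input_audio.data`.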
