Multimodal input
Models with multimodal support can take image, audio, or video input. The content-block format differs by protocol, and support for each media type varies by model.
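As a rough illustration of how the format differs by protocol, the sketch below builds an image block for each of the three protocols covered on this page from the same URL. The helper name image_block and the protocol labels "chat", "responses", and "messages" are invented for this example; the block shapes mirror the JSON samples on this page and omit optional fields such as detail.

```python
def image_block(protocol: str, url: str) -> dict:
    """Build an image content block for one protocol (hypothetical helper)."""
    if protocol == "chat":
        # Chat Completions: the URL is nested under an image_url object.
        return {"type": "image_url", "image_url": {"url": url}}
    if protocol == "responses":
        # Responses: image_url is a plain string on an input_image block.
        return {"type": "input_image", "image_url": url}
    if protocol == "messages":
        # Messages: the URL sits inside a source object of type "url".
        return {"type": "image", "source": {"type": "url", "url": url}}
    raise ValueError(f"unknown protocol: {protocol!r}")


if __name__ == "__main__":
    for name in ("chat", "responses", "messages"):
        print(name, image_block(name, "https://example.com/image.png"))
```

Keeping the protocol-specific shapes in one place like this is only one possible design; it makes it harder to send a block in the wrong shape when an application talks to more than one endpoint.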
Chat Completions
Chat Completions sends multimodal content as an array in messages[].content.
Image input
{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's in this image?" },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.png",
            "detail": "low"
          }
        }
      ]
    }
  ]
}
image_url.url can be a public image URL or a data:image/...;base64,... data URL.
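For the data URL case, a minimal sketch of turning raw image bytes into a value for image_url.url might look like this. The function names image_data_url and chat_image_block are hypothetical, and the caller is assumed to know the image's MIME type; nothing here is checked against what a particular model accepts.

```python
import base64


def image_data_url(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data:image/...;base64,... URL."""
    if not mime.startswith("image/"):
        raise ValueError(f"not an image MIME type: {mime!r}")
    # b64encode returns a single line of bytes with no embedded newlines.
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def chat_image_block(data: bytes, mime: str = "image/png") -> dict:
    """Wrap the data URL in a Chat Completions image_url block."""
    return {"type": "image_url", "image_url": {"url": image_data_url(data, mime)}}
```

In practice the bytes would typically come from reading a local file in binary mode, for example with pathlib.Path(...).read_bytes().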
Audio input
Audio uses the input_audio content block. Per the OpenAI Chat Completions format, data is the base64-encoded audio and format is the audio format, e.g. wav or mp3.
{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Transcribe this audio and summarize the key points." },
        {
          "type": "input_audio",
          "input_audio": {
            "data": "BASE64_AUDIO_DATA",
            "format": "wav"
          }
        }
      ]
    }
  ]
}
Video input
Pass a public video URL with the video_url content block:
{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Summarize what happens in this video." },
        {
          "type": "video_url",
          "video_url": {
            "url": "https://example.com/video.mp4"
          }
        }
      ]
    }
  ]
}
Some models also accept a video split into frames:
{
  "model": "openai/gpt-5-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe what happens, based on these key frames." },
        {
          "type": "video",
          "video": [
            "https://example.com/frame-1.jpg",
            "https://example.com/frame-2.jpg",
            "https://example.com/frame-3.jpg"
          ]
        }
      ]
    }
  ]
}
Responses image input
{
  "model": "openai/gpt-5-mini",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [
        { "type": "input_text", "text": "Describe this image" },
        {
          "type": "input_image",
          "image_url": "https://example.com/image.png",
          "detail": "low"
        }
      ]
    }
  ]
}
Messages image input
{
  "model": "anthropic/claude-sonnet-4.6",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image" },
        {
          "type": "image",
          "source": {
            "type": "url",
            "url": "https://example.com/image.png"
          }
        }
      ]
    }
  ]
}
Tips
- Prefer public HTTPS image URLs, or a standard data:image/...;base64,... data URL.
- Chat Completions audio uses base64 data; if your audio lives at a remote URL, download and encode it server-side before passing input_audio.data.
- Make sure the audio base64 string has no line breaks or extra prefix/suffix.
- The video URL must be reachable by the provider; for longer videos, extract key frames or clips first.
- More images, audio, or video generally means more tokens or processing time.
- If the model supports detail, pick low, auto, or high to fit the task.
- Validate image format, size, and recognition quality on your target model before shipping.
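The audio tips above can be sketched roughly as follows: encode raw audio bytes into a single-line base64 string for input_audio.data, and clean up a string that already carries a data: prefix or line breaks. The function names audio_block and clean_base64 are hypothetical, and the sketch assumes the audio bytes have already been downloaded or read on the server.

```python
import base64


def audio_block(raw: bytes, fmt: str = "wav") -> dict:
    """Build a Chat Completions input_audio block from raw audio bytes."""
    # b64encode (unlike base64.encodebytes) does not insert line breaks.
    data = base64.b64encode(raw).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}


def clean_base64(value: str) -> str:
    """Drop a data-URL prefix and any whitespace from a base64 string."""
    if value.startswith("data:") and "," in value:
        # Keep only the payload after the first comma.
        value = value.split(",", 1)[1]
    # str.split() with no arguments splits on all whitespace, newlines included.
    return "".join(value.split())
```

Using base64.b64encode rather than base64.encodebytes is deliberate here, since the latter inserts a newline after every 76 characters of output.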