Chapter 4. Configuring Mistral 3 multimodal features
Configure Mistral 3 models to process image inputs alongside text for vision-language tasks such as image analysis and document understanding.
All Mistral 3 models include built-in vision encoders that process images at their native resolution and aspect ratio.
Prerequisites
- You have deployed a Mistral 3 model with Red Hat AI Inference Server.
Procedure
Start the inference server with multimodal input enabled:
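The original command block was lost in extraction; a minimal sketch follows. The model name is an example, and `--tokenizer-mode mistral` is an assumption based on common vLLM usage for Mistral models; substitute the values for your deployment.

```shell
# Serve a Mistral 3 model with image inputs enabled (up to 10 images per prompt).
# Model name and tokenizer mode are illustrative; match them to your deployment.
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --limit-mm-per-prompt '{"image":10}'
```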
--limit-mm-per-prompt '{"image":10}': sets the maximum number of images per prompt to 10. Adjust based on your use case and available memory.
Note: If you are using AI accelerators with less memory than the NVIDIA H200, such as the NVIDIA A100, you might need to lower the maximum context length to avoid out-of-memory errors. Add the --max-model-len argument to reduce the context length, for example --max-model-len 225000. Alternatively, you can adjust the --gpu-memory-utilization argument to control how much GPU memory is reserved for model weights and the KV cache.
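On a lower-memory accelerator, the note above might translate into a launch like the following sketch. The model name and the --gpu-memory-utilization value are assumptions; the context length uses the example value from the note.

```shell
# Example launch for a lower-memory accelerator such as the NVIDIA A100:
# cap the context length and the fraction of GPU memory vLLM reserves.
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --limit-mm-per-prompt '{"image":10}' \
  --max-model-len 225000 \
  --gpu-memory-utilization 0.9
```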
Optional. To run in text-only mode with a multimodal model, disable image processing to free GPU memory:
--limit-mm-per-prompt '{"image":0}'
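Assuming the same example model as above, a text-only launch might look like this:

```shell
# Disable image processing to free the GPU memory the vision encoder would use.
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --limit-mm-per-prompt '{"image":0}'
```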
Verification
Check that the model can process an image URL. For example, run the following command:
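The original request example was lost in extraction; the following sketch uses the OpenAI-compatible chat completions endpoint that vLLM exposes. The model name, port, and image URL are placeholders for your deployment.

```shell
# Send a chat completion request that includes an image by URL.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url",
           "image_url": {"url": "https://example.com/sample.png"}}
        ]
      }
    ]
  }'
```

A successful response contains a description of the image in the first choice's message content.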
Alternatively, send an image as base64-encoded data:
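A sketch of the base64 variant, embedding a local file as a data URL; the file name, model name, and port are placeholders.

```shell
# Encode a local image as base64 (GNU coreutils; on macOS use `base64 -i sample.png`).
IMAGE_B64=$(base64 -w0 sample.png)

# Send the image inline as a data URL instead of a remote URL.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,${IMAGE_B64}"}}
      ]
    }
  ]
}
EOF
```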