Inference serving Mistral 3 models with Red Hat AI Inference Server
Abstract
Preface
Deploy Mistral 3 models with Red Hat AI Inference Server, including the Mistral Large 3 Mixture-of-Experts model and the Ministral 3 dense model family optimized for edge deployments. The Mistral 3 family includes models released under the Apache 2.0 license with open weights, suitable for on-premises and hybrid-cloud deployments. All models include native multimodal capabilities, tool calling support, and large context windows.
Chapter 1. About Mistral 3 large language models
The Mistral 3 model family includes Mistral Large 3 and the Ministral 3 series, providing enterprise-ready large language models optimized for diverse deployment scenarios from single-node edge devices to multi-GPU clusters.
All Mistral 3 models are released under the Apache 2.0 license with open weights, making them suitable for on-premises and hybrid-cloud deployments. The models are fully compatible with upstream vLLM and require no custom forks for deployment with Red Hat AI Inference Server.
| Architecture | Models | Characteristics |
|---|---|---|
| Sparse MoE | Mistral Large 3 | Activates a subset of experts per token, is efficient at scale, and requires multi-accelerator deployments. |
| Dense | Ministral 3B, 8B, 14B | All parameters are active per token. Suitable for single-accelerator deployments on smaller capacity AI accelerators. |
- Mistral Large 3
Mistral Large 3 is designed for demanding enterprise workloads, delivering strong performance on advanced reasoning and analytical tasks. It supports multi-turn dialogue and vision–language use cases, including document understanding.
Architecturally, it follows a DeepSeekV3-style mixture-of-experts design, but with fewer, larger experts. It applies top-4 expert selection to balance efficiency and capability during inference. Additional distinctions from DeepSeekV3 include its use of softmax-based routing and Llama 4-style RoPE scaling.
- Ministral 3 models
Ministral 3 models are suited for edge deployments with limited GPU resources, latency-sensitive applications, and mobile and embedded AI use cases. Each model includes a built-in vision encoder for multimodal input processing. All models support a 256K context window and include multilingual capabilities. Ministral 3 dense models are released in the following variants:
- Ministral 3 14B is the highest capability dense model, suitable for complex reasoning tasks.
- Ministral 3 8B has separate embedding and output layers, with balanced performance and resource usage.
- Ministral 3 3B uses tied embeddings with shared embedding and output layers for reduced memory footprint.
Chapter 2. Deploying Mistral Large 3 with Red Hat AI Inference Server
Deploy the RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model using Red Hat AI Inference Server and an NVIDIA CUDA multi-accelerator host configured for tensor parallelism.
The RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model is compressed with the Red Hat AI Model Optimization Toolkit and runs at roughly three-quarters of the baseline model's memory usage while delivering near-baseline quality.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with 8 NVIDIA H200 AI accelerators installed.
- You have installed the relevant NVIDIA drivers.
- You have installed the NVIDIA Container Toolkit.
- NVIDIA Fabric Manager is installed and running with NVSwitch.
  Note: You must have root access to start Fabric Manager.
Procedure
Log in to the Red Hat container registry:
podman login registry.redhat.io
Pull the Red Hat AI Inference Server container image:
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
Configure SELinux to allow container device access:
sudo setsebool -P container_use_devices 1
Create a cache directory for model weights:
mkdir -p rhaiis-cache
chmod g+rwX rhaiis-cache
Set your Hugging Face token as an environment variable:
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
Check that the Red Hat AI Inference Server container can access NVIDIA AI accelerators on the host by running the following command:
podman run -it \
  --device nvidia.com/gpu=all \
  --entrypoint /usr/bin/nvidia-smi \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
All AI accelerators should be returned as Enabled.

Start AI Inference Server with the Mistral Large 3 model:
- `--device nvidia.com/gpu=all` provides access to all available AI accelerators.
- `--shm-size=4g` allocates 4 GB of shared memory for inter-process communication.
- `--tokenizer-mode mistral` specifies Mistral’s native tokenizer implementation for converting text to tokens. Mistral models use a specialized tokenizer called Tekken that differs from standard HuggingFace tokenizers in how it handles special tokens and chat formatting. This flag is required because AI Inference Server does not auto-detect the Mistral tokenizer.
- `--config-format mistral` tells AI Inference Server to read the model configuration from Mistral’s native `params.json` file instead of the standard HuggingFace `config.json`. Mistral models use a specific configuration schema that differs from standard HuggingFace formats.
- `--load-format mistral` tells AI Inference Server to load model weights from Mistral’s native `consolidated.safetensors` checkpoint format instead of the standard HuggingFace sharded safetensors files.
- `--tensor-parallel-size 8` distributes the model across 8 AI accelerators. The Mistral Large 3 675B model requires 8 AI accelerators due to its size.
- `--kv-cache-dtype fp8` reduces memory usage by quantizing the KV cache to FP8.
Note: If you are using AI accelerators with less memory than NVIDIA H200, such as NVIDIA A100, you might need to lower the maximum context length to avoid out-of-memory errors. Add the `--max-model-len` argument to reduce the context length, for example `--max-model-len 225000`. Alternatively, you can adjust the `--gpu-memory-utilization` argument to control how much GPU memory is reserved for model weights and KV cache.
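The full launch command is not reproduced here; a representative invocation, assembled from the flags described in this procedure, might look like the following sketch. The published port (8000, the common vLLM default), the cache mount path inside the container, and the use of the `--model` flag are assumptions; adjust them for your environment.

```shell
# Sketch of a Mistral Large 3 launch command (port and mount path are assumptions)
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size=4g \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  --model RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8
```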
Verification
In a separate tab in your terminal, make a request to the model with the API:
The server returns a JSON response containing the model output.
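The request itself is not shown above; a minimal sketch using curl follows, assuming the server listens on localhost port 8000. The prompt text is illustrative.

```shell
# Query the OpenAI-compatible chat completions endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {"role": "user", "content": "Explain tensor parallelism in two sentences."}
        ]
      }'
```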
Chapter 3. Deploying Ministral 3 for edge workloads with Red Hat AI Inference Server
Deploy the RedHatAI/Ministral-3-14B-Instruct-2512 dense model optimized for latency-sensitive and edge deployments using Red Hat AI Inference Server on a single GPU.
Ministral 3 14B offers frontier-level performance for its size and includes vision capabilities. RedHatAI/Ministral-3-14B-Instruct-2512 is the instruct post-trained version in FP8.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with at least one NVIDIA AI accelerator with a minimum of 24 GB of VRAM installed.
- You have installed the relevant NVIDIA drivers.
- You have installed the NVIDIA Container Toolkit.
Procedure
Log in to the Red Hat container registry:
podman login registry.redhat.io
Pull the AI Inference Server container image:
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0

Set your Hugging Face token as an environment variable:
export HF_TOKEN=<your_huggingface_token>
Start the inference server with your selected Ministral 3 model:
- `--tokenizer-mode mistral` specifies Mistral’s native tokenizer implementation for converting text to tokens. Mistral models use a specialized tokenizer called Tekken that differs from standard HuggingFace tokenizers in how it handles special tokens and chat formatting. This flag is required because AI Inference Server does not auto-detect the Mistral tokenizer.
- `--config-format mistral` tells AI Inference Server to read the model configuration from Mistral’s native `params.json` file instead of the standard HuggingFace `config.json`. Mistral models use a specific configuration schema that differs from standard HuggingFace formats.
- `--tensor-parallel-size 1` configures AI Inference Server to serve the model on a single AI accelerator. Adjust this parameter based on the number of required AI accelerators. The default value is 1.
  Note: The number of AI accelerators you need depends on your use case, the available host memory, and specific model requirements.
- `--load-format mistral` controls how the model weights are loaded from disk. This flag tells AI Inference Server to expect and properly load the native Mistral weight format.
Optional: For memory-constrained environments, reduce the maximum context length:
--max-model-len 32768
This reduces memory usage at the cost of shorter context windows.
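Putting the flags above together, a representative single-accelerator launch command might look like the following sketch. The published port (8000) and the use of the `--model` flag are assumptions; adjust them for your environment.

```shell
# Sketch of a Ministral 3 14B launch command on a single accelerator
podman run --rm -it \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  --model RedHatAI/Ministral-3-14B-Instruct-2512 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tensor-parallel-size 1
```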
Verification
In a separate tab in your terminal, make a request to the model with the API:
You should receive a JSON response with an array of choices containing the model output, with latency low enough for your edge requirements.
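A minimal request sketch using curl, assuming the server listens on localhost port 8000; the prompt text is illustrative.

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Ministral-3-14B-Instruct-2512",
        "messages": [
          {"role": "user", "content": "Summarize the benefits of edge inference in two sentences."}
        ]
      }'
```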
Chapter 4. Configuring Mistral 3 multimodal features
Configure Mistral 3 models to process image inputs alongside text for vision-language tasks such as image analysis and document understanding.
All Mistral 3 models include built-in vision encoders that process images at their native resolution and aspect ratio.
Prerequisites
- You have deployed a Mistral 3 model with Red Hat AI Inference Server.
Procedure
Start the inference server with multimodal input enabled:
- `--limit-mm-per-prompt '{"image":10}'` sets the maximum number of images per prompt to 10. Adjust this value based on your use case and available memory.
Note: If you are using AI accelerators with less memory than NVIDIA H200, such as NVIDIA A100, you might need to lower the maximum context length to avoid out-of-memory errors. Add the `--max-model-len` argument to reduce the context length, for example `--max-model-len 225000`. Alternatively, you can adjust the `--gpu-memory-utilization` argument to control how much GPU memory is reserved for model weights and KV cache.
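The launch command is not reproduced above; a representative invocation with multimodal input enabled might look like the following sketch, using the Mistral Large 3 deployment from Chapter 2 as the baseline. The port, mount path, and `--model` flag are assumptions.

```shell
# Sketch: serve Mistral Large 3 with up to 10 images per prompt
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size=4g \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  --model RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tensor-parallel-size 8 \
  --limit-mm-per-prompt '{"image":10}'
```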
Optional: To run in text-only mode with a multimodal model, disable image processing to free GPU memory:
--limit-mm-per-prompt '{"image":0}'
Verification
Check that the model can process an image URL. For example, run the following command:
Alternatively, send an image as base64-encoded data:
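The verification requests are not shown above; the following sketch covers both variants, assuming the server listens on localhost port 8000. The image URL and local file name are placeholders.

```shell
# Image passed by URL (example.com URL is a placeholder)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
          ]
        }]
      }'

# Or embed a local image as base64-encoded data
IMAGE_B64=$(base64 -w0 photo.jpg)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMAGE_B64"'"}}
          ]
        }]
      }'
```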
Chapter 5. Enabling tool calling for Mistral 3 models
Configure a Mistral 3 model deployment to use tool calling with the vLLM OpenAI-compatible API.
Tool calling enables the model to request that your application execute an external function by returning a structured tool_calls object. Your application runs the tool and sends the result back to the model to continue the conversation.
Prerequisites
- You have deployed a Mistral 3 Instruct model with Red Hat AI Inference Server.
- You have defined one or more tools that the model is allowed to call.
- You are running the vLLM serving container included with AI Inference Server.
Procedure
Deploy the RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model with AI Inference Server and enable tool calling:
- `--enable-auto-tool-choice` allows the server to return tool calls automatically when the model requests them.
- `--tool-call-parser mistral` uses Mistral’s native tool calling format for parsing tool calls.
Verification
Send a chat completion request that includes tool definitions, for example:
If the model decides a tool is needed, the response includes a `tool_calls` array instead of a final answer.

Note: Tool execution is performed by your application, not by the model. The model only generates a structured request describing which tool to call and which arguments to use.
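The request body is not shown above; a sketch of a chat completion request that declares a single hypothetical get_weather tool follows, assuming the server listens on localhost port 8000.

```shell
# Chat completion request declaring one tool the model may call
# (get_weather is a hypothetical example tool)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {"role": "user", "content": "What is the weather in Paris right now?"}
        ],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'
```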
Execute the requested tool in your application and send the tool result back to the model. For example:
The model uses the tool output to generate a final natural language response and returns it as JSON.
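The follow-up request is not shown above; the round trip can be sketched as follows. The assistant message echoes the tool call from the previous response, and a tool-role message carries your application's result. The tool call id and weather payload are illustrative; use the id actually returned by the server.

```shell
# Return the tool result to the model to get the final answer
# (call id and tool output values are illustrative)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {"role": "user", "content": "What is the weather in Paris right now?"},
          {"role": "assistant", "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}
          }]},
          {"role": "tool", "tool_call_id": "call_abc123",
           "content": "{\"temp_c\": 18, \"condition\": \"cloudy\"}"}
        ]
      }'
```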