Chapter 3. Deploying Ministral 3 for edge workloads with Red Hat AI Inference Server
Deploy the RedHatAI/Ministral-3-14B-Instruct-2512 dense model optimized for latency-sensitive and edge deployments using Red Hat AI Inference Server on a single GPU.
Ministral 3 14B offers frontier-class performance and includes vision capabilities. RedHatAI/Ministral-3-14B-Instruct-2512 is the instruct post-trained version of the model, quantized to FP8.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with at least one NVIDIA AI accelerator that has a minimum of 24 GB of VRAM.
- You have installed the relevant NVIDIA drivers.
- You have installed the NVIDIA Container Toolkit.
Procedure
Log in to the Red Hat container registry:
podman login registry.redhat.io
Pull the AI Inference Server container image:
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:{rhaiis-version}
Set your Hugging Face token as an environment variable:
export HF_TOKEN=<your_huggingface_token>
Start the inference server with your selected Ministral 3 model:
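The full serve command is not reproduced in this chunk; the following is a minimal sketch that combines the model flags described below. The podman options (CDI GPU device, published port 8000, and Hugging Face cache mount path) are assumptions for illustration and may differ in your environment:

```shell
# Sketch: serve Ministral 3 on a single GPU with AI Inference Server.
# GPU device, port, and cache path are illustrative assumptions.
podman run --rm -it \
  --device nvidia.com/gpu=all \
  -e HF_TOKEN=$HF_TOKEN \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/opt/app-root/src/.cache/huggingface \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:{rhaiis-version} \
  --model RedHatAI/Ministral-3-14B-Instruct-2512 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tensor-parallel-size 1
```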
- --tokenizer-mode mistral specifies Mistral’s native tokenizer implementation to use for converting text to tokens. Mistral models use a specialized tokenizer called Tekken that differs from standard Hugging Face tokenizers in how it handles special tokens and chat formatting. This flag is required because AI Inference Server does not auto-detect the Mistral tokenizer.
- --config-format mistral tells AI Inference Server to read the model configuration from Mistral’s native params.json file instead of the standard Hugging Face config.json. Mistral models use a specific configuration schema that differs from standard Hugging Face formats.
- --tensor-parallel-size 1 configures AI Inference Server to serve the model on a single AI accelerator. Adjust this parameter based on the number of required AI accelerators. The default value is 1.
Note: The number of AI accelerators you need depends on your use case, the available host memory, and specific model requirements.
- --load-format mistral controls how the model weights are loaded from disk. This flag tells AI Inference Server to expect and properly load the native Mistral weight format.
- Optional: For memory-constrained environments, reduce the maximum context length:
--max-model-len 32768
This reduces memory usage at the cost of a shorter context window.
Verification
In a separate tab in your terminal, make a request to the model with the API:
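The request example is not reproduced in this chunk; the following is a minimal sketch against the OpenAI-compatible chat completions endpoint that AI Inference Server exposes. The host, port, and prompt are assumptions; adjust them to match your deployment:

```shell
# Sketch: query the OpenAI-compatible chat completions endpoint.
# localhost:8000 is an assumed server address for this example.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Ministral-3-14B-Instruct-2512",
        "messages": [
          {"role": "user", "content": "What is the capital of France?"}
        ]
      }'
```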
You should receive a JSON response with an array of choices containing the model output, with minimal latency suitable for your edge requirements.