Chapter 2. Deploying Mistral Large 3 with Red Hat AI Inference Server
Deploy the RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model using Red Hat AI Inference Server and an NVIDIA CUDA multi-accelerator host configured for tensor parallelism.
The RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model is compressed with the Red Hat AI Model Optimization Toolkit and runs at roughly three-quarters of the memory usage of the baseline model while maintaining near-baseline quality.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with 8 NVIDIA H200 AI accelerators installed.
- You have installed the relevant NVIDIA drivers.
- You have installed the NVIDIA Container Toolkit.
- NVIDIA Fabric Manager is installed and running with NVSwitch.

  Note: You must have root access to start Fabric Manager.
Procedure
Log in to the Red Hat container registry:

    podman login registry.redhat.io

Pull the Red Hat AI Inference Server container image:

    podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0

Configure SELinux to allow container device access:

    sudo setsebool -P container_use_devices 1

Create a cache directory for model weights and make it group-writable:

    mkdir -p rhaiis-cache
    chmod g+rwX rhaiis-cache

Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>

Check that the Red Hat AI Inference Server container can access NVIDIA AI accelerators on the host by running the following command:

    podman run -it \
      --device nvidia.com/gpu=all \
      --entrypoint /usr/bin/nvidia-smi \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0

All AI accelerators should be reported as Enabled.

Start AI Inference Server with the Mistral Large 3 model. The following options configure the server:
- --device nvidia.com/gpu=all provides access to all available AI accelerators.
- --shm-size=4g allocates 4 GB of shared memory for inter-process communication.
- --tokenizer-mode mistral specifies Mistral's native tokenizer implementation for converting text to tokens. Mistral models use a specialized tokenizer called Tekken that differs from standard Hugging Face tokenizers in how it handles special tokens and chat formatting. This flag is required because AI Inference Server does not auto-detect the Mistral tokenizer.
- --config-format mistral tells AI Inference Server to read the model configuration from Mistral's native params.json file instead of the standard Hugging Face config.json. Mistral models use a specific configuration schema that differs from standard Hugging Face formats.
- --load-format mistral tells AI Inference Server to load model weights from Mistral's native consolidated.safetensors checkpoint format instead of the standard Hugging Face sharded safetensors files.
- --tensor-parallel-size 8 distributes the model across 8 AI accelerators. The Mistral Large 3 675B model requires 8 AI accelerators due to its size.
- --kv-cache-dtype fp8 reduces memory usage by quantizing the KV cache to FP8.
Note: If you are using AI accelerators with less memory than the NVIDIA H200, such as the NVIDIA A100, you might need to lower the maximum context length to avoid out-of-memory errors. Add the --max-model-len argument to reduce the context length, for example --max-model-len 225000. Alternatively, you can adjust the --gpu-memory-utilization argument to control how much GPU memory is reserved for model weights and KV cache.
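Combining the options described above, the serve command can be sketched as follows. This is a sketch, not the exact command from the product documentation: the host port mapping, the cache mount path inside the container, and the use of `--rm` are assumptions, so adjust them for your environment.

```shell
# Sketch only: assembles the documented flags for serving the model.
# The published port (8000), HF_TOKEN pass-through, and cache volume
# mount path are assumptions not confirmed by this chapter.
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size=4g \
  -p 8000:8000 \
  --env "HF_TOKEN=$HF_TOKEN" \
  -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  --model RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8
```

Leave the container running in this terminal; the model weights are downloaded to the cache directory on first start, which can take a long time for a model of this size.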
Verification
In a separate tab in your terminal, make a request to the model with the API. The server returns a JSON response containing the model output.
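As a sketch of such a request, the following uses the OpenAI-compatible chat completions endpoint that AI Inference Server exposes. The host and port (localhost:8000) are assumptions that depend on how you published the container's port when starting the server.

```shell
# Sketch only: assumes the server is reachable on localhost:8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {"role": "user", "content": "Briefly explain tensor parallelism."}
        ]
      }' | python3 -m json.tool
```

Piping through `python3 -m json.tool` pretty-prints the JSON response; the generated text appears in the choices field of the response body.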