Chapter 2. Deploying Mistral Large 3 with Red Hat AI Inference Server


Deploy the RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model using Red Hat AI Inference Server and an NVIDIA CUDA multi-accelerator host configured for tensor parallelism.

Note

The RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model is compressed with the Red Hat AI Model Optimization Toolkit. It uses approximately three-quarters of the memory of the baseline model while maintaining near-baseline quality.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with 8 NVIDIA H200 AI accelerators installed.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.
  • NVIDIA Fabric Manager is installed and running with NVSwitch.

    Note

    You must have root access to start Fabric Manager.
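    Before you begin, you can confirm that Fabric Manager is running by querying its systemd service. The unit name nvidia-fabricmanager is the default shipped with NVIDIA's Fabric Manager packages; adjust it if your distribution uses a different name:

    ```shell
    # Check that the NVIDIA Fabric Manager service is active.
    # "nvidia-fabricmanager" is the default systemd unit name for NVIDIA's packages.
    sudo systemctl is-active nvidia-fabricmanager
    ```

    The command prints active when the service is running.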

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the Red Hat AI Inference Server container image:

    podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
  3. Configure SELinux to allow container device access:

    sudo setsebool -P container_use_devices 1
  4. Create a cache directory for model weights:

    mkdir -p rhaiis-cache
    chmod g+rwX rhaiis-cache
  5. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
  6. Check that the Red Hat AI Inference Server container can access NVIDIA AI accelerators on the host by running the following command:

    podman run -it \
      --device nvidia.com/gpu=all \
      --entrypoint /usr/bin/nvidia-smi \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0

    The nvidia-smi output should list all of the AI accelerators that are installed on the host.
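
    Optionally, you can print a compact per-accelerator summary instead of the full report. The nvidia-smi query flags shown here are standard, but the fields you query can be adjusted as needed:

    ```shell
    # List index, name, and total memory for each accelerator visible to the container.
    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --entrypoint /usr/bin/nvidia-smi \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
      --query-gpu=index,name,memory.total --format=csv
    ```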

  7. Start AI Inference Server with the Mistral Large 3 model:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
        --model RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --tensor-parallel-size 8 \
        --kv-cache-dtype fp8 \
        --host 0.0.0.0 \
        --port 8000
    • --device nvidia.com/gpu=all provides access to all available AI accelerators.
    • --shm-size=4g allocates 4 GB of shared memory for inter-process communication.
    • --tokenizer-mode mistral specifies Mistral’s native tokenizer implementation for converting text to tokens. Mistral models use a specialized tokenizer called Tekken that differs from standard Hugging Face tokenizers in how it handles special tokens and chat formatting. This flag is required because AI Inference Server does not auto-detect the Mistral tokenizer.
    • --config-format mistral tells AI Inference Server to read the model configuration from Mistral’s native params.json file instead of the standard Hugging Face config.json. Mistral models use a specific configuration schema that differs from standard Hugging Face formats.
    • --load-format mistral tells AI Inference Server to load model weights from Mistral’s native consolidated.safetensors checkpoint format instead of the standard Hugging Face sharded safetensors files.
    • --tensor-parallel-size 8 distributes the model across 8 AI accelerators. The Mistral Large 3 675B model requires 8 AI accelerators due to its size.
    • --kv-cache-dtype fp8 reduces memory usage by quantizing the KV cache to FP8.
    Note

    If you are using AI accelerators with less memory than NVIDIA H200, such as NVIDIA A100, you might need to lower the maximum context length to avoid out-of-memory errors. Add the --max-model-len argument to reduce the context length, for example --max-model-len 225000. Alternatively, you can adjust the --gpu-memory-utilization argument to control how much GPU memory is reserved for model weights and KV cache.
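
    For example, the serve command in step 7 can be extended with both arguments. The values below are illustrative, not recommendations; tune them for your hardware:

    ```shell
    # Illustrative fragment: append these arguments to the serve command in step 7.
    # --max-model-len caps the maximum context length in tokens.
    # --gpu-memory-utilization (0.0 to 1.0) caps the fraction of accelerator
    # memory reserved for model weights and KV cache.
        --max-model-len 225000 \
        --gpu-memory-utilization 0.90 \
    ```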

Verification

  1. In a separate terminal tab, make a request to the model API:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {"role": "user", "content": "Hello, how are you?"}
        ],
        "max_tokens": 100
      }'

    The server returns a JSON response containing the model output.
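
    To use the response in a script, you can extract the generated text with jq, assuming jq is installed. The file sample-response.json below is a hypothetical stand-in for output captured from the curl command above; in practice you would pipe the curl output directly into jq:

    ```shell
    # sample-response.json stands in for a response captured from the server;
    # its content here is illustrative only.
    cat > sample-response.json <<'EOF'
    {"choices": [{"message": {"role": "assistant", "content": "Hello! I am doing well."}}]}
    EOF

    # Extract the assistant's reply from the chat completion response.
    jq -r '.choices[0].message.content' sample-response.json
    ```

    Against a running server, the equivalent one-liner is to append | jq -r '.choices[0].message.content' to the curl command.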
