Chapter 2. Deploying Mistral Large 3 with Red Hat AI Inference Server


Deploy the RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model using Red Hat AI Inference Server and an NVIDIA CUDA multi-accelerator host configured for tensor parallelism.

Note

The RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model is compressed with the Red Hat AI Model Optimization Toolkit. It uses approximately three-quarters of the memory of the baseline model while maintaining near-baseline quality.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with 8 NVIDIA H200 AI accelerators installed.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.
  • NVIDIA Fabric Manager is installed and running with NVSwitch.

    Note

    You must have root access to start Fabric Manager.
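    Before you begin, you can confirm that Fabric Manager is running by querying its systemd service. The unit name nvidia-fabricmanager is the default shipped with NVIDIA's Fabric Manager packages; adjust it if your distribution uses a different name:

    ```shell
    # Check that the NVIDIA Fabric Manager service is active.
    # "nvidia-fabricmanager" is the default systemd unit name for NVIDIA's packages.
    sudo systemctl is-active nvidia-fabricmanager
    ```

    The command prints active when the service is running.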

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the Red Hat AI Inference Server container image:

    podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
  3. Configure SELinux to allow container device access:

    sudo setsebool -P container_use_devices 1
  4. Create a cache directory for model weights:

    mkdir -p rhaiis-cache
    chmod g+rwX rhaiis-cache
  5. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
  6. Check that the Red Hat AI Inference Server container can access NVIDIA AI accelerators on the host by running the following command:

    podman run -it \
      --device nvidia.com/gpu=all \
      --entrypoint /usr/bin/nvidia-smi \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0

    The nvidia-smi output should list all of the AI accelerators that are installed on the host.
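
    Optionally, you can print a compact per-accelerator summary instead of the full report. The nvidia-smi query flags shown here are standard, but the fields you query can be adjusted as needed:

    ```shell
    # List index, name, and total memory for each accelerator visible to the container.
    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --entrypoint /usr/bin/nvidia-smi \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
      --query-gpu=index,name,memory.total --format=csv
    ```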

  7. Start AI Inference Server with the Mistral Large 3 model:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
        --model RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --tensor-parallel-size 8 \
        --kv-cache-dtype fp8 \
        --host 0.0.0.0 \
        --port 8000
    • --device nvidia.com/gpu=all provides access to all available AI accelerators.
    • --shm-size=4g allocates 4 GB of shared memory for inter-process communication.
    • --tokenizer-mode mistral specifies Mistral’s native tokenizer implementation for converting text to tokens. Mistral models use a specialized tokenizer called Tekken that differs from standard Hugging Face tokenizers in how it handles special tokens and chat formatting. This flag is required because AI Inference Server does not auto-detect the Mistral tokenizer.
    • --config-format mistral tells AI Inference Server to read the model configuration from Mistral’s native params.json file instead of the standard Hugging Face config.json. Mistral models use a specific configuration schema that differs from standard Hugging Face formats.
    • --load-format mistral tells AI Inference Server to load model weights from Mistral’s native consolidated.safetensors checkpoint format instead of the standard Hugging Face sharded safetensors files.
    • --tensor-parallel-size 8 distributes the model across 8 AI accelerators. The Mistral Large 3 675B model requires 8 AI accelerators due to its size.
    • --kv-cache-dtype fp8 reduces memory usage by quantizing the KV cache to FP8.
    Note

    If you are using AI accelerators with less memory than NVIDIA H200, such as NVIDIA A100, you might need to lower the maximum context length to avoid out-of-memory errors. Add the --max-model-len argument to reduce the context length, for example --max-model-len 225000. Alternatively, you can adjust the --gpu-memory-utilization argument to control how much GPU memory is reserved for model weights and KV cache.
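
    For example, the serve command in step 7 can be extended with both arguments. The values below are illustrative, not recommendations; tune them for your hardware:

    ```shell
    # Illustrative fragment: append these arguments to the serve command in step 7.
    # --max-model-len caps the maximum context length in tokens.
    # --gpu-memory-utilization (0.0 to 1.0) caps the fraction of accelerator
    # memory reserved for model weights and KV cache.
        --max-model-len 225000 \
        --gpu-memory-utilization 0.90 \
    ```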

Verification

  1. In a separate terminal tab, make a request to the model API:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {"role": "user", "content": "Hello, how are you?"}
        ],
        "max_tokens": 100
      }'

    The server returns a JSON response containing the model output.
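
    To use the response in a script, you can extract the generated text with jq, assuming jq is installed. The file sample-response.json below is a hypothetical stand-in for output captured from the curl command above; in practice you would pipe the curl output directly into jq:

    ```shell
    # sample-response.json stands in for a response captured from the server;
    # its content here is illustrative only.
    cat > sample-response.json <<'EOF'
    {"choices": [{"message": {"role": "assistant", "content": "Hello! I am doing well."}}]}
    EOF

    # Extract the assistant's reply from the chat completion response.
    jq -r '.choices[0].message.content' sample-response.json
    ```

    Against a running server, the equivalent one-liner is to append | jq -r '.choices[0].message.content' to the curl command.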
