Inference serving Mistral 3 models


Red Hat AI Inference Server 3.3

Inference serving Mistral 3 models with Red Hat AI Inference Server

Red Hat AI Documentation Team

Abstract

Learn about inference serving Mistral 3 models, including the Mistral Large 3 Mixture-of-Experts model and the Ministral 3 dense model family optimized for edge deployments.

Preface

Deploy Mistral 3 models with Red Hat AI Inference Server, including the Mistral Large 3 Mixture-of-Experts model and the Ministral 3 dense model family optimized for edge deployments. The Mistral 3 family includes models released under the Apache 2.0 license with open weights, suitable for on-premise and hybrid-cloud deployments. All models include native multimodal capabilities, tool calling support, and large context windows.

Chapter 1. About Mistral 3 large language models

The Mistral 3 model family includes Mistral Large 3 and the Ministral 3 series, providing enterprise-ready large language models optimized for diverse deployment scenarios from single-node edge devices to multi-GPU clusters.

All Mistral 3 models are released under the Apache 2.0 license with open weights, making them suitable for on-premises and hybrid-cloud deployments. The models are fully compatible with upstream vLLM and require no custom forks for deployment with Red Hat AI Inference Server.

Table 1.1. Mistral 3 model architectures

Architecture  Models                  Characteristics
Sparse MoE    Mistral Large 3         Activates a subset of experts per token, is efficient at scale, and requires multi-accelerator deployments.
Dense         Ministral 3B, 8B, 14B   All parameters are active per token. Suitable for single-accelerator deployments on smaller capacity AI accelerators.

Mistral Large 3

Mistral Large 3 is designed for demanding enterprise workloads, delivering strong performance on advanced reasoning and analytical tasks. It supports multi-turn dialogue and vision–language use cases, including document understanding.

Architecturally, it follows a DeepSeekV3-style mixture-of-experts design, but with fewer, larger experts. It applies top-4 expert selection with softmax-based routing to balance efficiency and capability during inference. This softmax routing, together with its use of Llama 4 RoPE scaling, further distinguishes it from DeepSeekV3.
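The top-4 softmax routing described above can be sketched in a few lines. The following is a minimal illustration of top-k expert selection, not the model's actual implementation; the expert count of 8 and the logit values are assumptions chosen for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=4):
    """Select the top-k experts for one token and return
    (expert_index, normalized_weight) pairs."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    selected_mass = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / selected_mass) for i in top]

# Example: one token's router logits over 8 hypothetical experts.
logits = [0.1, 2.3, -0.5, 1.7, 0.0, 3.1, -1.2, 0.9]
selected = route_token(logits, k=4)
print(selected)  # 4 (index, weight) pairs; the token is sent only to these experts
```

Each token's hidden state is then dispatched only to the selected experts, and their outputs are combined using the renormalized weights, which is what makes the sparse design efficient at scale.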

Ministral 3 models

Ministral 3 models are suited for edge deployments with limited GPU resources, latency-sensitive applications, and mobile and embedded AI workloads. Each model includes a built-in vision encoder for multimodal input processing. All models support a 256K context window and include multilingual capabilities. Ministral 3 dense models are released in the following variants:

  • Ministral 3 14B is the highest capability dense model, suitable for complex reasoning tasks.
  • Ministral 3 8B has separate embedding and output layers, with balanced performance and resource usage.
  • Ministral 3 3B uses tied embeddings with shared embedding and output layers for reduced memory footprint.
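The memory effect of tied embeddings can be estimated with simple parameter arithmetic. The dimensions below are hypothetical values chosen for illustration only, not the actual Ministral 3 configuration.

```python
def embedding_params(vocab_size, hidden_dim, tied):
    """Parameter count for the input embedding plus output (LM head) layers.
    With tied embeddings, one matrix serves both roles."""
    per_matrix = vocab_size * hidden_dim
    return per_matrix if tied else 2 * per_matrix

# Hypothetical dimensions for illustration only.
vocab, hidden = 131_072, 3_072
separate = embedding_params(vocab, hidden, tied=False)
tied = embedding_params(vocab, hidden, tied=True)
print(f"separate: {separate:,} params, tied: {tied:,} params")
```

Tying halves the embedding-related parameter count, which matters most in smaller models where the embedding matrices are a large fraction of total parameters.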

Deploy the RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model using Red Hat AI Inference Server and an NVIDIA CUDA multi-accelerator host configured for tensor parallelism.

Note

The RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model is compressed with the Red Hat AI Model Optimization Toolkit and uses approximately three quarters of the memory of the baseline model while maintaining near-baseline quality.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with 8 NVIDIA H200 AI accelerators installed.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.
  • You have installed and started NVIDIA Fabric Manager on systems with NVSwitch.

    Note

    You must have root access to start Fabric Manager.

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the Red Hat AI Inference Server container image:

    podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
  3. Configure SELinux to allow container device access:

    sudo setsebool -P container_use_devices 1
  4. Create a cache directory for model weights:

    mkdir -p rhaiis-cache
    chmod g+rwX rhaiis-cache
  5. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
  6. Check that the Red Hat AI Inference Server container can access NVIDIA AI accelerators on the host by running the following command:

    podman run -it \
      --device nvidia.com/gpu=all \
      --entrypoint /usr/bin/nvidia-smi \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0

    The command output lists all AI accelerators available to the container.

  7. Start AI Inference Server with the Mistral Large 3 model:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
        --model RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --tensor-parallel-size 8 \
        --kv-cache-dtype fp8 \
        --host 0.0.0.0 \
        --port 8000
    • --device nvidia.com/gpu=all provides access to all available AI accelerators.
    • --shm-size=4g allocates 4 GB of shared memory for inter-process communication.
    • --tokenizer-mode mistral specifies Mistral’s native tokenizer implementation for converting text to tokens. Mistral models use a specialized tokenizer called Tekken that differs from standard HuggingFace tokenizers in how it handles special tokens and chat formatting. This flag is required because AI Inference Server does not auto-detect the Mistral tokenizer.
    • --config-format mistral tells AI Inference Server to read the model configuration from Mistral’s native params.json file instead of the standard HuggingFace config.json. Mistral models use a specific configuration schema that differs from standard HuggingFace formats.
    • --load-format mistral tells AI Inference Server to load model weights from Mistral’s native consolidated.safetensors checkpoint format instead of the standard HuggingFace sharded safetensors files.
    • --tensor-parallel-size 8 distributes the model across 8 AI accelerators. The Mistral Large 3 675B model requires 8 AI accelerators due to its size.
    • --kv-cache-dtype fp8 reduces memory usage by quantizing the KV cache to FP8.
    Note

    If you are using AI accelerators with less memory than NVIDIA H200, such as NVIDIA A100, you might need to lower the maximum context length to avoid out-of-memory errors. Add the --max-model-len argument to reduce the context length, for example --max-model-len 225000. Alternatively, you can adjust the --gpu-memory-utilization argument to control how much GPU memory is reserved for model weights and KV cache.
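The memory saved by --kv-cache-dtype fp8 can be estimated with simple arithmetic: FP8 uses 1 byte per element where FP16 uses 2. The layer and head dimensions below are hypothetical values for illustration only, not the actual Mistral Large 3 configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem):
    """Total KV cache size for one sequence: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical dimensions for illustration only.
layers, kv_heads, head_dim, ctx = 60, 8, 128, 32_768
fp16 = kv_cache_bytes(layers, kv_heads, head_dim, ctx, bytes_per_elem=2)
fp8 = kv_cache_bytes(layers, kv_heads, head_dim, ctx, bytes_per_elem=1)
print(f"fp16: {fp16 / 2**30:.1f} GiB, fp8: {fp8 / 2**30:.1f} GiB per sequence")
```

Halving the per-element size doubles the context length or batch size that fits in the same KV cache budget, which is why the flag helps on memory-constrained accelerators.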

Verification

  1. In a separate tab in your terminal, make a request to the model with the API:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {"role": "user", "content": "Hello, how are you?"}
        ],
        "max_tokens": 100
      }'

    The server returns a JSON response containing the model output.
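The response body follows the OpenAI chat completion schema. The sketch below parses a trimmed example response to extract the generated text; the field values shown are illustrative, not actual server output.

```python
import json

# A trimmed example of an OpenAI-style chat completion response body.
response_body = """
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! I am doing well."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20}
}
"""

def extract_text(body: str) -> str:
    """Return the assistant message content from the first choice."""
    data = json.loads(body)
    return data["choices"][0]["message"]["content"]

print(extract_text(response_body))
```

The finish_reason field indicates why generation stopped, for example "stop" for a natural end or "length" when max_tokens was reached.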

Deploy the RedHatAI/Ministral-3-14B-Instruct-2512 dense model optimized for latency-sensitive and edge deployments using Red Hat AI Inference Server on a single GPU.

Note

Ministral 3 14B offers frontier-class performance with vision capabilities. RedHatAI/Ministral-3-14B-Instruct-2512 is the instruct post-trained version quantized to FP8.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with at least one NVIDIA AI accelerator installed with a minimum of 24 GB of VRAM.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the AI Inference Server container image:

    podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
  3. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<your_huggingface_token>
  4. Start the inference server with your selected Ministral 3 model:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=2g \
      -p 8000:8000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
        --model RedHatAI/Ministral-3-14B-Instruct-2512 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --tensor-parallel-size 1 \
        --host 0.0.0.0 \
        --port 8000
    • --tokenizer-mode mistral specifies Mistral’s native tokenizer implementation to use for converting text to tokens. Mistral models use a specialized tokenizer called Tekken that differs from standard HuggingFace tokenizers in how it handles special tokens and chat formatting. This flag is required because AI Inference Server does not auto-detect the Mistral tokenizer.
    • --config-format mistral tells AI Inference Server to read the model configuration from Mistral’s native params.json file instead of the standard HuggingFace config.json. Mistral models use a specific configuration schema that differs from standard HuggingFace formats.
    • --tensor-parallel-size 1 configures AI Inference Server to serve the model on a single AI accelerator. Adjust this parameter based on the number of required AI accelerators. The default value is 1.

      Note

      The number of AI accelerators you need depends on your use case, the available host memory, and specific model requirements.

    • --load-format mistral controls how the model weights are loaded from disk. This flag tells AI Inference Server to expect and properly load the native Mistral weight format.
  5. Optional: For memory-constrained environments, reduce the maximum context length:

    --max-model-len 32768

    This reduces memory usage at the cost of shorter context windows.

Verification

  1. In a separate tab in your terminal, make a request to the model with the API:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Ministral-3-14B-Instruct-2512",
        "messages": [
          {"role": "user", "content": "Explain edge computing in one sentence."}
        ],
        "max_tokens": 50
      }'

    The server returns a JSON response with a choices array containing the model output.

Configure Mistral 3 models to process image inputs alongside text for vision-language tasks such as image analysis and document understanding.

All Mistral 3 models include built-in vision encoders that process images at their native resolution and aspect ratio.

Prerequisites

  • You have deployed a Mistral 3 model with Red Hat AI Inference Server.

Procedure

  1. Start the inference server with multimodal input enabled:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
        --model RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --tensor-parallel-size 8 \
        --limit-mm-per-prompt '{"image":10}' \
        --host 0.0.0.0 \
        --port 8000
    • --limit-mm-per-prompt '{"image":10}': sets the maximum number of images per prompt to 10. Adjust based on your use case and available memory.
    Note

    If you are using AI accelerators with less memory than NVIDIA H200, such as NVIDIA A100, you might need to lower the maximum context length to avoid out-of-memory errors. Add the --max-model-len argument to reduce the context length, for example --max-model-len 225000. Alternatively, you can adjust the --gpu-memory-utilization argument to control how much GPU memory is reserved for model weights and KV cache.

  2. Optional: To run a multimodal model in text-only mode, disable image processing to free GPU memory:

    --limit-mm-per-prompt '{"image":0}'

Verification

  1. Check that the model can process an image URL. For example, run the following command:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "What is shown in this image?"},
              {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Cute_dog.jpg/1280px-Cute_dog.jpg"}}
            ]
          }
        ],
        "max_tokens": 200
      }'
  2. Alternatively, send an image as base64-encoded data:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "Describe this chart."},
              {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_ENCODED_IMAGE_DATA>"}}
            ]
          }
        ],
        "max_tokens": 500
      }'
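Building the base64 data URL shown above is straightforward application-side. The sketch below is a minimal helper, assuming a PNG image; the stand-in bytes and function names are illustrative.

```python
import base64

def image_to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL for an image_url content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def build_image_message(prompt: str, image_bytes: bytes) -> dict:
    """Build a multimodal user message mixing text and one image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": image_to_data_url(image_bytes)}},
        ],
    }

# Stand-in bytes; in practice read the file, e.g. open("chart.png", "rb").read()
msg = build_image_message("Describe this chart.", b"\x89PNG\r\n")
print(msg["content"][1]["image_url"]["url"][:30])
```

The resulting dictionary can be serialized with json.dumps and sent as the messages entry in the request body.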

Configure a Mistral 3 model deployment to use tool calling with the vLLM OpenAI-compatible API.

Tool calling enables the model to request that your application execute an external function by returning a structured tool_calls object. Your application runs the tool and sends the result back to the model to continue the conversation.
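The request/execute/respond cycle can be sketched application-side as follows. This is a minimal illustration: the get_weather function, the TOOLS registry, and the example assistant message are hypothetical, and the structure mirrors the OpenAI-style tool_calls format used in the verification steps below.

```python
import json

# A hypothetical local tool the application exposes to the model.
def get_weather(location: str) -> str:
    return f"The weather in {location} is 18°C and sunny."

TOOLS = {"get_weather": get_weather}

def handle_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and build the 'tool' role messages
    that are appended to the conversation and sent back to the model."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": fn(**args),
        })
    return results

# Example assistant message, as returned in a tool_calls response.
assistant = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": "{\"location\": \"Paris\"}"},
    }],
}
print(handle_tool_calls(assistant))
```

The returned tool messages, together with the original user message and the assistant's tool_calls message, form the follow-up request that lets the model produce its final answer.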

Prerequisites

  • You have deployed a Mistral 3 Instruct model with Red Hat AI Inference Server.
  • You have defined one or more tools that the model is allowed to call.
  • You are running the vLLM serving container included with AI Inference Server.

Procedure

  1. Deploy the RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 model with AI Inference Server and enable tool calling:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
        --model RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --enable-auto-tool-choice \
        --tool-call-parser mistral \
        --host 0.0.0.0 \
        --port 8000
    • --enable-auto-tool-choice allows the server to return tool calls automatically when the model requests them.
    • --tool-call-parser mistral uses Mistral’s native tool calling format for parsing tool calls.

Verification

  1. Send a chat completion request that includes tool definitions, for example:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {
            "role": "user",
            "content": "What is the weather in Paris right now?"
          }
        ],
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "description": "Get the current weather for a location",
              "parameters": {
                "type": "object",
                "properties": {
                  "location": {
                    "type": "string",
                    "description": "The city name"
                  }
                },
                "required": ["location"]
              }
            }
          }
        ],
        "tool_choice": "auto"
      }'

    If the model decides a tool is needed, the response includes a tool_calls array instead of a final answer.

    Note

    Tool execution is performed by your application, not by the model. The model only generates a structured request describing which tool to call and which arguments to use.

  2. Execute the requested tool in your application and send the tool result back to the model. For example:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [
          {
            "role": "user",
            "content": "What is the weather in Paris right now?"
          },
          {
            "role": "assistant",
            "content": null,
            "tool_calls": [
              {
                "id": "call_123",
                "type": "function",
                "function": {
                  "name": "get_weather",
                  "arguments": "{\"location\": \"Paris\"}"
                }
              }
            ]
          },
          {
            "role": "tool",
            "tool_call_id": "call_123",
            "content": "The weather in Paris is 18°C and sunny."
          }
        ]
      }'

    The model uses the tool output to generate a final natural language response and returns it as JSON.

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.