Chapter 3. Deploying Ministral 3 for edge workloads with Red Hat AI Inference Server


Deploy RedHatAI/Ministral-3-14B-Instruct-2512, a dense model optimized for latency-sensitive and edge deployments, on a single GPU by using Red Hat AI Inference Server.

Note

Ministral 3 14B offers frontier performance and includes vision capabilities. RedHatAI/Ministral-3-14B-Instruct-2512 is the instruction-tuned (post-trained) version, quantized to FP8.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with at least one NVIDIA AI accelerator installed with a minimum of 24 GB of VRAM.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.
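You can check the driver and VRAM prerequisites with nvidia-smi, which is installed with the NVIDIA drivers. A minimal sketch, using standard nvidia-smi query fields:

```shell
# Report each accelerator's name and total VRAM; if nvidia-smi is missing,
# the NVIDIA drivers are probably not installed.
if command -v nvidia-smi > /dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv
else
  echo "NVIDIA drivers not detected"
fi
```

On a correctly prepared host, the reported memory.total value should be at least 24 GB.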

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the AI Inference Server container image:

    podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:{rhaiis-version}
  3. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<your_huggingface_token>
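As a quick sanity check before starting the container, you can confirm that the variable is set and non-empty. The token value below is a placeholder, not a real token:

```shell
# Placeholder token for illustration; use your own Hugging Face token.
export HF_TOKEN=hf_example_token

# Fail early if the token was not exported.
if [ -z "$HF_TOKEN" ]; then
  echo "HF_TOKEN is not set" >&2
  exit 1
fi
echo "HF_TOKEN is set"
```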
  4. Start the inference server with your selected Ministral 3 model:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=2g \
      -p 8000:8000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:{rhaiis-version} \
        --model RedHatAI/Ministral-3-14B-Instruct-2512 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --tensor-parallel-size 1 \
        --host 0.0.0.0 \
        --port 8000
    • --tokenizer-mode mistral specifies Mistral’s native tokenizer implementation for converting text to tokens. Mistral models use a specialized tokenizer, called Tekken, that differs from standard Hugging Face tokenizers in how it handles special tokens and chat formatting. This flag is required because AI Inference Server does not auto-detect the Mistral tokenizer.
    • --config-format mistral tells AI Inference Server to read the model configuration from Mistral’s native params.json file instead of the standard Hugging Face config.json. Mistral models use a configuration schema that differs from the standard Hugging Face format.
    • --tensor-parallel-size 1 configures AI Inference Server to serve the model on a single AI accelerator. Adjust this parameter based on the number of required AI accelerators. The default value is 1.

      Note

      The number of AI accelerators you need depends on your use case, the available host memory, and specific model requirements.

    • --load-format mistral controls how the model weights are loaded from disk. This flag tells AI Inference Server to expect and properly load the native Mistral weight format.
  5. Optional: For memory-constrained environments, reduce the maximum context length:

    --max-model-len 32768

    This reduces memory usage at the cost of a shorter context window.
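The flag is passed as an additional vLLM argument after the image name, alongside the flags from step 4. A sketch of the combined command, using the same image tag and model as above:

```shell
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size=2g \
  -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:{rhaiis-version} \
    --model RedHatAI/Ministral-3-14B-Instruct-2512 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --host 0.0.0.0 \
    --port 8000
```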

Verification

  1. In a separate terminal session, send a request to the model with the API:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Ministral-3-14B-Instruct-2512",
        "messages": [
          {"role": "user", "content": "Explain edge computing in one sentence."}
        ],
        "max_tokens": 50
      }'

    You should receive a JSON response whose choices array contains the generated model output, confirming that the model is being served.
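To extract just the generated text from a response of this shape, you can pipe it through a short Python one-liner. The sample response below is illustrative, not actual model output:

```shell
# Illustrative response, shaped like the /v1/chat/completions output above.
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Edge computing processes data near its source."}}]}'

# Print only the assistant text from the first choice.
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```

Against a live server, replace the sample variable with the output of the curl command above.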
