Chapter 2. Using custom chat templates with AI Inference Server and Podman


You can use chat templates in Red Hat AI Inference Server to control how conversations and tool calls are formatted when inference serving a language model. You can use default templates provided with the model or mount custom templates for specialized tool calling formats.

Note

If you do not need a custom template, you can inference serve the model by referencing the appropriate template that is included in the AI Inference Server container image. The chat templates from the vLLM repository examples folder are included in the AI Inference Server container images at /opt/app-root/template/ by default.

Different models require specific combinations of tool calling parameters, chat templates, and parser settings. The default chat template is set in the chat_template field of the tokenizer_config.json file in the model repository. Not all models include a default chat template. If a model supports chat templates but does not provide one, you must create or mount a custom template.
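You can check for a default chat template with a few lines of standard-library Python. The has_default_chat_template helper below is illustrative, not part of AI Inference Server; it only inspects the chat_template field described above:

```python
import json
from pathlib import Path


def has_default_chat_template(model_dir):
    """Return True if tokenizer_config.json in model_dir defines a chat_template."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text())
    return bool(config.get("chat_template"))


if __name__ == "__main__":
    # Assumes the model was downloaded into ./model as in the procedure below.
    print(has_default_chat_template("model"))
```

If the helper returns False, create or mount a custom template as described in the procedure.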

For more information about vLLM tool calling configuration, see Tool calling.

Prerequisites

  • You have installed Podman or Docker.
  • You have deployed Red Hat AI Inference Server on a host that has one or more AI Accelerators configured.
  • You are inference serving a model that supports chat templates.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have configured SELinux to allow device access if your system has SELinux enabled.
  • You have created a rhaiis-cache/ folder to mount as a volume in the container and have adjusted the folder permissions so that the container can use it.
  • You are familiar with the Jinja2 template syntax and have created the custom chat template for the model.
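The last prerequisite assumes familiarity with Jinja2. As a minimal sketch of how a chat template turns a message list into a prompt string, the following renders a template with the jinja2 library; the <|...|> delimiter tokens are hypothetical placeholders, and a real template must match the model's actual special tokens:

```python
from jinja2 import Template

# Hypothetical template: wraps each message in role delimiters and
# optionally appends an assistant header to prompt generation.
template_source = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>\n{{ message['content'] }}<|end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>\n{% endif %}"
)

prompt = Template(template_source).render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    add_generation_prompt=True,
)
print(prompt)
```

vLLM renders the template that you pass with --chat-template in the same way, with additional variables such as the tools list available for tool calling templates.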

Procedure

  1. Create the model/ folder and download the model into it. For example, clone the Llama-3.2-1B-Instruct-FP8 model from Hugging Face:

    $ mkdir -p model && huggingface-cli download RedHatAI/Llama-3.2-1B-Instruct-FP8 --local-dir model
  2. Create a custom template file for the Llama-3.2-1B-Instruct-FP8 model and add it to the model folder. For example:

    $ curl -o model/custom_tool_chat_template_llama3.2_json.jinja \
    https://raw.githubusercontent.com/<CUSTOM_TEMPLATE_REPOSITORY>/custom_tool_chat_template_llama3.2_json.jinja
  3. Mount the model/ folder and specify the template path when starting AI Inference Server:

    $ podman run --rm -it \
    --userns=keep-id:uid=1001 \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g -p 8000:8000 \
    -e HF_HUB_OFFLINE=1 \
    -e TRANSFORMERS_OFFLINE=1 \
    -v ./model:/opt/app-root/model:z \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
    --model /opt/app-root/model \
    --chat-template /opt/app-root/model/custom_tool_chat_template_llama3.2_json.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser=llama3_json \
    --served-model-name RedHatAI/Llama-3.2-1B-Instruct-FP8 \
    --tensor-parallel-size 2
    Note

    The --tool-call-parser flag determines how AI Inference Server parses tool calls from the model output. The --enable-auto-tool-choice flag starts the server with automatic tool calling enabled.

Verification

  1. Verify that AI Inference Server started successfully with your custom template. For example:

    $ podman logs 2be60021cf8d | grep -i "chat template"
  2. Test the chat template with a simple API request to verify that the custom template is being used correctly. For example:

    $ curl http://127.0.0.1:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
        "messages": [
          {
            "role": "user",
            "content": "What is the weather like in Dublin today?"
          }
        ],
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "description": "Return the weather for a city.",
              "parameters": {
                "type": "object",
                "properties": {
                  "city": { "type": "string" }
                },
                "required": ["city"]
              }
            }
          }
        ],
        "tool_choice": "auto",
        "max_tokens": 256
      }' | jq
    Note

    The example output below shows the tool call information in the content field as a JSON string. Different parsers and chat templates may format tool calling responses differently. Some configurations return tool calls in the tool_calls array following the OpenAI API format, while others may include the tool call data in the message content.

    Example output

    {
      "id": "chatcmpl-8bf4da937fad4917b5dbccbf458716ce",
      "object": "chat.completion",
      "created": 1764324023,
      "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": \"Dublin\"}}}",
            "refusal": null,
            "annotations": null,
            "audio": null,
            "function_call": null,
            "tool_calls": [],
            "reasoning_content": null
          },
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": 128008,
          "token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 241,
        "total_tokens": 278,
        "completion_tokens": 37,
        "prompt_tokens_details": null
      },
      "prompt_logprobs": null,
      "prompt_token_ids": null,
      "kv_transfer_params": null
    }
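The curl request and the response-handling note above can be combined into a short Python sketch that uses only the standard library. The extract_tool_call helper is illustrative, not part of AI Inference Server or any client SDK; it handles both response shapes described in the note, the OpenAI-style tool_calls array and a JSON string in the content field:

```python
import json
import urllib.request

# Same tool calling request body as the curl example above.
payload = {
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "messages": [
        {"role": "user", "content": "What is the weather like in Dublin today?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Return the weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
    "max_tokens": 256,
}


def extract_tool_call(message):
    """Return the first tool call from a chat completion message.

    Handles both the OpenAI-style tool_calls array and configurations
    that leave the tool call as a JSON string in the content field.
    """
    if message.get("tool_calls"):
        call = message["tool_calls"][0]["function"]
        return {"name": call["name"], "arguments": json.loads(call["arguments"])}
    try:
        data = json.loads(message.get("content") or "")
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and data.get("type") == "function":
        return data.get("function")
    return None


if __name__ == "__main__":
    req = urllib.request.Request(
        "http://127.0.0.1:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            body = json.load(resp)
        print(extract_tool_call(body["choices"][0]["message"]))
    except OSError as exc:
        print(f"Request failed (is the server running?): {exc}")
```

Because the fallback path depends on the content field containing well-formed JSON, a client like this is also a quick way to confirm that your custom template produces parseable tool calls.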
