Chapter 2. Using custom chat templates with AI Inference Server and Podman
You can use chat templates in Red Hat AI Inference Server to control how conversations and tool calls are formatted when inference serving a language model. You can use default templates provided with the model or mount custom templates for specialized tool calling formats.
If you do not need a custom template, you can inference serve the model by referencing the appropriate template that is included in the AI Inference Server container image. The chat templates from the vLLM repository examples folder are included in the AI Inference Server container images at /opt/app-root/template/ by default.
Different models require specific tool calling parameters and combinations of template and parser settings. The default chat template is set in the chat_template field of the tokenizer_config.json file in the model repository. Not all models include a default chat template. If the model supports chat templates but does not provide one, you must create or mount a custom template.
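You can check whether a model ships a default chat template by reading the chat_template field from its tokenizer_config.json file. The following sketch creates a minimal sample file so the command is self-contained; with a real model, point the command at the tokenizer_config.json in your downloaded model folder instead. The sample template content is purely illustrative:

```shell
# Create a minimal sample tokenizer_config.json (illustrative content only)
mkdir -p model
printf '%s' '{"chat_template": "{% for m in messages %}{{ m.content }}{% endfor %}"}' > model/tokenizer_config.json

# Print the model's default chat template; prints "None" if the model does not ship one
python3 -c "import json; print(json.load(open('model/tokenizer_config.json')).get('chat_template'))"
```

If the command prints None, the model does not provide a default template and you must create or mount a custom one.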
For more information about vLLM tool calling configuration, see Tool calling.
Prerequisites
- You have installed Podman or Docker.
- You have deployed Red Hat AI Inference Server on a host that has one or more AI Accelerators configured.
- You are inference serving a model that supports chat templates.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have configured SELinux to allow device access if your system has SELinux enabled.
- You have created a rhaiis-cache/ folder for mounting as a volume in the container and have adjusted the container permissions so that the container can use it.
- You are familiar with the Jinja2 template syntax and have created the custom chat template for the model.
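To illustrate the shape of a Jinja2 chat template, the following sketch writes a minimal, hypothetical template that iterates over the conversation messages and emits role-tagged turns. This is not the actual Llama 3.2 tool calling template; a production template for a specific model must match that model's special tokens and tool calling format:

```shell
# Write a minimal, purely illustrative Jinja2 chat template.
# This is NOT the Llama 3.2 tool calling template; it only shows the
# basic shape: iterate over messages and emit role-tagged turns.
cat > custom_template.jinja <<'EOF'
{%- for message in messages -%}
<|{{ message['role'] }}|>
{{ message['content'] }}
{%- endfor -%}
{%- if add_generation_prompt -%}
<|assistant|>
{%- endif -%}
EOF

# Confirm the template file was written
wc -l custom_template.jinja
```

The messages variable and the add_generation_prompt flag are the standard inputs that chat template rendering passes to the template.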
Procedure
1. Create the model/ folder and download the model into it. For example, clone the Llama-3.2-1B-Instruct-FP8 model from Hugging Face:

   $ mkdir -p model && huggingface-cli download RedHatAI/Llama-3.2-1B-Instruct-FP8 --local-dir model

2. Create a custom template file for the Llama-3.2-1B-Instruct-FP8 model and add it to the model folder. For example:
   $ curl -o model/custom_tool_chat_template_llama3.2_json.jinja \
       https://raw.githubusercontent.com/<CUSTOM_TEMPLATE_REPOSITORY>/custom_tool_chat_template_llama3.2_json.jinja

3. Mount the model/ folder and specify the template path when starting AI Inference Server.

   Note: The --tool-call-parser flag determines how AI Inference Server parses tool calls from the model output. The --enable-auto-tool-choice flag starts the server with automatic tool calling enabled.
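The exact launch command depends on your accelerator and image. As a sketch, assuming a vLLM-style command line, the llama3_json parser commonly used with Llama 3.x models in upstream vLLM, and NVIDIA CDI device access; the image name and in-container mount paths are placeholders, not confirmed values:

```shell
# Sketch only: image name, device flag, and mount paths are assumptions
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  -p 8000:8000 \
  -v ./model:/opt/app-root/src/model:Z \
  -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
  <AI_INFERENCE_SERVER_IMAGE> \
  --model /opt/app-root/src/model \
  --chat-template /opt/app-root/src/model/custom_tool_chat_template_llama3.2_json.jinja \
  --tool-call-parser llama3_json \
  --enable-auto-tool-choice
```

The --chat-template path must point to the template file as it is visible inside the container, that is, under the mounted model/ folder.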
Verification
1. Verify that AI Inference Server started successfully with your custom template. For example:

   $ podman logs 2be60021cf8d | grep -i "chat template"

2. Test the chat template with a simple API request to verify that the custom template is being used correctly. For example:
   Note: Depending on the parser and chat template, the tool call information might be returned in the content field as a JSON string. Different parsers and chat templates may format tool calling responses differently. Some configurations return tool calls in the tool_calls array following the OpenAI API format, while others include the tool call data in the message content.
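As a sketch of such a test request, assuming the server listens on the default port 8000 and serves the model under the name used above; the get_weather tool definition is purely hypothetical and exists only to trigger a tool call:

```shell
# Illustrative tool calling request against a locally running server
# (assumes port 8000 and the model name below; the tool is hypothetical)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

Inspect the response to see whether the tool call appears in the tool_calls array or embedded in the message content, as described in the note above.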