Extending Red Hat AI Inference Server with tool calling capabilities
Configuring tool calling and chat templates for AI Inference Server
Chapter 1. About tool calling
Tool calling extends a model's functionality beyond text generation, allowing the model to perform actions such as querying databases, calling APIs, executing calculations, or retrieving real-time information.
Red Hat AI Inference Server provides comprehensive support for tool calling through the upstream vLLM project. With tool calling, the language model can understand when it needs to call an external tool and select the appropriate tool based on the context. The model generates properly formatted tool call requests and integrates the tool response back into the conversation flow.
Tool calling provides several advantages for building intelligent applications with Red Hat AI Inference Server:
- Provides access to real-time information beyond the model training data
- Enables the model to perform precise calculations and data processing
- Integrates with existing APIs and services
- Enhances reliability through structured function calls
- Separates reasoning and execution
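In the OpenAI-compatible API that AI Inference Server exposes, this flow is driven by JSON structures like the following sketch. The `get_weather` function, its arguments, and the tool result are illustrative assumptions, not part of AI Inference Server:

```json
{
  "messages": [
    {"role": "user", "content": "What is the weather in Boston?"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call_0",
          "type": "function",
          "function": {"name": "get_weather", "arguments": "{\"city\": \"Boston\"}"}
        }
      ]
    },
    {"role": "tool", "tool_call_id": "call_0", "content": "{\"temp_c\": 21}"}
  ]
}
```

The model emits the `tool_calls` entry, the client executes the named function, and the result is appended as a `tool` role message so the model can incorporate it into its next response.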
Chapter 2. Using custom chat templates with AI Inference Server and Podman
You can use chat templates in Red Hat AI Inference Server to control how conversations and tool calls are formatted when inference serving a language model. You can use default templates provided with the model or mount custom templates for specialized tool calling formats.
If you do not need a custom template, you can inference serve the model by referencing the appropriate template that is included in the AI Inference Server container image. The chat templates in the vLLM repository examples folder are included in the AI Inference Server container images at /opt/app-root/template/ by default.
Different models require specific tool calling parameters and combinations of template and parser settings. The default chat template is set in the chat_template field of the tokenizer_config.json file in the model repository. Not all models include a default chat template. If a model supports chat templates but does not provide one, you must create or mount a custom template.
For more information about vLLM tool calling configuration, see Tool calling.
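A custom chat template is a Jinja2 file that renders the message list into the prompt format the model expects. A minimal sketch, without tool calling support, looks like the following; real templates for tool calling models are considerably longer and model specific:

```jinja
{%- for message in messages %}
<|{{ message.role }}|>
{{ message.content }}
{%- endfor %}
{%- if add_generation_prompt %}
<|assistant|>
{%- endif %}
```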
Prerequisites
- You have installed Podman or Docker.
- You have deployed Red Hat AI Inference Server on a host that has one or more AI Accelerators configured.
- You are inference serving a model that supports chat templates.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have configured SELinux to allow device access if your system has SELinux enabled.
- You have created a `rhaiis-cache/` folder for mounting as a volume in the container and have adjusted the container permissions so that the container can use it.
- You are familiar with Jinja2 template syntax and have created the custom chat template for the model.
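The cache folder prerequisite can be prepared as follows. The permission bits shown are one workable choice, not a requirement from the product documentation:

```shell
# Create the cache folder that the container mounts as a volume,
# and make it group-accessible so the container user can write to it.
mkdir -p rhaiis-cache
chmod g+rwX rhaiis-cache
```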
Procedure
Create the `model/` folder and download the model into it. For example, clone the Llama-3.2-1B-Instruct-FP8 model from Hugging Face:

```
$ mkdir -p model && huggingface-cli download RedHatAI/Llama-3.2-1B-Instruct-FP8 --local-dir model
```

Create a custom template file for the Llama-3.2-1B-Instruct-FP8 model and add it to the `model/` folder. For example:

```
$ curl -o model/custom_tool_chat_template_llama3.2_json.jinja \
    https://raw.githubusercontent.com/<CUSTOM_TEMPLATE_REPOSITORY>/custom_tool_chat_template_llama3.2_json.jinja
```

Mount the `model/` folder and specify the template path when starting AI Inference Server.

Note: The `--tool-call-parser` flag determines how AI Inference Server parses tool calls from the model output. The `--enable-auto-tool-choice` flag starts the server with automatic tool calling enabled.
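The full serve command is not reproduced here. A minimal sketch of what it might look like, assuming an NVIDIA accelerator and a single exposed port; the image tag, mount paths, and device flags are illustrative and should be adjusted for your environment:

```shell
podman run --rm -it \
    --device nvidia.com/gpu=all \
    --shm-size=4g -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -v ./model:/opt/app-root/src/model:Z \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:latest \
    --model /opt/app-root/src/model \
    --chat-template /opt/app-root/src/model/custom_tool_chat_template_llama3.2_json.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json
```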
Verification
Verify that AI Inference Server started successfully with your custom template. For example:

```
$ podman logs 2be60021cf8d | grep -i "chat template"
```

Test the chat template with a simple API request to verify that the custom template is being used correctly.

Note: Different parsers and chat templates may format tool calling responses differently. Some configurations return tool calls in the `tool_calls` array following the OpenAI API format, while others include the tool call data as a JSON string in the message `content` field.
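The API request example is not reproduced here. A minimal sketch, assuming the server listens on localhost:8000 and using an illustrative `get_weather` function definition:

```shell
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/opt/app-root/src/model",
        "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                }
            }
        }],
        "tool_choice": "auto"
    }'
```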
Chapter 3. Tool call parser type reference
Tool call parsers determine how AI Inference Server interprets and extracts tool calling information from the model output. Each model family has a specific output format for tool calls, and you must configure a parser that matches that format. Specify the tool call parser with the `--tool-call-parser` flag when launching AI Inference Server. AI Inference Server supports multiple tool call parsers for different model families and output formats.
| Parser | Description | Example models |
|---|---|---|
| `hermes` | For Nous-Hermes models that use the Hermes tool calling format. | `NousResearch/Hermes-2-Pro-Llama-3-8B` |
| `mistral` | For Mistral models using Mistral's native tool calling format. | `mistralai/Mistral-7B-Instruct-v0.3` |
| `llama3_json` | For Llama 3.x models configured to output tool calls in JSON format. | `meta-llama/Llama-3.1-8B-Instruct` |
| `internlm` | For InternLM2 models using InternLM's tool calling format. | `internlm/internlm2_5-7b-chat` |
| `granite` | For IBM Granite function-calling models. | `ibm-granite/granite-3.0-8b-instruct` |
| | For Adept Fuyu models. | |
| | For Microsoft Phi-3 models configured for JSON output. | |
| `jamba` | For AI21 Labs Jamba models. | `ai21labs/AI21-Jamba-1.5-Mini` |
Using an incorrect parser can result in runtime errors such as the following:
- Failed tool call extraction
- Malformed tool call requests
- Errors during inference
- Unexpected model behavior
Always verify that the parser matches the expected tool calling format for the model.
Chapter 4. Tool choice options reference
You can configure how the language model decides when to call a tool by using the tool_choice parameter.
Red Hat AI Inference Server supports multiple tool choice modes that enable different levels of control over tool calling behavior. You set the tool choice value in the request that triggers the model to use the available tools.
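For example, a chat completion request body that sets `tool_choice` explicitly might look like the following sketch; the model path and the `get_weather` function are illustrative assumptions:

```json
{
  "model": "/opt/app-root/src/model",
  "messages": [
    {"role": "user", "content": "What is the weather in Boston?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
```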
Select the appropriate tool choice mode based on your application requirements:
| Mode | Behavior | Use case |
|---|---|---|
| `auto` | The model decides whether to call a tool based on the conversation context and available tools. | General purpose tool calling where the model determines when tools are needed. |
| `required` | The model must call at least one of the available tools and cannot respond without making a tool call. | Scenarios where a tool call is mandatory for the task, such as data retrieval or structured output generation, for example, database queries or API interactions. |
| `none` | The model does not call any tools and provides a text-based response only. | Disabling tool calling for specific requests while keeping tool definitions available for context. |
| `{"type": "function", "function": {"name": "<tool_name>"}}` | The model must call the specific named tool. | Forcing the model to use a particular tool, useful for structured data extraction or specific workflows. |
When using `tool_choice="required"`, ensure that at least one of the available tools is appropriate for the request. If no suitable tool exists, the model might call an inappropriate tool or generate errors.
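To force a specific tool instead, `tool_choice` accepts a named function object following the OpenAI API format; the `get_weather` name here is an illustrative assumption:

```json
{
  "tool_choice": {
    "type": "function",
    "function": {"name": "get_weather"}
  }
}
```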