Extending Red Hat AI Inference Server with tool calling capabilities


Red Hat AI Inference Server 3.3

Configuring tool calling and chat templates for AI Inference Server

Red Hat AI Documentation Team

Abstract

Learn how to configure and use tool calling features and chat templates with Red Hat AI Inference Server for enhanced language model functionality.

Chapter 1. About tool calling

Tool calling extends a model's functionality beyond text generation, enabling the model to perform actions such as querying databases, calling APIs, executing calculations, or retrieving real-time information.

Red Hat AI Inference Server provides comprehensive support for tool calling through the upstream vLLM project. With tool calling, the language model can understand when it needs to call an external tool and select the appropriate tool based on the context. The model generates properly formatted tool call requests and integrates the tool response back into the conversation flow.

Tool calling provides several advantages for building intelligent applications with Red Hat AI Inference Server:

  • Provides access to real-time information beyond the model training data
  • Enables the model to perform precise calculations and data processing
  • Integrates with existing APIs and services
  • Enhances reliability through structured function calls
  • Separates reasoning and execution
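
The tool calling flow described above (the model emits a structured call, the application executes it, and the result is fed back into the conversation) can be sketched in a few lines of Python. Everything here is hypothetical: get_weather stands in for a real external service, and the tool call dict mimics the OpenAI-compatible format that the server returns:

```python
import json

# Hypothetical local tool; a real application would call an external service.
def get_weather(city: str) -> str:
    return f"Sunny, 18 C in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch_tool_call(tool_call: dict) -> dict:
    """Run the tool the model requested and build the 'tool' message that is
    appended to the conversation before the next model turn."""
    fn = tool_call["function"]
    result = TOOLS[fn["name"]](**json.loads(fn["arguments"]))
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}

# A tool call in the OpenAI-compatible format returned by the server.
call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Dublin"}'},
}
tool_message = dispatch_tool_call(call)
print(tool_message["content"])  # Sunny, 18 C in Dublin
```

The key point of the separation is that the model only produces the structured request; the application owns execution and decides what goes back into the conversation.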

You can use chat templates in Red Hat AI Inference Server to control how conversations and tool calls are formatted when inference serving a language model. You can use default templates provided with the model or mount custom templates for specialized tool calling formats.

Note

If you do not need a custom template, you can inference serve the model by referencing the appropriate template that is included in the AI Inference Server container image. The chat templates in the vLLM repository examples folder are included in the AI Inference Server container images under /opt/app-root/template/ by default.

Different models require specific combinations of tool calling parameters, chat templates, and parser settings. The default chat template is set in the chat_template field of the tokenizer_config.json file in the model repository. Not all models include a default chat template. If a model supports chat templates but does not provide one, you must create or mount a custom template.
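
You can check whether a downloaded model ships a default template by reading the chat_template field directly. A small sketch, assuming the model files are already on local disk:

```python
import json
import tempfile
from pathlib import Path

def has_default_chat_template(model_dir: str) -> bool:
    """Return True when tokenizer_config.json defines a chat_template."""
    path = Path(model_dir) / "tokenizer_config.json"
    if not path.exists():
        return False
    config = json.loads(path.read_text())
    return bool(config.get("chat_template"))

# Demo against a throwaway model folder with a trivial template.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "tokenizer_config.json").write_text(
        json.dumps({"chat_template": "{{ messages }}"})
    )
    print(has_default_chat_template(d))  # True
```

If this returns False for your model, plan on creating or mounting a custom template as described in the procedure below.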

For more information about vLLM tool calling configuration, see Tool calling.

Prerequisites

  • You have installed Podman or Docker.
  • You have deployed Red Hat AI Inference Server on a host that has one or more AI Accelerators configured.
  • You are inference serving a model that supports chat templates.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have configured SELinux to allow device access if your system has SELinux enabled.
  • You have created a rhaiis-cache/ folder for mounting as a volume in the container and have adjusted the container permissions so that the container can use it.
  • You are familiar with the Jinja2 template syntax and have created the custom chat template for the model.
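
Because a malformed custom template typically only fails at serving time, a quick pre-flight check can help. The sketch below is a rough lint only (it counts delimiter pairs and would be confused by {% raw %} blocks); it is not a substitute for rendering the template with Jinja2:

```python
def check_template_delimiters(text: str) -> bool:
    """Cheap sanity check: Jinja2 statement, expression, and comment
    delimiters each open and close the same number of times."""
    for open_tok, close_tok in (("{%", "%}"), ("{{", "}}"), ("{#", "#}")):
        if text.count(open_tok) != text.count(close_tok):
            return False
    return True

print(check_template_delimiters("{% if tools %}{{ tools }}{% endif %}"))  # True
print(check_template_delimiters("{{ messages }"))                         # False
```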

Procedure

  1. Create the model/ folder and download the model into it. For example, clone the Llama-3.2-1B-Instruct-FP8 model from Hugging Face:

    $ mkdir -p model && huggingface-cli download RedHatAI/Llama-3.2-1B-Instruct-FP8 --local-dir model
  2. Create a custom template file for the Llama-3.2-1B-Instruct-FP8 model and add it to the model folder. For example:

    $ curl -o model/custom_tool_chat_template_llama3.2_json.jinja \
    https://raw.githubusercontent.com/<CUSTOM_TEMPLATE_REPOSITORY>/custom_tool_chat_template_llama3.2_json.jinja
  3. Mount the model/ folder and specify the template path when starting AI Inference Server:

    $ podman run --rm -it \
    --userns=keep-id:uid=1001 \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g -p 8000:8000 \
    -e HF_HUB_OFFLINE=1 \
    -e TRANSFORMERS_OFFLINE=1 \
    -v ./model:/opt/app-root/model:z \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
--model /opt/app-root/model \
    --chat-template /opt/app-root/model/custom_tool_chat_template_llama3.2_json.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser=llama3_json \
    --served-model-name RedHatAI/Llama-3.2-1B-Instruct-FP8 \
    --tensor-parallel-size 2
    Note

    The --tool-call-parser flag determines how AI Inference Server parses tool calls from the model output. The --enable-auto-tool-choice flag starts the server with automatic tool calling enabled.

Verification

  1. Verify that AI Inference Server started successfully with your custom template. For example:

    $ podman logs 2be60021cf8d | grep -i "chat template"
  2. Test the chat template with a simple API request to verify that the custom template is being used correctly. For example:

    $ curl http://127.0.0.1:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
        "messages": [
          {
            "role": "user",
            "content": "What is the weather like in Dublin today?"
          }
        ],
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "description": "Return the weather for a city.",
              "parameters": {
                "type": "object",
                "properties": {
                  "city": { "type": "string" }
                },
                "required": ["city"]
              }
            }
          }
        ],
        "tool_choice": "auto",
        "max_tokens": 256
      }' | jq
    Note

    The example output below shows the tool call information in the content field as a JSON string. Different parsers and chat templates may format tool calling responses differently. Some configurations return tool calls in the tool_calls array following the OpenAI API format, while others may include the tool call data in the message content.

    Example output

    {
      "id": "chatcmpl-8bf4da937fad4917b5dbccbf458716ce",
      "object": "chat.completion",
      "created": 1764324023,
      "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": \"Dublin\"}}}",
            "refusal": null,
            "annotations": null,
            "audio": null,
            "function_call": null,
            "tool_calls": [],
            "reasoning_content": null
          },
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": 128008,
          "token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 241,
        "total_tokens": 278,
        "completion_tokens": 37,
        "prompt_tokens_details": null
      },
      "prompt_logprobs": null,
      "prompt_token_ids": null,
      "kv_transfer_params": null
    }
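
Because the tool call can arrive either in the tool_calls array or as a JSON string in content, a robust client handles both shapes. A hedged sketch of such a fallback parser; the parameters and arguments key names follow the formats shown above, but other templates and parsers may differ:

```python
import json

def extract_tool_call(message: dict):
    """Return (name, arguments) from a chat completion message, whether the
    parser populated tool_calls or left the call as JSON in content."""
    if message.get("tool_calls"):
        fn = message["tool_calls"][0]["function"]
        # In the OpenAI format, arguments is itself a JSON string.
        return fn["name"], json.loads(fn["arguments"])
    try:
        call = json.loads(message.get("content") or "")
    except json.JSONDecodeError:
        return None
    if call.get("type") == "function":
        fn = call["function"]
        return fn["name"], fn.get("parameters") or fn.get("arguments")
    return None

# Demo with a well-formed version of the content-embedded call shown above.
demo = {
    "tool_calls": [],
    "content": '{"type": "function", "function": '
               '{"name": "get_weather", "parameters": {"city": "Dublin"}}}',
}
print(extract_tool_call(demo))  # ('get_weather', {'city': 'Dublin'})
```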

Chapter 3. Tool call parser type reference

Tool call parsers determine how AI Inference Server interprets and extracts tool calling information from the model output. Each model has specific output formats for tool calls. You must configure the appropriate parser that matches the expected format for the model. The tool call parser is specified using the --tool-call-parser flag when launching AI Inference Server. AI Inference Server supports multiple tool call parsers for different model families and output formats.

Table 3.1. Tool call parsers

Parser / Description / Example models

hermes

For Nous-Hermes models that use the Hermes tool calling format.

  • NousResearch/Hermes-2-Pro-Llama-3-8B
  • NousResearch/Hermes-3-Llama-3.1-8B

mistral

For Mistral models using Mistral’s native tool calling format.

  • mistralai/Mistral-7B-Instruct-v0.3
  • mistralai/Mixtral-8x7B-Instruct-v0.1

llama3_json

For Llama 3.x models configured to output tool calls in JSON format. When using llama3_json, you typically pair it with a JSON-formatted chat template.

  • meta-llama/Llama-3.2-1B-Instruct
  • meta-llama/Llama-3.1-8B-Instruct

internlm2

For InternLM2 models using InternLM’s tool calling format.

  • internlm/internlm2-chat-7b
  • internlm/internlm2-chat-20b

granite-20b-fc

For IBM Granite function-calling models.

  • ibm-granite/granite-20b-functioncalling

fuyu

For Adept Fuyu models.

  • adept/fuyu-8b

phi3_json

For Microsoft Phi-3 models configured for JSON output.

  • microsoft/Phi-3-mini-4k-instruct
  • microsoft/Phi-3-medium-128k-instruct

jamba

For AI21 Labs Jamba models.

  • ai21labs/Jamba-v0.1

Important

Using an incorrect parser can result in runtime errors such as the following:

  • Failed tool call extraction
  • Malformed tool call requests
  • Errors during inference
  • Unexpected model behavior

Always verify that the parser matches the expected tool calling format for the model.

Chapter 4. Tool choice options reference

You can configure how the language model decides when to call a tool by using the tool_choice parameter.

Red Hat AI Inference Server supports multiple tool choice modes that enable different levels of control over tool calling behavior. You set the tool choice value when you create a request that triggers the model to use the available tools, for example:

from openai import OpenAI

# Point the client at your AI Inference Server endpoint; "tools" is the list
# of tool definitions included with the request.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto",
)

Select the appropriate tool choice mode based on your application requirements:

Table 4.1. Tool choice modes

Mode / Behavior / Use case

auto

The model decides whether to call a tool based on the conversation context and available tools.

General purpose tool calling where the model determines when tools are needed.

required

The model must call at least one of the available tools and cannot respond without making a tool call.

Scenarios where a tool call is mandatory for the task, such as data retrieval or structured output generation, for example, database queries or API interactions.

none

The model does not call any tools and provides a text-based response only.

Disabling tool calling for specific requests while keeping tool definitions available for context.

{"type": "function", "function": {"name": "tool_name"}}

The model must call the specific named tool.

Forcing the model to use a particular tool, useful for structured data extraction or specific workflows.

Important

When using tool_choice="required", ensure that at least one of the available tools is appropriate for the request. If no suitable tool exists, the model might call an inappropriate tool or generate errors.
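
As a sketch, the named-tool mode from Table 4.1 is set by passing the function object as tool_choice. The payload below reuses the get_weather definition from the verification example; the model name matches the one served earlier:

```python
import json

# The same weather tool definition used in the verification request.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Force the model to call get_weather instead of letting it decide ("auto").
payload = {
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is the weather like in Dublin today?"}],
    "tools": tools,
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}
print(json.dumps(payload["tool_choice"]))
```

Sending this body to the /v1/chat/completions endpoint, as in the earlier curl example, guarantees the response contains a get_weather call rather than a free-text answer.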

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.