Chapter 9. Serving and inferencing language models with Podman using AWS Trainium and Inferentia AI accelerators

Serve a large language model and run inference against it with Podman or Docker and Red Hat AI Inference Server on an AWS cloud instance that has AWS Trainium or Inferentia AI accelerators configured.

AWS Inferentia and AWS Trainium are custom-designed machine learning chips from Amazon Web Services (AWS). Red Hat AI Inference Server integrates with these accelerators through the AWS Neuron SDK, providing a path to deploy vLLM-based inference workloads on AWS cloud infrastructure.

Important

AWS Trainium and Inferentia support is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

You have access to an AWS Inf2, Trn1, Trn1n, or Trn2 instance with AWS Neuron drivers configured. See Neuron setup guide.
You have installed Podman or Docker.
You are logged in as a user that has sudo access.
You have access to the registry.redhat.io image registry.
You have a Hugging Face account and have generated a Hugging Face access token.
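
Because the procedure works with either Podman or Docker, a quick check of which tool is installed can save a failed first command. The following is a minimal sketch, not part of the official procedure; the function name `first_available` and the variable `CONTAINER_TOOL` are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Print the first command from the argument list that exists on PATH.
# Returns non-zero if none of the candidates is installed.
first_available() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# CONTAINER_TOOL is a hypothetical variable used only in this sketch.
if CONTAINER_TOOL="$(first_available podman docker)"; then
  echo "Using container tool: ${CONTAINER_TOOL}"
else
  echo "Neither podman nor docker was found on PATH" >&2
fi
```

Podman is listed first so that it is preferred when both tools are present, which matches the commands used in this procedure.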

Procedure

Open a terminal on your AWS host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
Pull the Red Hat AI Inference Server image for Neuron by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0
```

Optional: Verify that the Neuron drivers and devices are available on the host.

Run neuron-ls to verify that Neuron drivers are installed and to view detailed information about the Neuron hardware:

```
neuron-ls
```

Example output

```
instance-type: trn1.2xlarge
instance-id: i-0b29616c0f73dc323
+--------+--------+----------+--------+--------------+----------+------+
| NEURON | NEURON |  NEURON  | NEURON |     PCI      |   CPU    | NUMA |
| DEVICE | CORES  | CORE IDS | MEMORY |     BDF      | AFFINITY | NODE |
+--------+--------+----------+--------+--------------+----------+------+
| 0      | 2      | 0-1      | 32 GB  | 0000:00:1e.0 | 0-7      | -1   |
+--------+--------+----------+--------+--------------+----------+------+
```

Note the number of Neuron cores available. Use this value to set the --tensor-parallel-size argument when you start the container.
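
If you prefer not to read the core count off the table by eye, it can be summed from the NEURON CORES column. The following is a minimal sketch that assumes the table layout shown in the example output; the function name `count_neuron_cores` is hypothetical, and the parsing would need adjusting if your neuron-ls version prints different columns.

```shell
#!/bin/sh
# Sum the NEURON CORES column (third pipe-separated field) over every
# data row of the neuron-ls table. Data rows are the ones whose first
# column is a numeric device index; header and border rows are skipped.
count_neuron_cores() {
  awk -F'|' '
    $2 ~ /^[ ]*[0-9]+[ ]*$/ { total += $3 }
    END { print total + 0 }
  '
}

# Sample input copied from the example output in this procedure.
# On a real host you would run:  neuron-ls | count_neuron_cores
count_neuron_cores <<'EOF'
instance-type: trn1.2xlarge
instance-id: i-0b29616c0f73dc323
+--------+--------+----------+--------+--------------+----------+------+
| NEURON | NEURON |  NEURON  | NEURON |     PCI      |   CPU    | NUMA |
| DEVICE | CORES  | CORE IDS | MEMORY |     BDF      | AFFINITY | NODE |
+--------+--------+----------+--------+--------------+----------+------+
| 0      | 2      | 0-1      | 32 GB  | 0000:00:1e.0 | 0-7      | -1   |
+--------+--------+----------+--------+--------------+----------+------+
EOF
```

For the sample table the function prints 2. On hosts with several Neuron devices, the sum covers all of them, so use it only if you map every device into the container.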

List the Neuron devices:
```
ls /dev/neuron*
```
Example output
```
/dev/neuron0
```
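
On instances that expose more than one Neuron device, each device that the container should use needs its own --device argument. The following is a minimal sketch of generating those arguments from the device listing; the function name `neuron_device_flags` is hypothetical and the snippet only prints the flags, it does not start a container.

```shell
#!/bin/sh
# Print one --device=<path> argument per line for each path given.
neuron_device_flags() {
  for path in "$@"; do
    printf '%s%s\n' '--device=' "$path"
  done
}

# On a Neuron host the glob expands to the devices listed by ls.
# If nothing matches, the shell leaves the pattern unexpanded, so skip it.
for dev in /dev/neuron*; do
  [ -e "$dev" ] || continue
  neuron_device_flags "$dev"
done
```

On the single-device host from the example output, the loop prints `--device=/dev/neuron0`, which is the argument used when starting the container.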

Create a volume for mounting into the container and adjust the permissions so that the container can use it:
```
mkdir -p ./.cache/rhaiis && chmod g+rwX ./.cache/rhaiis
```

Add the HF_TOKEN Hugging Face token to the private.env file.

```
echo "export HF_TOKEN=<huggingface_token>" > private.env
```

Append the HF_HOME variable to the private.env file.
```
echo "export HF_HOME=./.cache/rhaiis" >> private.env
```
Source the private.env file.
```
source private.env
```
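
The private.env file holds an access token in plain text, so you might want to restrict who can read it. The following is a minimal sketch of writing the same two variables to a file that only its owner can read; the function name `write_private_env` and its argument order are hypothetical, and this hardening step is an assumption rather than part of the documented procedure.

```shell
#!/bin/sh
# Write the HF_TOKEN and HF_HOME export lines to a file with mode 600.
# Arguments: <env_file> <token> <cache_dir>
write_private_env() {
  env_file="$1"
  token="$2"
  cache_dir="$3"
  (
    # The subshell keeps the umask change from leaking into the caller.
    umask 077
    {
      printf 'export HF_TOKEN=%s\n' "$token"
      printf 'export HF_HOME=%s\n' "$cache_dir"
    } > "$env_file"
  )
  # umask only affects newly created files, so fix the mode explicitly
  # in case the file already existed with wider permissions.
  chmod 600 "$env_file"
}

# Example call, mirroring the values used in this procedure:
# write_private_env private.env "<huggingface_token>" ./.cache/rhaiis
```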

Start the AI Inference Server container image:

```
sudo podman run -it --rm \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -e HF_HUB_OFFLINE=0 \
  --network=host \
  --device=/dev/neuron0 \
  -p 8000:8000 \
  -v $HOME/.cache/rhaiis:/root/.cache/huggingface \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache:Z \
  registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --no-enable-prefix-caching \
  --port 8000 \
  --tensor-parallel-size 2 \
  --additional-config '{ "override_neuron_config": { "async_mode": false } }'
```

--device=/dev/neuron0: Map the required Neuron device. Adjust based on your model requirements and available Neuron memory.
--no-enable-prefix-caching: Disable prefix caching for Neuron hardware.
--tensor-parallel-size 2: Set --tensor-parallel-size to match the number of Neuron cores being used.
--additional-config '{ "override_neuron_config": { "async_mode": false } }': The --additional-config parameter passes Neuron-specific configuration. Setting async_mode to false is recommended for stability.
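
The server might not accept requests immediately after the container starts, for example while the model is still being downloaded. The following is a minimal sketch of a retry loop that waits for a probe command to succeed; the function name `wait_until` is hypothetical, and the attempt count and delay in the usage comment are arbitrary values chosen for illustration.

```shell
#!/bin/sh
# Retry a probe command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Success: stop polling as soon as the probe exits with status 0.
    "$@" >/dev/null 2>&1 && return 0
    # Sleep only when another attempt will follow.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: poll the OpenAI-compatible model listing on the port used above.
# wait_until 60 10 curl -sf http://localhost:8000/v1/models && echo "server is ready"
```

Passing the probe as arguments keeps the loop independent of curl, so the same function can wrap any other health check you prefer.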

Verification

Check that the AI Inference Server is up. Open a separate terminal tab, and make a model request with the API:

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what color is the wind?"}
    ],
    "max_tokens": 50
  }' | jq
```

Example output

```
{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1755268559,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 38,
    "total_tokens": 75,
    "completion_tokens": 37,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
```
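
When you only need the generated text rather than the full response object, a jq filter can pull it out of the choices array. The following is a minimal sketch based on the response structure shown in the example output; the function name `assistant_reply` is hypothetical, and it assumes that jq is installed, as in the verification command.

```shell
#!/bin/sh
# Read a chat completion response on stdin and print only the text of
# the first choice's assistant message.
assistant_reply() {
  jq -r '.choices[0].message.content'
}

# Example: send the verification request and print just the reply text.
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Briefly, what color is the wind?"}], "max_tokens": 50}' \
#   | assistant_reply
```

The -r option makes jq print the raw string without surrounding JSON quotes.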
