Chapter 9. Serving and inferencing language models with Podman using AWS Trainium and Inferentia AI accelerators
Serve and run inference on a large language model with Podman or Docker and Red Hat AI Inference Server on an AWS cloud instance that has AWS Trainium or Inferentia AI accelerators configured.
AWS Inferentia and AWS Trainium are custom-designed machine learning chips from Amazon Web Services (AWS). Red Hat AI Inference Server integrates with these accelerators through the AWS Neuron SDK, providing a path to deploy vLLM-based inference workloads on AWS cloud infrastructure.
AWS Trainium and Inferentia support is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Prerequisites
- You have access to an AWS Inf2, Trn1, Trn1n, or Trn2 instance with AWS Neuron drivers configured. See Neuron setup guide.
- You have installed Podman or Docker.
- You are logged in as a user that has sudo access.
- You have access to the registry.redhat.io image registry.
- You have a Hugging Face account and have generated a Hugging Face access token.
Procedure
Open a terminal on your AWS host, and log in to registry.redhat.io:
$ podman login registry.redhat.io
Pull the Red Hat AI Inference Server image for Neuron by running the following command:
$ podman pull registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0
Optional: Verify that the Neuron drivers and devices are available on the host.
Run neuron-ls to verify that Neuron drivers are installed and to view detailed information about the Neuron hardware:
$ neuron-ls
Note the number of Neuron cores available. Use this information to set the --tensor-parallel-size argument when starting the container.
List the Neuron devices:
$ ls /dev/neuron*
Example output:
/dev/neuron0
Create a volume for mounting into the container and adjust the permissions so that the container can use it:
$ mkdir -p ./.cache/rhaiis && chmod g+rwX ./.cache/rhaiis
Add the HF_TOKEN Hugging Face token to the private.env file:
$ echo "export HF_TOKEN=<huggingface_token>" > private.env
Append the HF_HOME variable to the private.env file:
$ echo "export HF_HOME=./.cache/rhaiis" >> private.env
Source the private.env file:
$ source private.env
Start the AI Inference Server container image:
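A minimal sketch of the container start command, assembled from the options described below, might look like the following. The model name, published port, and volume mount path are example assumptions, not values from this chapter; adjust them for your deployment:

```shell
# Sketch only: model name, port, and cache mount path are example assumptions.
podman run --rm -it \
  --device=/dev/neuron0 \
  --env-file private.env \
  -p 8000:8000 \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache \
  registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0 \
  --model meta-llama/Llama-3.2-1B \
  --no-enable-prefix-caching \
  --tensor-parallel-size 2 \
  --additional-config '{ "override_neuron_config": { "async_mode": false } }'
```

Pass additional --device flags if your instance exposes more than one /dev/neuron* device and your --tensor-parallel-size setting requires the extra cores.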
--device=/dev/neuron0 - Map the required Neuron device. Adjust based on your model requirements and available Neuron memory.
--no-enable-prefix-caching - Disable prefix caching for Neuron hardware.
--tensor-parallel-size 2 - Set --tensor-parallel-size to match the number of Neuron cores being used.
--additional-config '{ "override_neuron_config": { "async_mode": false } }' - The --additional-config parameter passes Neuron-specific configuration. Setting async_mode to false is recommended for stability.
Verification
Check that the AI Inference Server is up. Open a separate tab in your terminal, and make a model request with the API.
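A request against the server's OpenAI-compatible completions endpoint might look like the following sketch. The host, port, and model name are assumptions and must match the values you used when starting the container:

```shell
# Sketch only: host, port, and model name are example assumptions.
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B",
    "prompt": "What is the capital of France?",
    "max_tokens": 50
  }'
```

A successful response is a JSON object containing a choices array with the generated text, confirming that the model is loaded and serving requests.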