Chapter 5. Serving and inferencing language models with Podman using Google TPU AI accelerators
Serve and inference a large language model with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.
Prerequisites
- You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured. For more information, see the Google Cloud TPU documentation.
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
Procedure
Open a terminal on your TPU server host, and log in to registry.redhat.io:

$ podman login registry.redhat.io

Pull the Red Hat AI Inference Server image by running the following command:
$ podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1

Optional: Verify that the TPUs are available in the host.
Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:
$ podman run -it --net=host --privileged -e PJRT_DEVICE=TPU --rm --entrypoint /bin/bash registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1

Verify system TPU access and basic operations by running Python code in the container shell prompt.
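For example, the following is a minimal sketch of such a check using PyTorch/XLA. It assumes that the torch and torch_xla Python packages are available in the container image; the devices reported depend on your TPU topology.

import torch
import torch_xla.core.xla_model as xm

# List the XLA (TPU) devices that the PJRT runtime exposes to this process.
# Assumes torch_xla is installed in the container image.
devices = xm.get_xla_supported_devices()
print("Available XLA devices:", devices)

# Run a basic tensor operation on a TPU device to confirm it is usable.
device = xm.xla_device()
x = torch.randn(3, 3, device=device)
y = torch.randn(3, 3, device=device)
print(torch.matmul(x, y).cpu())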
Exit the shell prompt.

$ exit
Create a volume to mount into the container, and adjust its permissions so that the container can use it.
$ mkdir -p ./.cache/rhaiis

$ chmod g+rwX ./.cache/rhaiis
Add the HF_TOKEN Hugging Face token to the private.env file.

$ echo "export HF_TOKEN=<huggingface_token>" > private.env
Append the HF_HOME variable to the private.env file.

$ echo "export HF_HOME=./.cache/rhaiis" >> private.env
Source the private.env file.

$ source private.env
Start the AI Inference Server container image, setting --tensor-parallel-size to match the number of TPUs available on the host.
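The following is a minimal sketch of such an invocation. The container name, the container-side cache path (/opt/app-root/src/.cache/rhaiis), and the model placeholder are illustrative assumptions; it also assumes that the image's entrypoint accepts standard vLLM server arguments such as --model and --tensor-parallel-size. The host cache directory created earlier is mounted into the container, and HF_HOME points at that path inside the container.

# Sketch only: container name, in-container paths, and model placeholder are assumptions; adjust for your environment.
$ podman run --rm -it \
    --name rhaiis-tpu \
    --net=host \
    --privileged \
    -e PJRT_DEVICE=TPU \
    -e HF_TOKEN=$HF_TOKEN \
    -e HF_HOME=/opt/app-root/src/.cache/rhaiis \
    -v "$(pwd)/.cache/rhaiis":/opt/app-root/src/.cache/rhaiis \
    registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1 \
    --model <huggingface_model> \
    --tensor-parallel-size <number_of_tpus>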
Verification
Check that the AI Inference Server is up. Open a separate tab in your terminal, and make a model request with the API.
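A minimal sketch of such a request is shown below. It assumes the server listens on the default vLLM port 8000 on the host (the container runs with --net=host) and exposes the OpenAI-compatible completions endpoint; replace <huggingface_model> with the model you started the server with.

# Sketch only: the port and endpoint follow vLLM defaults and are assumptions here.
$ curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "<huggingface_model>",
          "prompt": "What is the capital of France?",
          "max_tokens": 50
        }' | python3 -m json.tool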