Chapter 10. Serving and inferencing with Podman using CPU (x86_64 AVX2)
Serve and inference a large language model with Podman and Red Hat AI Inference Server running on x86_64 CPUs with AVX2 instruction set support.
With CPU-only inference, you can run Red Hat AI Inference Server workloads on x86_64 CPUs without requiring GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments using smaller language models.
CPU-only inference is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
AVX512 instruction set support is planned for a future release.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with an x86_64 CPU that supports the AVX2 instruction set:
  - Intel Haswell (2013) or newer processors
  - AMD Excavator (2015) or newer processors
- You have a minimum of 16 GB system RAM. 32 GB or more is recommended for larger models.
CPU inference is optimized for smaller models, typically under 3 billion parameters. For larger models or production workloads requiring higher throughput, consider using GPU acceleration.
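As a rough rule of thumb for the sizing guidance above, model weights alone need about two bytes per parameter at 16-bit precision, before the key-value cache and runtime overhead are added. A back-of-envelope sketch (the 1.1-billion-parameter figure matches the TinyLlama model used later in this chapter; the numbers are illustrative, not measured):

```shell
# Illustrative sizing sketch: bf16/fp16 weights use 2 bytes per parameter.
# A 1.1B-parameter model therefore needs roughly 2.2 GB for weights alone,
# which is why 16 GB of system RAM is a comfortable floor for small models.
awk 'BEGIN {
    params = 1.1e9; bytes_per_param = 2
    printf "approx. weight memory: %.1f GB\n", (params * bytes_per_param) / 1e9
}'
```

Quantized models shrink this footprint further, but the two-bytes-per-parameter estimate is a safe planning number for the unquantized checkpoints pulled from Hugging Face.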
Procedure
1. Open a terminal on your server host, and log in to registry.redhat.io:

   $ podman login registry.redhat.io

2. Pull the CPU inference image by running the following command:

   $ podman pull registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0

3. Create a cache directory to mount into the container as a volume, and adjust its permissions so that the container can use it:

   $ mkdir -p rhaiis-cache && chmod g+rwX rhaiis-cache

4. Create or append your Hugging Face token as HF_TOKEN in the private.env file, then source the file:

   $ echo "export HF_TOKEN=<your_HF_token>" > private.env
   $ source private.env

5. Verify that your CPU supports the AVX2 instruction set:

   $ grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not supported"

   Important: If your CPU does not support AVX2, you cannot use CPU inference with Red Hat AI Inference Server.
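Because the container reads the Hugging Face token from the environment, it can help to confirm that sourcing private.env actually exported HF_TOKEN before you start the server. This is an optional pre-flight check, not part of the official procedure:

```shell
# Pre-flight check: the podman run step passes $HF_TOKEN into the
# container, so an empty variable means gated models cannot be downloaded.
if [ -n "$HF_TOKEN" ]; then
    echo "HF_TOKEN is set"
else
    echo "HF_TOKEN is empty -- re-run: source private.env"
fi
```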
6. Start the AI Inference Server container image:

   $ podman run \
       --security-opt=label=disable \
       --shm-size=4g -p 8000:8000 \
       --userns=keep-id:uid=1001 \
       --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
       --env "VLLM_CPU_KVCACHE_SPACE=4" \
       -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
       registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0 \
       --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

   - --security-opt=label=disable: Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled; without this option, the container might fail to start.
   - --shm-size=4g -p 8000:8000: Specifies the shared memory size and the port mapping. Increase --shm-size to 8g if you experience shared memory issues.
   - --userns=keep-id:uid=1001: Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass --user=0, but this is less secure because it runs vLLM as root inside the container.
   - --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN": Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
   - --env "VLLM_CPU_KVCACHE_SPACE=4": Allocates 4 GB for the CPU key-value cache. Increase this value for larger models or longer context lengths. The default is 4 GB.
   - -v ./rhaiis-cache:/opt/app-root/src/.cache:Z: Mounts the cache directory with the SELinux context label. The :Z suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
   - --model TinyLlama/TinyLlama-1.1B-Chat-v1.0: Specifies the Hugging Face model to serve. For CPU inference, use smaller models (under 3 billion parameters) for optimal performance.
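The VLLM_CPU_KVCACHE_SPACE value bounds how much context the server can cache. A rough way to reason about it: each cached token costs 2 (keys and values) × layers × kv_heads × head_dim × bytes-per-value. The architecture numbers below are illustrative placeholders, not values read from any model config:

```shell
# Illustrative KV-cache arithmetic (hypothetical architecture values):
#   per-token bytes = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (bf16)
awk 'BEGIN {
    layers = 22; kv_heads = 4; head_dim = 64   # placeholder values
    per_token = 2 * layers * kv_heads * head_dim * 2
    cache_gb = 4                               # VLLM_CPU_KVCACHE_SPACE
    printf "per-token KV cost: %d bytes\n", per_token
    printf "tokens that fit in %d GB: %d\n", cache_gb,
           (cache_gb * 1024 * 1024 * 1024) / per_token
}'
```

If you raise the context length or serve a deeper model, scale VLLM_CPU_KVCACHE_SPACE accordingly; system RAM must cover the model weights plus this cache.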
Verification
In a separate terminal tab, make a request to the model with the API:

$ curl -X POST -H "Content-Type: application/json" \
    -d '{ "prompt": "What is the capital of France?", "max_tokens": 50 }' \
    http://<your_server_ip>:8000/v1/completions | jq

The model returns a valid JSON response answering your question.
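The jq at the end of the request pretty-prints the whole response; the generated text itself sits under .choices[0].text in the OpenAI-style schema the server returns. Demonstrated here on a canned sample (the JSON below is a hand-written stand-in, not real server output):

```shell
# Extract just the completion text from an OpenAI-style response body.
# The sample payload is illustrative; a live response carries more fields.
sample='{"id":"cmpl-1","choices":[{"index":0,"text":" Paris."}]}'
echo "$sample" | jq -r '.choices[0].text'
```

The same filter can be appended to the curl command above once the server is running, so scripts can consume the answer directly instead of the full JSON envelope.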