Chapter 10. Serving and inferencing with Podman using CPU (x86_64 AVX2)
Serve and inference a large language model with Podman and Red Hat AI Inference Server running on x86_64 CPUs with AVX2 instruction set support.
With CPU-only inference, you can run Red Hat AI Inference Server workloads on x86_64 CPUs without requiring GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments using smaller language models.
CPU-only inference is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
AVX512 instruction set support is planned for a future release.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with an x86_64 CPU that supports the AVX2 instruction set:
  - Intel Haswell (2013) or newer processors
  - AMD Excavator (2015) or newer processors
- You have a minimum of 16 GB system RAM. 32 GB or more is recommended for larger models.
CPU inference is optimized for smaller models, typically under 3 billion parameters. For larger models or production workloads requiring higher throughput, consider using GPU acceleration.
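As a rough rule of thumb for the sizing guidance above, model weights alone need about two bytes per parameter at 16-bit precision, before the key-value cache and runtime overhead are added. A back-of-envelope sketch (the 1.1-billion-parameter figure matches the TinyLlama model used later in this chapter; the numbers are illustrative, not measured):

```shell
# Illustrative sizing sketch: bf16/fp16 weights use 2 bytes per parameter.
# A 1.1B-parameter model therefore needs roughly 2.2 GB for weights alone,
# which is why 16 GB of system RAM is a comfortable floor for small models.
awk 'BEGIN {
    params = 1.1e9; bytes_per_param = 2
    printf "approx. weight memory: %.1f GB\n", (params * bytes_per_param) / 1e9
}'
```

Quantized models shrink this footprint further, but the two-bytes-per-parameter estimate is a safe planning number for the unquantized checkpoints pulled from Hugging Face.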
Procedure
1. Open a terminal on your server host, and log in to registry.redhat.io:

   $ podman login registry.redhat.io

2. Pull the CPU inference image by running the following command:

   $ podman pull registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0

3. Create a cache directory to mount into the container as a volume, and adjust its permissions so that the container can use it:

   $ mkdir -p rhaiis-cache && chmod g+rwX rhaiis-cache

4. Create or append your Hugging Face token as HF_TOKEN in the private.env file, then source the file:

   $ echo "export HF_TOKEN=<your_HF_token>" > private.env
   $ source private.env

5. Verify that your CPU supports the AVX2 instruction set:

   $ grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not supported"

   Important: If your CPU does not support AVX2, you cannot use CPU inference with Red Hat AI Inference Server.
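Because the container reads the Hugging Face token from the environment, it can help to confirm that sourcing private.env actually exported HF_TOKEN before you start the server. This is an optional pre-flight check, not part of the official procedure:

```shell
# Pre-flight check: the podman run step passes $HF_TOKEN into the
# container, so an empty variable means gated models cannot be downloaded.
if [ -n "$HF_TOKEN" ]; then
    echo "HF_TOKEN is set"
else
    echo "HF_TOKEN is empty -- re-run: source private.env"
fi
```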
6. Start the AI Inference Server container image:

   $ podman run \
       --security-opt=label=disable \
       --shm-size=4g -p 8000:8000 \
       --userns=keep-id:uid=1001 \
       --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
       --env "VLLM_CPU_KVCACHE_SPACE=4" \
       -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
       registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0 \
       --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

   - --security-opt=label=disable: Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled; without this option, the container might fail to start.
   - --shm-size=4g -p 8000:8000: Specifies the shared memory size and the port mapping. Increase --shm-size to 8g if you experience shared memory issues.
   - --userns=keep-id:uid=1001: Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass --user=0, but this is less secure because it runs vLLM as root inside the container.
   - --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN": Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
   - --env "VLLM_CPU_KVCACHE_SPACE=4": Allocates 4 GB for the CPU key-value cache. Increase this value for larger models or longer context lengths. The default is 4 GB.
   - -v ./rhaiis-cache:/opt/app-root/src/.cache:Z: Mounts the cache directory with the SELinux context label. The :Z suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
   - --model TinyLlama/TinyLlama-1.1B-Chat-v1.0: Specifies the Hugging Face model to serve. For CPU inference, use smaller models (under 3 billion parameters) for optimal performance.
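The VLLM_CPU_KVCACHE_SPACE value bounds how much context the server can cache. A rough way to reason about it: each cached token costs 2 (keys and values) × layers × kv_heads × head_dim × bytes-per-value. The architecture numbers below are illustrative placeholders, not values read from any model config:

```shell
# Illustrative KV-cache arithmetic (hypothetical architecture values):
#   per-token bytes = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (bf16)
awk 'BEGIN {
    layers = 22; kv_heads = 4; head_dim = 64   # placeholder values
    per_token = 2 * layers * kv_heads * head_dim * 2
    cache_gb = 4                               # VLLM_CPU_KVCACHE_SPACE
    printf "per-token KV cost: %d bytes\n", per_token
    printf "tokens that fit in %d GB: %d\n", cache_gb,
           (cache_gb * 1024 * 1024 * 1024) / per_token
}'
```

If you raise the context length or serve a deeper model, scale VLLM_CPU_KVCACHE_SPACE accordingly; system RAM must cover the model weights plus this cache.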
Verification
In a separate terminal tab, make a request to the model with the API:

$ curl -X POST -H "Content-Type: application/json" \
    -d '{ "prompt": "What is the capital of France?", "max_tokens": 50 }' \
    http://<your_server_ip>:8000/v1/completions | jq

The model returns a valid JSON response answering your question.
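The jq at the end of the request pretty-prints the whole response; the generated text itself sits under .choices[0].text in the OpenAI-style schema the server returns. Demonstrated here on a canned sample (the JSON below is a hand-written stand-in, not real server output):

```shell
# Extract just the completion text from an OpenAI-style response body.
# The sample payload is illustrative; a live response carries more fields.
sample='{"id":"cmpl-1","choices":[{"index":0,"text":" Paris."}]}'
echo "$sample" | jq -r '.choices[0].text'
```

The same filter can be appended to the curl command above once the server is running, so scripts can consume the answer directly instead of the full JSON envelope.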