Chapter 11. Inference language models on x86_64 CPUs
With CPU inference, you can run Red Hat AI Inference workloads on x86_64 processors without dedicated GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments by using smaller language models.
Inference serving with AI Inference on x86_64 CPU is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
CPU inference is useful where AI accelerator hardware is unavailable or impractical, such as development and testing environments, edge deployments, and small-scale setups. CPU inference works best with smaller models (under 3 billion parameters), delivers lower throughput than GPU-accelerated inference, and can demand substantial system memory (32 GB RAM or more for 1–3B parameter models). For high-throughput or large-model production workloads, GPU acceleration via NVIDIA CUDA or AMD ROCm is the recommended path.
The AI Inference CPU container image supports multiple instruction set architectures (ISAs) in a single build. The container automatically detects and uses the best available instruction set for your processor.
| Instruction set | Intel support | AMD support |
|---|---|---|
| AVX2 (minimum) | Haswell (2013) or newer | Excavator (2015) or newer |
| AVX512 | Skylake-X (2017) or newer | Zen 4 (2022) or newer |
| AVX512 Advanced Matrix Extensions (AMX) | Sapphire Rapids (2023) or newer | Not supported |
AVX512 AMX provides hardware acceleration for matrix operations, which can improve inference performance for supported models. Because AMD processors do not support AMX, Intel Sapphire Rapids or newer processors deliver the best CPU inference performance.
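The preference order in the table can be illustrated with a short Python sketch that reads the CPU flags on a Linux host and reports the most capable instruction set it finds. This is not the container's actual detection logic; the flag names (`avx2`, `avx512f`, `amx_tile`) are the ones Linux exposes in `/proc/cpuinfo`, and treating each single flag as sufficient evidence of support is an assumption made for illustration.

```python
from pathlib import Path
from typing import Optional, Set

# Preference order from the table: AMX first, then AVX512, then AVX2.
# Assumption: one /proc/cpuinfo flag per instruction set is enough to decide.
ISA_PREFERENCE = [
    ("AVX512 AMX", "amx_tile"),
    ("AVX512", "avx512f"),
    ("AVX2", "avx2"),
]


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def best_isa(flags: Set[str]) -> Optional[str]:
    """Return the most capable supported instruction set, or None."""
    for name, flag in ISA_PREFERENCE:
        if flag in flags:
            return name
    return None


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        result = best_isa(parse_cpu_flags(cpuinfo.read_text()))
        print(result or "AVX2 not supported")
```

A return value of `None` means the host does not meet the AVX2 minimum.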
11.1. Serve a model with Podman by using CPU on x86_64
Serve and inference a large language model with Podman and Red Hat AI Inference running on x86_64 CPUs.
With CPU-only inference, you can run Red Hat AI Inference workloads on x86_64 CPUs without requiring GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments by using smaller language models. The CPU container image supports AVX2, AVX512, and AVX512 Advanced Matrix Extensions (AMX) instruction sets in a single build by automatically detecting and using the best available instruction set for your processor.
Inference serving with AI Inference on x86_64 CPU is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Prerequisites
- You have installed Podman or Docker.
- You have logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with an x86_64 CPU that supports at least the AVX2 instruction set:
  - AVX2, minimum requirement: Intel Haswell (2013) or newer, or Advanced Micro Devices (AMD) Excavator (2015) or newer
  - AVX512: Intel Skylake-X (2017) or newer, or AMD Zen 4 (2022) or newer
  - AVX512 AMX: Intel Sapphire Rapids (2023) or newer

  The container automatically detects and uses the best available instruction set for your processor.
- You have a minimum of 16 GB system RAM. Red Hat recommends 32 GB or more for larger models.
CPU inference works best for smaller models, typically under 3 billion parameters. For larger models or production workloads requiring higher throughput, consider using GPU acceleration.
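The memory prerequisite can be checked on a Linux host with a small Python sketch that parses `MemTotal` from `/proc/meminfo`. The 16 GB and 32 GB thresholds come from the prerequisites; comparing in decimal gigabytes is an assumption, chosen because the kernel reports slightly less than the installed capacity.

```python
from pathlib import Path

MIN_RAM_GB = 16          # minimum from the prerequisites
RECOMMENDED_RAM_GB = 32  # recommendation for larger models


def mem_total_gb(meminfo_text: str) -> float:
    """Parse MemTotal (reported in KiB) and return decimal gigabytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib * 1024 / 1e9
    raise ValueError("MemTotal not found")


def ram_verdict(total_gb: float) -> str:
    """Classify total RAM against the documented thresholds."""
    if total_gb < MIN_RAM_GB:
        return "below minimum"
    if total_gb < RECOMMENDED_RAM_GB:
        return "meets minimum"
    return "meets recommendation"


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        total = mem_total_gb(meminfo.read_text())
        print(f"{total:.1f} GB: {ram_verdict(total)}")
```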
Procedure
Open a terminal on your server host and log in to registry.redhat.io:

    $ podman login registry.redhat.io

Pull the CPU inference image by running the following command:

    $ podman pull registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0

Create a volume and mount it into the container. Adjust the container permissions so that the container can use it.

    $ mkdir -p rhaii-cache && chmod g+rwX rhaii-cache

Create or append your HF_TOKEN Hugging Face token to the private.env file. Source the private.env file.

    $ echo "export HF_TOKEN=<your_HF_token>" > private.env
    $ source private.env

Verify your CPU instruction set support:

    $ grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not supported"
    $ grep -q avx512f /proc/cpuinfo && echo "AVX512 supported" || echo "AVX512 not supported"
    $ grep -q amx_tile /proc/cpuinfo && echo "AVX512 AMX supported" || echo "AVX512 AMX not supported"

Important: AVX2 is the minimum requirement. If your CPU does not support AVX2, you cannot use CPU inference with Red Hat AI Inference. AVX512 and AVX512 AMX improve performance when available, but are not required.

Start the AI Inference container image.

    $ podman run --rm -it \
      --security-opt=label=disable \
      --shm-size=4g -p 8000:8000 \
      --userns=keep-id:uid=1001 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      --env "VLLM_CPU_KVCACHE_SPACE=4" \
      --env "LD_PRELOAD=/usr/lib64/libomp.so" \
      -v ./rhaii-cache:/opt/app-root/src/.cache:Z \
      registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0 \
      --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

- --security-opt=label=disable: Disables SELinux label relabeling for volume mounts. Required for systems with SELinux enabled. Without this option, the container might fail to start.
- --shm-size=4g -p 8000:8000: Specifies the shared memory size and port mapping. Increase --shm-size to 8 GB if you experience shared memory issues.
- --userns=keep-id:uid=1001: Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure because it runs vLLM as root inside the container.
- --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN": Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
- --env "VLLM_CPU_KVCACHE_SPACE=4": Allocates 4 GB for the CPU key-value cache. This value is suitable for smaller models such as TinyLlama. For larger models such as Llama 3.1 8B, increase this value to 8 GB or more depending on your model size, context length, and available RAM.
- --env "LD_PRELOAD=/usr/lib64/libomp.so": Preloads the OpenMP library for optimal CPU inference performance. Without this setting, you might experience degraded throughput, increased latency, and instability, especially with more than 16 threads. The default LD_PRELOAD for this image sets jemalloc, which is not supported for vLLM inference. Setting LD_PRELOAD to libomp.so overrides that default. If you require PyArrow usage in this image, set LD_PRELOAD=/usr/lib64/libjemalloc.so.2:/usr/lib64/libomp.so, but expect degraded vLLM performance.
- -v ./rhaii-cache:/opt/app-root/src/.cache:Z: Mounts the cache directory with SELinux context. The :Z suffix is required for systems with SELinux enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
- --model TinyLlama/TinyLlama-1.1B-Chat-v1.0: Specifies the Hugging Face model to serve. For CPU inference, use smaller models (under 3B parameters) for optimal performance.

Note: At startup, the container logs a warning:

    WARNING: libtcmalloc is not found in LD_PRELOAD. This may slow down the performance.

You can ignore this warning for 3.4 because tcmalloc is not included in the CPU image. The LD_PRELOAD setting for libomp.so delivers the best available performance for this release.
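To get a feel for what a given `VLLM_CPU_KVCACHE_SPACE` value buys, the following Python sketch applies the usual transformer key-value cache arithmetic: each token stores one key and one value vector per layer and per KV head. The model shapes used here (TinyLlama: 22 layers, 4 KV heads, head dimension 64; Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128), the 2-byte element size, and reading the setting as GiB are all assumptions for illustration; check the model's `config.json` and your runtime's data type before relying on the numbers.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache needed for one token (keys plus values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(cache_gib: float, bytes_per_token: int) -> int:
    """Total tokens, summed across concurrent requests, that fit in the cache."""
    return int(cache_gib * 1024**3) // bytes_per_token


# Assumed model shapes; verify against each model's config.json.
tinyllama = kv_bytes_per_token(num_layers=22, num_kv_heads=4, head_dim=64)
llama_8b = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

print(tinyllama, tokens_that_fit(4, tinyllama))  # 22528 190650
print(llama_8b, tokens_that_fit(4, llama_8b))    # 131072 32768
print(llama_8b, tokens_that_fit(8, llama_8b))    # 131072 65536
```

Under these assumptions, the same 4 GiB cache holds roughly six times fewer tokens for the 8B model than for TinyLlama, which is why larger models and longer contexts call for a larger value.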
Verification
In a separate terminal tab, make a request to the model with the API.
    $ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
      }' http://<your_server_ip>:8000/v1/completions | jq

The model returns a valid JSON response.
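The same request can be sent from a script. The following Python sketch uses only the standard library and mirrors the curl call; the helper names are hypothetical, and it assumes the response follows the OpenAI-style completions shape with the generated text under `choices[0].text`.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, max_tokens=50, model=None):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    if model is not None:
        payload["model"] = model
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body):
    """Extract the generated text from a completions response body."""
    return json.loads(response_body)["choices"][0]["text"]


def complete(base_url, prompt, **kwargs):
    """Send the request and return the first completion's text."""
    request = build_completion_request(base_url, prompt, **kwargs)
    # CPU inference can be slow, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=120) as response:
        return first_completion_text(response.read())
```

For example, `complete("http://<your_server_ip>:8000", "What is the capital of France?")` returns the generated text as a string.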
11.2. Serve and inference language models on OpenShift Container Platform using CPU inference
Deploy a language model on OpenShift Container Platform by using Red Hat AI Inference with CPU inference. CPU inference provides a cost-effective option for development, testing, and small-scale deployments without GPU hardware.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have access to worker nodes with x86_64 CPUs that support at least the AVX2 instruction set.
- You have a Hugging Face account and have generated a Hugging Face access token.
CPU inference does not require the Node Feature Discovery (NFD) Operator or a GPU Operator. The CPU container supports AVX2, AVX512, and AVX512 Advanced Matrix Extensions (AMX) instruction sets. The container automatically detects and uses the best available instruction set.
Procedure
Create a namespace for the deployment:

    $ oc new-project rhaii-cpu

Create a Secret custom resource (CR) for the Hugging Face token:

    $ export HF_TOKEN=<your_HF_token>
    $ oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n rhaii-cpu

Create a Docker secret so that the cluster can download the Red Hat AI Inference image from the container registry:

    $ oc create secret generic docker-secret \
      --from-file=.dockerconfigjson=$HOME/.docker/config.json \
      --type=kubernetes.io/dockerconfigjson -n rhaii-cpu

Note: If you authenticated with podman login instead of docker login, your credentials are stored in $XDG_RUNTIME_DIR/containers/auth.json. Use that path instead of $HOME/.docker/config.json.

Create a PersistentVolumeClaim (PVC) custom resource (CR) for model storage. Save the following YAML to a file named pvc.yaml:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-cache
      namespace: rhaii-cpu
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi

Apply the PVC:

    $ oc apply -f pvc.yaml

Create a Deployment custom resource (CR) for CPU inference. Save the following YAML to a file named deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tinyllama-cpu
      namespace: rhaii-cpu
      labels:
        app: tinyllama-cpu
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: tinyllama-cpu
      template:
        metadata:
          labels:
            app: tinyllama-cpu
        spec:
          imagePullSecrets:
            - name: docker-secret
          volumes:
            - name: model-volume
              persistentVolumeClaim:
                claimName: model-cache
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "4Gi"
          containers:
            - name: vllm-cpu
              image: 'registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0'
              imagePullPolicy: IfNotPresent
              env:
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: HF_TOKEN
                - name: HF_HUB_OFFLINE
                  value: '0'
                - name: VLLM_CPU_KVCACHE_SPACE
                  value: '4'  # Increase to 8+ for larger models like Llama 3.1 8B
                - name: LD_PRELOAD
                  value: '/usr/lib64/libomp.so'
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=TinyLlama/TinyLlama-1.1B-Chat-v1.0'
              ports:
                - containerPort: 8000
                  protocol: TCP
              resources:
                limits:
                  cpu: '<cpu_limit>'
                  memory: '<memory_limit>'
                requests:
                  cpu: '<cpu_request>'
                  memory: '<memory_request>'
              volumeMounts:
                - name: model-volume
                  mountPath: /opt/app-root/src/.cache
                - name: shm
                  mountPath: /dev/shm
          restartPolicy: Always

where:

- <cpu_request>: Specifies the minimum CPU cores for the container. Use 4 as a starting value for TinyLlama.
- <cpu_limit>: Specifies the maximum CPU cores. Use 8 as a starting value.
- <memory_request>: Specifies the minimum memory. Use 8Gi as a starting value for TinyLlama.
- <memory_limit>: Specifies the maximum memory. Use 16Gi as a starting value.

Note: Adjust resource values based on your model size, context length, and expected concurrent requests.

For more information, see Validated models for x86_64 CPU inference serving.

Note: The VLLM_CPU_KVCACHE_SPACE value of 4 is suitable for smaller models such as TinyLlama. Adjust this value based on your model size, desired context length as set by max-model-len, and expected number of concurrent requests. For larger models such as Llama 3.1 8B or workloads with longer context lengths and higher concurrency, increase this value to 8 or more depending on your available RAM.

Apply the deployment:

    $ oc apply -f deployment.yaml

Note: The container logs a warning at startup:

    WARNING: libtcmalloc is not found in LD_PRELOAD. This may slow down the performance.

You can ignore this warning for 3.4 because tcmalloc is not included in the CPU image. The LD_PRELOAD setting for libomp.so delivers the best available performance for this release.
Watch the deployment to verify that it succeeds:

    $ oc get deployment -n rhaii-cpu --watch

Example output

    NAME            READY   UP-TO-DATE   AVAILABLE   AGE
    tinyllama-cpu   0/1     1            0           10s
    tinyllama-cpu   1/1     1            1           45s

Create a Service CR for the model inference. Save the following YAML to a file named service.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: tinyllama-cpu
      namespace: rhaii-cpu
    spec:
      selector:
        app: tinyllama-cpu
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000

Apply the service:

    $ oc apply -f service.yaml

Create a Route CR to expose the model. Save the following YAML to a file named route.yaml:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: tinyllama-cpu
      namespace: rhaii-cpu
    spec:
      to:
        kind: Service
        name: tinyllama-cpu
      port:
        targetPort: 80

Note: This route exposes the inference endpoint over plain HTTP. For production deployments, add TLS edge termination to encrypt prompts and model responses in transit.
Apply the route:

    $ oc apply -f route.yaml

Get the URL for the exposed route:

    $ oc get route tinyllama-cpu -n rhaii-cpu -o jsonpath='{.spec.host}'

Example output

    tinyllama-cpu-rhaii-cpu.apps.example.com
Verification
Query the model to verify the deployment:
    $ curl -X POST http://<route_hostname>/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "What is the capital of France?",
        "max_tokens": 50
      }'

The model returns a valid JSON response.
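A query can fail if it arrives before the server has finished downloading and loading the model. As a hedged sketch, the following Python helper polls the OpenAI-compatible `/v1/models` endpoint through the route until the expected model ID is listed. The function names, retry count, and delay are hypothetical choices for illustration, not values from the product documentation.

```python
import json
import time
import urllib.error
import urllib.request


def served_model_ids(models_body):
    """Return the model IDs listed in a /v1/models response body."""
    return [entry.get("id") for entry in json.loads(models_body).get("data", [])]


def fetch_models(base_url, timeout=5):
    """GET /v1/models and return the raw response body."""
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def wait_for_model(base_url, model_id, attempts=30, delay_s=10,
                   fetch=fetch_models, sleep=time.sleep):
    """Poll until model_id is served; return False if attempts run out."""
    for _ in range(attempts):
        try:
            if model_id in served_model_ids(fetch(base_url)):
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # Endpoint not reachable yet, or the body was not JSON.
        sleep(delay_s)
    return False
```

For example, `wait_for_model("http://<route_hostname>", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")` returns `True` once the model is listed. The `fetch` and `sleep` parameters exist so that the loop can be exercised without a running cluster.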