Chapter 11. Inference language models on x86_64 CPUs

With CPU inference, you can run Red Hat AI Inference workloads on x86_64 processors without dedicated GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments by using smaller language models.

Important

CPU inference on x86_64 is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

CPU inference is useful where AI accelerator hardware is unavailable or impractical, such as development and testing environments, edge deployments, and small-scale setups. It works best with smaller models (under 3 billion parameters), delivers lower throughput than GPU-accelerated inference, and can demand substantial system memory (32 GB of RAM or more for models with 1 to 3 billion parameters). For high-throughput or large-model production workloads, use GPU acceleration with NVIDIA CUDA or AMD ROCm.

The AI Inference CPU container image supports multiple instruction set architectures (ISAs) in a single build. The container automatically detects and uses the best available instruction set for your processor.

Table 11.1. Supported instruction set architectures

| Instruction set | Intel support | AMD support |
| --- | --- | --- |
| AVX2 (minimum) | Haswell (2013) or newer | Excavator (2015) or newer |
| AVX512 | Skylake-X (2017) or newer | Zen 4 (2022) or newer |
| AVX512 Advanced Matrix Extensions (AMX) | Sapphire Rapids (2023) or newer | Not supported |

Note

AVX512 AMX provides hardware acceleration for matrix operations, which can improve inference performance for supported models. Because AMD processors do not support AMX, Intel Sapphire Rapids or newer processors deliver the best CPU inference performance.

11.1. Serve a model with Podman by using CPU on x86_64

Serve a large language model and run inference against it with Podman and Red Hat AI Inference on x86_64 CPUs.

With CPU-only inference, you can run Red Hat AI Inference workloads on x86_64 CPUs without requiring GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments by using smaller language models. The CPU container image supports AVX2, AVX512, and AVX512 Advanced Matrix Extensions (AMX) instruction sets in a single build by automatically detecting and using the best available instruction set for your processor.

Important

Inference serving with AI Inference on x86_64 CPU is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

You have installed Podman or Docker.
You have logged in as a user with sudo access.
You have access to registry.redhat.io and have logged in.
You have a Hugging Face account and have generated a Hugging Face access token.
You have access to a Linux server with an x86_64 CPU that supports at least the AVX2 instruction set:
- AVX2, minimum requirement: Intel Haswell (2013) or newer, or Advanced Micro Devices (AMD) Excavator (2015) or newer
- AVX512: Intel Skylake-X (2017) or newer, or AMD Zen 4 (2022) or newer
- AVX512 AMX: Intel Sapphire Rapids (2023) or newer
The container automatically detects and uses the best available instruction set for your processor.
You have a minimum of 16 GB system RAM. Red Hat recommends 32 GB or more for larger models.
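
To check the memory prerequisite before you pull the image, a minimal sketch such as the following can read the total memory that the kernel reports. The helper names mem_total_gib and check_min_ram are illustrative and are not part of Red Hat AI Inference. The sketch rounds up to whole GiB because the kernel typically reports slightly less than the installed capacity.

```shell
# Print MemTotal from a meminfo-style file, rounded up to whole GiB.
# mem_total_gib is a hypothetical helper; MemTotal is reported in kB.
mem_total_gib() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { printf "%d\n", ($2 + 1048575) / 1048576 }' "$meminfo"
}

# Compare total memory against a minimum given in GiB.
# check_min_ram is a hypothetical helper; it returns 1 when memory is too low.
check_min_ram() {
    local min_gib="$1" meminfo="${2:-/proc/meminfo}"
    local total
    total="$(mem_total_gib "$meminfo")"
    if [ "$total" -ge "$min_gib" ]; then
        echo "OK: ${total} GiB RAM (minimum ${min_gib} GiB)"
    else
        echo "Insufficient: ${total} GiB RAM (minimum ${min_gib} GiB)"
        return 1
    fi
}
```

For example, check_min_ram 16 checks the documented minimum, and check_min_ram 32 checks the recommended value for larger models.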

Note

CPU inference works best for smaller models, typically under 3 billion parameters. For larger models or production workloads requiring higher throughput, consider using GPU acceleration.

Procedure

Open a terminal on your server host and log in to registry.redhat.io:
```
$ podman login registry.redhat.io
```
Pull the CPU inference image by running the following command:
```
$ podman pull registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0
```
Create a local cache directory to mount into the container as a volume, and adjust its permissions so that the container can write to it.
```
$ mkdir -p rhaii-cache && chmod g+rwX rhaii-cache
```
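Write your Hugging Face token to the private.env file as HF_TOKEN, and then source the file. The > redirection in the following command overwrites an existing private.env file; use >> instead to append to it.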
```
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
```
```
$ source private.env
```

Verify your CPU instruction set support:

$ grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not supported"
$ grep -q avx512f /proc/cpuinfo && echo "AVX512 supported" || echo "AVX512 not supported"
$ grep -q amx_tile /proc/cpuinfo && echo "AVX512 AMX supported" || echo "AVX512 AMX not supported"

Important

AVX2 is the minimum requirement. If your CPU does not support AVX2, you cannot use CPU inference with Red Hat AI Inference. AVX512 and AVX512 AMX improve performance when available, but are not required.
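
The three checks can also be combined into a single report of the highest tier that the host advertises. The following is a minimal sketch: best_isa is a hypothetical helper name, and it only inspects the same CPU flags as the grep commands above. It does not reproduce how the container itself selects an instruction set.

```shell
# Print the highest instruction-set tier found in a cpuinfo-style file.
# best_isa is a hypothetical helper; the flag names (avx2, avx512f, amx_tile)
# are the ones used by the individual grep checks.
best_isa() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    if grep -qw amx_tile "$cpuinfo"; then
        echo "AVX512 AMX"
    elif grep -qw avx512f "$cpuinfo"; then
        echo "AVX512"
    elif grep -qw avx2 "$cpuinfo"; then
        echo "AVX2"
    else
        # Below the documented minimum for CPU inference.
        echo "unsupported"
        return 1
    fi
}
```

Calling best_isa with no argument reads /proc/cpuinfo. A nonzero return status means that the AVX2 minimum is not met.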

Start the AI Inference container image.
```
$ podman run --rm -it \
--security-opt=label=disable \
--shm-size=4g -p 8000:8000 \
--userns=keep-id:uid=1001 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
--env "VLLM_CPU_KVCACHE_SPACE=4" \
--env "LD_PRELOAD=/usr/lib64/libomp.so" \
-v ./rhaii-cache:/opt/app-root/src/.cache:Z \
registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
- --security-opt=label=disable: Disables SELinux label separation for the container. Required for systems with SELinux enabled. Without this option, the container might fail to start.
- --shm-size=4g -p 8000:8000: Sets the shared memory size and maps container port 8000 to host port 8000. If you experience shared memory issues, increase the shared memory size, for example --shm-size=8g.
- --userns=keep-id:uid=1001: Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure because it runs vLLM as root inside the container.
- --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN": Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
- --env "VLLM_CPU_KVCACHE_SPACE=4": Allocates 4 GB for the CPU key-value cache. This value is suitable for smaller models such as TinyLlama. For larger models such as Llama 3.1 8B, increase this value to 8 GB or more depending on your model size, context length, and available RAM.
- --env "LD_PRELOAD=/usr/lib64/libomp.so": Preloads the OpenMP library for optimal CPU inference performance. Without this setting, you might experience degraded throughput, increased latency, and instability, especially with more than 16 threads. The default LD_PRELOAD for this image sets jemalloc, which is not supported for vLLM inference. Setting LD_PRELOAD to libomp.so overrides that default. If you require PyArrow usage in this image, set LD_PRELOAD=/usr/lib64/libjemalloc.so.2:/usr/lib64/libomp.so, but expect degraded vLLM performance.
- -v ./rhaii-cache:/opt/app-root/src/.cache:Z: Mounts the cache directory with SELinux context. The :Z suffix is required for systems with SELinux enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
- --model TinyLlama/TinyLlama-1.1B-Chat-v1.0: Specifies the Hugging Face model to serve. For CPU inference, use smaller models (under 3B parameters) for optimal performance.
Note
At startup, the container logs a warning: WARNING: libtcmalloc is not found in LD_PRELOAD. This may slow down the performance. You can ignore this warning for 3.4 because tcmalloc is not included in the CPU image. The LD_PRELOAD setting for libomp.so delivers the best available performance for this release.
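
As a rough guide to choosing VLLM_CPU_KVCACHE_SPACE, the arithmetic behind the key-value cache can be sketched as follows. Each cached token stores one key and one value vector per layer and per key-value head. The helper name kv_cache_tokens is hypothetical. The model shapes in the usage note and the 16-bit cache data type are assumptions, so confirm them in each model's config.json. The estimate treats the setting as GiB and ignores allocator overhead, so read the result as an upper bound.

```shell
# Estimate how many tokens fit in the CPU key-value cache.
# kv_cache_tokens is a hypothetical helper; arguments:
#   $1 cache size in GiB (the VLLM_CPU_KVCACHE_SPACE value)
#   $2 number of transformer layers
#   $3 number of key-value heads
#   $4 head dimension
#   $5 bytes per cached element (2 for a 16-bit data type)
kv_cache_tokens() {
    local gib="$1" layers="$2" kv_heads="$3" head_dim="$4" dtype_bytes="$5"
    # One key vector and one value vector per layer and key-value head.
    local bytes_per_token=$(( 2 * layers * kv_heads * head_dim * dtype_bytes ))
    echo $(( gib * 1024 * 1024 * 1024 / bytes_per_token ))
}
```

Assuming 22 layers, 4 key-value heads, and a head dimension of 64 for TinyLlama, kv_cache_tokens 4 22 4 64 2 prints 190650. Assuming 32 layers, 8 key-value heads, and a head dimension of 128 for Llama 3.1 8B, kv_cache_tokens 8 32 8 128 2 prints 65536, which illustrates why a larger model needs more cache space for the same total context.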

Verification

In a separate terminal tab, make a request to the model with the API.

$ curl -X POST -H "Content-Type: application/json" -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

The model returns a valid JSON response.
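
Downloading and loading the model can take some time on CPU, so a request sent immediately after starting the container might fail. A minimal sketch of a readiness wait is shown below. The helper name wait_for_server is hypothetical, and the /health path in the usage note is an assumption based on the upstream vLLM OpenAI-compatible server; polling /v1/models is an alternative.

```shell
# Poll an HTTP endpoint until it answers successfully or a timeout expires.
# wait_for_server is a hypothetical helper; arguments:
#   $1 URL to poll
#   $2 timeout in seconds (default 300)
#   $3 poll interval in seconds (default 5)
wait_for_server() {
    local url="$1" timeout="${2:-300}" interval="${3:-5}"
    local waited=0
    # curl -f treats HTTP error responses as failures; -s and -o silence output.
    until curl -fs -o /dev/null "$url"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${waited}s waiting for ${url}" >&2
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
    echo "Server is ready: ${url}"
}
```

For example, run wait_for_server "http://<your_server_ip>:8000/health" before sending the completion request.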

11.2. Serve and inference language models on OpenShift Container Platform using CPU inference

Deploy a language model on OpenShift Container Platform by using Red Hat AI Inference with CPU inference. CPU inference provides a cost-effective option for development, testing, and small-scale deployments without GPU hardware.

Prerequisites

You have installed the OpenShift CLI (oc).
You have logged in as a user with cluster-admin privileges.
You have access to worker nodes with x86_64 CPUs that support at least the AVX2 instruction set.
You have a Hugging Face account and have generated a Hugging Face access token.

Note

CPU inference does not require the Node Feature Discovery (NFD) Operator or a GPU Operator. The CPU container supports AVX2, AVX512, and AVX512 Advanced Matrix Extensions (AMX) instruction sets. The container automatically detects and uses the best available instruction set.

Procedure

Create a namespace for the deployment:
```
$ oc new-project rhaii-cpu
```

Create a Secret for the Hugging Face token:

$ export HF_TOKEN=<your_HF_token>
$ oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n rhaii-cpu

Create an image pull secret so that the cluster can pull the Red Hat AI Inference image from the container registry:
```
$ oc create secret generic docker-secret \
    --from-file=.dockerconfigjson=$HOME/.docker/config.json \
    --type=kubernetes.io/dockerconfigjson -n rhaii-cpu
```
Note
If you authenticated with podman login instead of docker login, your credentials are stored in $XDG_RUNTIME_DIR/containers/auth.json. Use that path instead of $HOME/.docker/config.json.
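
To avoid guessing which credentials file exists on your workstation, a small helper can check both locations. This is a minimal sketch: find_registry_auth is a hypothetical name, and checking the Podman path first is an arbitrary choice.

```shell
# Print the first registry credentials file that exists.
# find_registry_auth is a hypothetical helper; the two candidate paths are
# the Podman and Docker locations named in the note above.
find_registry_auth() {
    local candidate
    for candidate in \
        "${XDG_RUNTIME_DIR:-/nonexistent}/containers/auth.json" \
        "$HOME/.docker/config.json"
    do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "No registry credentials file found; log in to the registry first" >&2
    return 1
}

# Example use with the secret creation command from this step:
#   auth_file="$(find_registry_auth)" && oc create secret generic docker-secret \
#       --from-file=.dockerconfigjson="$auth_file" \
#       --type=kubernetes.io/dockerconfigjson -n rhaii-cpu
```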

Create a PersistentVolumeClaim (PVC) for model storage. Save the following YAML to a file named pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaii-cpu
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Apply the PVC:

$ oc apply -f pvc.yaml

Create a Deployment for CPU inference. Save the following YAML to a file named deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tinyllama-cpu
  namespace: rhaii-cpu
  labels:
    app: tinyllama-cpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tinyllama-cpu
  template:
    metadata:
      labels:
        app: tinyllama-cpu
    spec:
      imagePullSecrets:
        - name: docker-secret
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "4Gi"
      containers:
        - name: vllm-cpu
          image: 'registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0'
          imagePullPolicy: IfNotPresent
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: HF_TOKEN
            - name: HF_HUB_OFFLINE
              value: '0'
            - name: VLLM_CPU_KVCACHE_SPACE
              value: '4'  # Increase to 8+ for larger models like Llama 3.1 8B
            - name: LD_PRELOAD
              value: '/usr/lib64/libomp.so'
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=TinyLlama/TinyLlama-1.1B-Chat-v1.0'
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              cpu: '<cpu_limit>'
              memory: '<memory_limit>'
            requests:
              cpu: '<cpu_request>'
              memory: '<memory_request>'
          volumeMounts:
            - name: model-volume
              mountPath: /opt/app-root/src/.cache
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always

where:

<cpu_request>

Specifies the minimum CPU cores for the container. Use 4 as a starting value for TinyLlama.

<cpu_limit>

Specifies the maximum CPU cores. Use 8 as a starting value.

<memory_request>

Specifies the minimum memory. Use 8Gi as a starting value for TinyLlama.

<memory_limit>

Specifies the maximum memory. Use 16Gi as a starting value.

Note

Adjust resource values based on your model size, context length, and expected concurrent requests.

For more information, see Validated models for x86_64 CPU inference serving.
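
If you prefer not to edit the placeholders by hand, they can be substituted on the command line. The following minimal sketch assumes that the manifest contains the four placeholders exactly as written in the Deployment YAML. The helper name fill_resources and the output file name in the usage comment are hypothetical.

```shell
# Replace the resource placeholders in a manifest read from standard input.
# fill_resources is a hypothetical helper; arguments:
#   $1 CPU request, $2 CPU limit, $3 memory request, $4 memory limit
fill_resources() {
    sed -e "s/<cpu_request>/$1/g" \
        -e "s/<cpu_limit>/$2/g" \
        -e "s/<memory_request>/$3/g" \
        -e "s/<memory_limit>/$4/g"
}

# Example with the starting values suggested for TinyLlama:
#   fill_resources 4 8 8Gi 16Gi < deployment.yaml > deployment-filled.yaml
```

If you write the result to a separate file such as deployment-filled.yaml, apply that file instead of deployment.yaml.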

Note

The VLLM_CPU_KVCACHE_SPACE value of 4 is suitable for smaller models such as TinyLlama. Adjust this value based on your model size, the context length set by --max-model-len, and the expected number of concurrent requests. For larger models such as Llama 3.1 8B, or for workloads with longer context lengths and higher concurrency, increase this value to 8 or more depending on your available RAM.

Apply the deployment:

$ oc apply -f deployment.yaml

Note

The container logs a warning at startup: WARNING: libtcmalloc is not found in LD_PRELOAD. This may slow down the performance. You can ignore this warning for 3.4 because tcmalloc is not included in the CPU image. The LD_PRELOAD setting for libomp.so delivers the best available performance for this release.

Watch the deployment to verify that it succeeds:

$ oc get deployment -n rhaii-cpu --watch

Example output

NAME            READY   UP-TO-DATE   AVAILABLE   AGE
tinyllama-cpu   0/1     1            0           10s
tinyllama-cpu   1/1     1            1           45s

Create a Service for the model inference endpoint. Save the following YAML to a file named service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: tinyllama-cpu
  namespace: rhaii-cpu
spec:
  selector:
    app: tinyllama-cpu
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000

Apply the service:

$ oc apply -f service.yaml

Create a Route to expose the model. Save the following YAML to a file named route.yaml:
```
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: tinyllama-cpu
  namespace: rhaii-cpu
spec:
  to:
    kind: Service
    name: tinyllama-cpu
  port:
    targetPort: 8000  # A numeric targetPort refers to the pod port (8000), not the Service port (80)
```
Note
This route exposes the inference endpoint over plain HTTP. For production deployments, add TLS edge termination to encrypt prompts and model responses in transit.
Apply the route:
```
$ oc apply -f route.yaml
```

Get the URL for the exposed route:

$ oc get route tinyllama-cpu -n rhaii-cpu -o jsonpath='{.spec.host}'

Example output

tinyllama-cpu-rhaii-cpu.apps.example.com

Verification

Query the model to verify the deployment:

$ curl -X POST http://<route_hostname>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
      "prompt": "What is the capital of France?",
      "max_tokens": 50
    }'

The model returns a valid JSON response.
