Chapter 11. Inference serving language models on x86_64 CPUs


With CPU inference, you can run Red Hat AI Inference workloads on x86_64 processors without dedicated GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments by using smaller language models.

Important

CPU inference on x86_64 is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

CPU inference is useful where AI accelerator hardware is unavailable or impractical, such as development and testing environments, edge deployments, and small-scale setups. CPU inference works best with smaller models (under 3 billion parameters), delivers lower throughput than GPU-accelerated inference, and can demand substantial system memory (32 GB RAM or more for 1–3B parameter models). For high-throughput or large-model production workloads, GPU acceleration with NVIDIA CUDA or AMD ROCm is the recommended path.

The AI Inference CPU container image supports multiple instruction set architectures (ISAs) in a single build. The container automatically detects and uses the best available instruction set for your processor.

Table 11.1. Supported instruction set architectures

Instruction set                            Intel support                      AMD support
AVX2 (minimum)                             Haswell (2013) or newer            Excavator (2015) or newer
AVX512                                     Skylake-X (2017) or newer          Zen 4 (2022) or newer
AVX512 Advanced Matrix Extensions (AMX)    Sapphire Rapids (2023) or newer    Not supported

Note

AVX512 AMX provides hardware acceleration for matrix operations, which can improve inference performance for supported models. Because AMD processors do not support AMX, Intel Sapphire Rapids or newer processors deliver the best CPU inference performance.
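
The tiers in Table 11.1 can be compared with the flag list that the Linux kernel reports in /proc/cpuinfo. The following is a minimal sketch and not part of the product: best_isa is a hypothetical helper name, and the sketch assumes that the kernel flags avx2, avx512f, and amx_tile correspond to the AVX2, AVX512, and AVX512 AMX tiers. It does not reproduce the detection logic inside the container.

```shell
# best_isa is a hypothetical helper: given a space-separated CPU flag list,
# print the highest tier from Table 11.1 that the flags advertise.
best_isa() {
    local flags=" $1 "   # pad with spaces so each flag matches as a whole word
    case "$flags" in
        *" amx_tile "*) echo "AVX512 AMX" ;;
        *" avx512f "*)  echo "AVX512" ;;
        *" avx2 "*)     echo "AVX2" ;;
        *)              echo "unsupported" ;;
    esac
}

# Read the flag list of the first logical CPU when /proc/cpuinfo is available.
if [ -r /proc/cpuinfo ]; then
    cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2 || true)
    echo "Best available instruction set: $(best_isa "$cpu_flags")"
fi
```

A result of "unsupported" means that the AVX2 minimum is not met on that host.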

11.1. Serve a model with Podman by using CPU on x86_64

Serve a large language model and run inference on it with Podman and Red Hat AI Inference on x86_64 CPUs.

With CPU-only inference, you can run Red Hat AI Inference workloads on x86_64 CPUs without requiring GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments by using smaller language models. The CPU container image supports AVX2, AVX512, and AVX512 Advanced Matrix Extensions (AMX) instruction sets in a single build by automatically detecting and using the best available instruction set for your processor.

Important

Inference serving with AI Inference on x86_64 CPU is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

  • You have installed Podman or Docker.
  • You have logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with an x86_64 CPU that supports at least the AVX2 instruction set:

    • AVX2, minimum requirement: Intel Haswell (2013) or newer, or Advanced Micro Devices (AMD) Excavator (2015) or newer
    • AVX512: Intel Skylake-X (2017) or newer, or AMD Zen 4 (2022) or newer
    • AVX512 AMX: Intel Sapphire Rapids (2023) or newer

    The container automatically detects and uses the best available instruction set for your processor.

  • You have a minimum of 16 GB system RAM. Red Hat recommends 32 GB or more for larger models.
Note

CPU inference works best for smaller models, typically under 3 billion parameters. For larger models or production workloads requiring higher throughput, consider using GPU acceleration.
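
The memory prerequisite can be checked before you pull the image. The following is a minimal sketch under stated assumptions: classify_ram is a hypothetical helper name, the thresholds are the 16 GB minimum and the 32 GB recommendation listed in the prerequisites, and the MemTotal value is rounded up to the next GiB because the kernel reports slightly less than the installed RAM.

```shell
# classify_ram is a hypothetical helper. It takes MemTotal in kB and compares it
# with the 16 GB minimum and the 32 GB recommendation from the prerequisites.
classify_ram() {
    local mem_kb=$1
    # Round up to the next GiB (assumption: MemTotal is a little below installed RAM).
    local mem_gib=$(( (mem_kb + 1048575) / 1048576 ))
    if [ "$mem_gib" -ge 32 ]; then
        echo "recommended"
    elif [ "$mem_gib" -ge 16 ]; then
        echo "minimum"
    else
        echo "insufficient"
    fi
}

# Classify the local host when /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if [ -n "$mem_kb" ]; then
        echo "System RAM check: $(classify_ram "$mem_kb")"
    fi
fi
```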

Procedure

  1. Open a terminal on your server host and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the CPU inference image by running the following command:

    $ podman pull registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0
  3. Create a cache directory to mount into the container as a volume, and adjust its permissions so that the container can write to it.

    $ mkdir -p rhaii-cache && chmod g+rwX rhaii-cache
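  4. Write your Hugging Face token to the private.env file as the HF_TOKEN variable, and then source the file. The following command overwrites an existing private.env file. To append to an existing file, use >> instead of >.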

    $ echo "export HF_TOKEN=<your_HF_token>" > private.env
    $ source private.env
  5. Verify your CPU instruction set support:

    $ grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not supported"
    $ grep -q avx512f /proc/cpuinfo && echo "AVX512 supported" || echo "AVX512 not supported"
    $ grep -q amx_tile /proc/cpuinfo && echo "AVX512 AMX supported" || echo "AVX512 AMX not supported"
    Important

    AVX2 is the minimum requirement. If your CPU does not support AVX2, you cannot use CPU inference with Red Hat AI Inference. AVX512 and AVX512 AMX improve performance when available, but are not required.

  6. Start the AI Inference container image.

    $ podman run --rm -it \
    --security-opt=label=disable \
    --shm-size=4g -p 8000:8000 \
    --userns=keep-id:uid=1001 \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HF_HUB_OFFLINE=0" \
    --env "VLLM_CPU_KVCACHE_SPACE=4" \
    --env "LD_PRELOAD=/usr/lib64/libomp.so" \
    -v ./rhaii-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0 \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
    • --security-opt=label=disable: Disables SELinux label separation for the container so that it can access the mounted volume. Required for systems with SELinux enabled. Without this option, the container might fail to start.
    • --shm-size=4g -p 8000:8000: Sets the shared memory size to 4 GB and maps container port 8000 to host port 8000. If you experience shared memory issues, increase the value to --shm-size=8g.
    • --userns=keep-id:uid=1001: Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure because it runs vLLM as root inside the container.
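    • --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN": Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
    • --env "HF_HUB_OFFLINE=0": Keeps Hugging Face Hub access online so that the container can download the model on first start.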
    • --env "VLLM_CPU_KVCACHE_SPACE=4": Allocates 4 GB for the CPU key-value cache. This value is suitable for smaller models such as TinyLlama. For larger models such as Llama 3.1 8B, increase this value to 8 GB or more depending on your model size, context length, and available RAM.
    • --env "LD_PRELOAD=/usr/lib64/libomp.so": Preloads the OpenMP library for optimal CPU inference performance. Without this setting, you might experience degraded throughput, increased latency, and instability, especially with more than 16 threads. The default LD_PRELOAD for this image sets jemalloc, which is not supported for vLLM inference. Setting LD_PRELOAD to libomp.so overrides that default. If you require PyArrow usage in this image, set LD_PRELOAD=/usr/lib64/libjemalloc.so.2:/usr/lib64/libomp.so, but expect degraded vLLM performance.
    • -v ./rhaii-cache:/opt/app-root/src/.cache:Z: Mounts the cache directory with SELinux context. The :Z suffix is required for systems with SELinux enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
    • --model TinyLlama/TinyLlama-1.1B-Chat-v1.0: Specifies the Hugging Face model to serve. For CPU inference, use smaller models (under 3B parameters) for optimal performance.
    Note

    At startup, the container logs a warning: WARNING: libtcmalloc is not found in LD_PRELOAD. This may slow down the performance. You can ignore this warning for 3.4 because tcmalloc is not included in the CPU image. The LD_PRELOAD setting for libomp.so delivers the best available performance for this release.
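
The VLLM_CPU_KVCACHE_SPACE value in step 6 can be related to context length with some rough arithmetic. The following is a minimal sketch and not the sizing logic used by vLLM: kv_cache_tokens is a hypothetical helper name, the value is treated as GiB, and the per-token cost is assumed to be 2 (key and value) x layers x KV heads x head size x bytes per element. The model shapes passed in the examples (22 layers, 4 KV heads, head size 64 for TinyLlama 1.1B; 32 layers, 8 KV heads, head size 128 for Llama 3.1 8B; 2-byte elements) are assumptions taken from the published model configurations.

```shell
# kv_cache_tokens is a hypothetical helper that estimates how many tokens fit
# in a key-value cache of a given size.
#   $1 = cache size in GiB, $2 = layers, $3 = KV heads, $4 = head size,
#   $5 = bytes per element (default 2, for example bfloat16)
kv_cache_tokens() {
    local cache_gib=$1 layers=$2 kv_heads=$3 head_dim=$4 bytes_per_elem=${5:-2}
    # Each token stores one key and one value vector per layer and KV head.
    local bytes_per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
    echo $(( cache_gib * 1024 * 1024 * 1024 / bytes_per_token ))
}

# Assumed TinyLlama 1.1B shape with a 4 GiB cache.
kv_cache_tokens 4 22 4 64     # prints 190650

# Assumed Llama 3.1 8B shape with an 8 GiB cache.
kv_cache_tokens 8 32 8 128    # prints 65536
```

Under these assumptions, the larger model fits about a third as many tokens in twice the cache space, which is consistent with the guidance to raise the value for larger models, longer contexts, or more concurrent requests.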

Verification

  • In a separate terminal tab, make a request to the model with the API.

    $ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://<your_server_ip>:8000/v1/completions | jq

    The model returns a valid JSON response.
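
On first start, the server might not answer immediately because it has to download and load the model. The following is a minimal sketch of a retry loop that you can run before the verification request. wait_until is a hypothetical helper name, and the example in the comment assumes that the server listens on localhost:8000 and exposes the /health endpoint of the vLLM OpenAI-compatible server.

```shell
# wait_until is a hypothetical helper: it retries a command until the command
# succeeds or the attempt budget runs out.
#   $1 = maximum attempts, $2 = seconds to sleep between attempts, rest = command
wait_until() {
    local attempts=$1 delay=$2
    shift 2
    local i
    for (( i = 1; i <= attempts; i++ )); do
        if "$@" >/dev/null 2>&1; then
            return 0    # the command succeeded
        fi
        sleep "$delay"
    done
    return 1            # the command never succeeded
}

# Example (assumed host, port, and endpoint):
#   wait_until 60 5 curl -sf http://localhost:8000/health && echo "Server is ready"
```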

11.2. Deploy a model on OpenShift Container Platform by using CPU on x86_64

Deploy a language model on OpenShift Container Platform by using Red Hat AI Inference with CPU inference. CPU inference provides a cost-effective option for development, testing, and small-scale deployments without GPU hardware.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.
  • You have access to worker nodes with x86_64 CPUs that support at least the AVX2 instruction set.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

CPU inference does not require the Node Feature Discovery (NFD) Operator or a GPU Operator. The CPU container supports AVX2, AVX512, and AVX512 Advanced Matrix Extensions (AMX) instruction sets. The container automatically detects and uses the best available instruction set.

Procedure

  1. Create a namespace for the deployment:

    $ oc new-project rhaii-cpu
  2. Create a Secret for the Hugging Face token:

    $ export HF_TOKEN=<your_HF_token>
    $ oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n rhaii-cpu
  3. Create an image pull secret so that the cluster can pull the Red Hat AI Inference image from the container registry:

    $ oc create secret generic docker-secret \
        --from-file=.dockerconfigjson=$HOME/.docker/config.json \
        --type=kubernetes.io/dockerconfigjson -n rhaii-cpu
    Note

    If you authenticated with podman login instead of docker login, your credentials are stored in $XDG_RUNTIME_DIR/containers/auth.json. Use that path instead of $HOME/.docker/config.json.

  4. Create a PersistentVolumeClaim (PVC) for model storage. Save the following YAML to a file named pvc.yaml:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-cache
      namespace: rhaii-cpu
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi

    Apply the PVC:

    $ oc apply -f pvc.yaml
  5. Create a Deployment for CPU inference. Save the following YAML to a file named deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tinyllama-cpu
      namespace: rhaii-cpu
      labels:
        app: tinyllama-cpu
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: tinyllama-cpu
      template:
        metadata:
          labels:
            app: tinyllama-cpu
        spec:
          imagePullSecrets:
            - name: docker-secret
          volumes:
            - name: model-volume
              persistentVolumeClaim:
                claimName: model-cache
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "4Gi"
          containers:
            - name: vllm-cpu
              image: 'registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0'
              imagePullPolicy: IfNotPresent
              env:
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: HF_TOKEN
                - name: HF_HUB_OFFLINE
                  value: '0'
                - name: VLLM_CPU_KVCACHE_SPACE
                  value: '4'  # Increase to 8+ for larger models like Llama 3.1 8B
                - name: LD_PRELOAD
                  value: '/usr/lib64/libomp.so'
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=TinyLlama/TinyLlama-1.1B-Chat-v1.0'
              ports:
                - containerPort: 8000
                  protocol: TCP
              resources:
                limits:
                  cpu: '<cpu_limit>'
                  memory: '<memory_limit>'
                requests:
                  cpu: '<cpu_request>'
                  memory: '<memory_request>'
              volumeMounts:
                - name: model-volume
                  mountPath: /opt/app-root/src/.cache
                - name: shm
                  mountPath: /dev/shm
          restartPolicy: Always

    where:

    <cpu_request>
    Specifies the minimum CPU cores for the container. Use 4 as a starting value for TinyLlama.
    <cpu_limit>
    Specifies the maximum CPU cores. Use 8 as a starting value.
    <memory_request>
    Specifies the minimum memory. Use 8Gi as a starting value for TinyLlama.
    <memory_limit>
    Specifies the maximum memory. Use 16Gi as a starting value.

    Note

    Adjust resource values based on your model size, context length, and expected concurrent requests.

    For more information, see Validated models for x86_64 CPU inference serving.

    Note

    The VLLM_CPU_KVCACHE_SPACE value of 4 is suitable for smaller models such as TinyLlama. Adjust this value based on your model size, the context length that you set with the --max-model-len argument, and the expected number of concurrent requests. For larger models such as Llama 3.1 8B, or for workloads with longer context lengths and higher concurrency, increase this value to 8 or more depending on your available RAM.

    Apply the deployment:

    $ oc apply -f deployment.yaml
    Note

    The container logs a warning at startup: WARNING: libtcmalloc is not found in LD_PRELOAD. This may slow down the performance. You can ignore this warning for 3.4 because tcmalloc is not included in the CPU image. The LD_PRELOAD setting for libomp.so delivers the best available performance for this release.

  6. Watch the deployment to verify that it succeeds:

    $ oc get deployment -n rhaii-cpu --watch

    Example output

    NAME            READY   UP-TO-DATE   AVAILABLE   AGE
    tinyllama-cpu   0/1     1            0           10s
    tinyllama-cpu   1/1     1            1           45s

  7. Create a Service for model inference. Save the following YAML to a file named service.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: tinyllama-cpu
      namespace: rhaii-cpu
    spec:
      selector:
        app: tinyllama-cpu
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000

    Apply the service:

    $ oc apply -f service.yaml
  8. Create a Route to expose the model. Save the following YAML to a file named route.yaml:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: tinyllama-cpu
      namespace: rhaii-cpu
    spec:
      to:
        kind: Service
        name: tinyllama-cpu
      port:
        targetPort: 8000  # Pod port that the Service forwards to
    Note

    This route exposes the inference endpoint over plain HTTP. For production deployments, add TLS edge termination to encrypt prompts and model responses in transit.

    Apply the route:

    $ oc apply -f route.yaml
  9. Get the URL for the exposed route:

    $ oc get route tinyllama-cpu -n rhaii-cpu -o jsonpath='{.spec.host}'

    Example output

    tinyllama-cpu-rhaii-cpu.apps.example.com
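
Step 3 reads registry credentials from one of two files, depending on whether you logged in with podman or docker. The following is a minimal sketch that selects whichever file exists. find_registry_auth is a hypothetical helper name, and checking the Podman location first is an arbitrary choice of this sketch.

```shell
# find_registry_auth is a hypothetical helper: print the first registry
# credentials file that exists, or return 1 if neither location is present.
find_registry_auth() {
    local candidate
    for candidate in \
        "${XDG_RUNTIME_DIR:-}/containers/auth.json" \
        "${HOME:-}/.docker/config.json"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example, using the secret name and namespace from step 3:
#   auth_file=$(find_registry_auth) && oc create secret generic docker-secret \
#       --from-file=.dockerconfigjson="$auth_file" \
#       --type=kubernetes.io/dockerconfigjson -n rhaii-cpu
```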

Verification

  • Query the model to verify the deployment:

    $ curl -X POST http://<route_hostname>/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "prompt": "What is the capital of France?",
          "max_tokens": 50
        }'

    The model returns a valid JSON response.
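
When you send several prompts, it can help to generate the request body instead of editing the JSON by hand. The following is a minimal sketch: completion_payload is a hypothetical helper name, the field names are the ones used in the request above, and the escaping covers only backslashes and double quotes in the prompt.

```shell
# completion_payload is a hypothetical helper that prints a JSON request body.
#   $1 = model name, $2 = prompt, $3 = max_tokens (default 50)
# Only backslashes and double quotes in the prompt are escaped; newlines and
# other control characters are not handled.
completion_payload() {
    local model=$1 prompt=$2 max_tokens=${3:-50}
    prompt=${prompt//\\/\\\\}
    prompt=${prompt//\"/\\\"}
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' \
        "$model" "$prompt" "$max_tokens"
}

# Example (replace <route_hostname> with the host name of your route):
#   completion_payload "TinyLlama/TinyLlama-1.1B-Chat-v1.0" "What is the capital of France?" \
#     | curl -s -X POST "http://<route_hostname>/v1/completions" \
#         -H "Content-Type: application/json" -d @-
```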

© 2026 Red Hat