Chapter 10. Serving and inferencing with Podman using CPU (x86_64 AVX2)


Serve and run inference on a large language model with Podman and Red Hat AI Inference Server running on x86_64 CPUs with AVX2 instruction set support.

With CPU-only inference, you can run Red Hat AI Inference Server workloads on x86_64 CPUs without requiring GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments using smaller language models.

Important

CPU-only inference is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Note

AVX512 instruction set support is planned for a future release.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with an x86_64 CPU that supports the AVX2 instruction set:

    • Intel Haswell (2013) or newer processors
    • AMD Excavator (2015) or newer processors
  • You have a minimum of 16GB system RAM. 32GB or more is recommended for larger models.
Note

CPU inference is optimized for smaller models, typically under 3 billion parameters. For larger models or production workloads requiring higher throughput, consider using GPU acceleration.
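The prerequisite checks above can be scripted. The following is a minimal sketch, not part of the product: it parses `/proc/cpuinfo` for the AVX2 flag and `/proc/meminfo` for total RAM, mirroring the 16GB minimum noted above. The function names are illustrative.

```python
"""Hypothetical pre-flight check for CPU inference prerequisites.

Reads the Linux /proc filesystem to confirm AVX2 support and the
16 GB RAM minimum listed in the prerequisites.
"""

def has_avx2(cpuinfo_text: str) -> bool:
    # The kernel lists supported instruction sets on the "flags" lines.
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and " avx2 " in f" {line} ":
            return True
    return False

def ram_gib(meminfo_text: str) -> float:
    # MemTotal is reported in kiB, for example "MemTotal: 32617632 kB".
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / (1024 * 1024)
    return 0.0

if __name__ == "__main__":
    cpuinfo = open("/proc/cpuinfo").read()
    meminfo = open("/proc/meminfo").read()
    print("AVX2:", "yes" if has_avx2(cpuinfo) else "no")
    print(f"RAM: {ram_gib(meminfo):.1f} GiB (16 GiB minimum, 32 GiB recommended)")
```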

Procedure

  1. Open a terminal on your server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the CPU inference image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0
  3. Create a cache directory on the host and adjust its permissions so that the container can write to it. You mount this directory into the container in a later step.

    $ mkdir -p rhaiis-cache && chmod g+rwX rhaiis-cache
  4. Write your Hugging Face token to a private.env file, then source the file to export the HF_TOKEN variable.

    $ echo "export HF_TOKEN=<your_HF_token>" > private.env
    $ source private.env
  5. Verify that your CPU supports the AVX2 instruction set:

    $ grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not supported"
    Important

    If your CPU does not support AVX2, you cannot use CPU inference with Red Hat AI Inference Server.

  6. Start the AI Inference Server container image.

    $ podman run --rm -it \
    --security-opt=label=disable \
    --shm-size=4g -p 8000:8000 \
    --userns=keep-id:uid=1001 \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HF_HUB_OFFLINE=0" \
    --env "VLLM_CPU_KVCACHE_SPACE=4" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0 \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
    • --security-opt=label=disable: Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled. Without this option, the container might fail to start.
    • --shm-size=4g -p 8000:8000: Specifies the shared memory size and port mapping. Increase --shm-size to 8GB if you experience shared memory issues.
    • --userns=keep-id:uid=1001: Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass --user=0, but this is less secure because it runs vLLM as root inside the container.
    • --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN": Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
    • --env "VLLM_CPU_KVCACHE_SPACE=4": Allocates 4GB for the CPU key-value cache. Increase this value for larger models or longer context lengths. The default is 4GB.
    • -v ./rhaiis-cache:/opt/app-root/src/.cache:Z: Mounts the cache directory with SELinux context. The :Z suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
    • --model TinyLlama/TinyLlama-1.1B-Chat-v1.0: Specifies the Hugging Face model to serve. For CPU inference, use smaller models (under 3B parameters) for optimal performance.
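Model download and loading can take several minutes on first start. The following is a hedged sketch of a readiness check, assuming the vLLM OpenAI-compatible server's /health endpoint; the URL and timeout values are illustrative, so adjust them for your host.

```python
"""Hypothetical readiness poller for the container started above.

Polls the server's /health endpoint until it answers so that requests
are not sent while the model is still loading.
"""
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str, timeout_s: int = 300) -> bool:
    deadline = time.monotonic() + timeout_s
    health = base_url.rstrip("/") + "/health"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready("http://localhost:8000") else "timed out")
```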

Verification

  • In a separate terminal tab, send a request to the model API:

    $ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://<your_server_ip>:8000/v1/completions | jq

    The model returns a valid JSON response answering your question.
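You can also query the server from Python. The sketch below builds the same /v1/completions request body as the curl example and extracts the generated text from the JSON reply; the server address is illustrative, and the helper names are not part of any product API.

```python
"""Sketch of calling the completions endpoint from Python.

Mirrors the curl example: same request body, same endpoint.
"""
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 50) -> bytes:
    # Matches the body of the curl example above.
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()

def extract_text(response_json: dict) -> str:
    # OpenAI-style completions return the generated text in choices[0].text.
    return response_json["choices"][0]["text"]

if __name__ == "__main__":
    req = urllib.request.Request(
        "http://localhost:8000/v1/completions",
        data=build_request("What is the capital of France?"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(extract_text(json.load(resp)))
```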
