
Chapter 10. Serving and inferencing with Podman using CPU (x86_64 AVX2)


Serve and run inference on a large language model with Podman and Red Hat AI Inference Server running on x86_64 CPUs with AVX2 instruction set support.

With CPU-only inference, you can run Red Hat AI Inference Server workloads on x86_64 CPUs without requiring GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments using smaller language models.

Important

CPU-only inference is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Note

AVX512 instruction set support is planned for a future release.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with an x86_64 CPU that supports the AVX2 instruction set:

    • Intel Haswell (2013) or newer processors
    • AMD Excavator (2015) or newer processors
  • You have a minimum of 16GB system RAM. 32GB or more is recommended for larger models.
Note

CPU inference is optimized for smaller models, typically under 3 billion parameters. For larger models or production workloads requiring higher throughput, consider using GPU acceleration.
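As a rough sizing check, the model weights alone need approximately parameters × bytes-per-parameter of RAM. The following sketch uses a 1.1B-parameter model in 16-bit precision; the figures are illustrative only, and actual memory usage also includes the KV cache and runtime overhead:

```shell
# Rough weight-memory estimate: parameters x bytes per parameter.
# 1.1B parameters at 2 bytes each (fp16/bf16) -- illustrative only.
PARAMS=1100000000
BYTES_PER_PARAM=2
WEIGHT_GIB=$(( PARAMS * BYTES_PER_PARAM / 1024 / 1024 / 1024 ))
echo "Approximate weight memory: ${WEIGHT_GIB} GiB"
```

By this estimate, a 1.1B-parameter model needs roughly 2 GiB for weights alone, which is why 16GB of system RAM is a workable minimum for small models.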

Procedure

  1. Open a terminal on your server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the CPU inference image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0
  3. Create a cache directory on the host and adjust its permissions so that the container can use it. You mount this directory into the container when you start it.

    $ mkdir -p rhaiis-cache && chmod g+rwX rhaiis-cache
  4. Write your Hugging Face token to the private.env file, then source the file:

    $ echo "export HF_TOKEN=<your_HF_token>" > private.env
    $ source private.env
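Optionally, you can confirm that the token was exported without printing the secret itself. This is a small illustrative check, not part of the official procedure:

```shell
# Report whether HF_TOKEN is set, without echoing its value.
if [ -n "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is set"
else
    echo "HF_TOKEN is not set"
fi
```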
  5. Verify that your CPU supports the AVX2 instruction set:

    $ grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not supported"
    Important

    If your CPU does not support AVX2, you cannot use CPU inference with Red Hat AI Inference Server.

  6. Start the AI Inference Server container image.

    $ podman run --rm -it \
    --security-opt=label=disable \
    --shm-size=4g -p 8000:8000 \
    --userns=keep-id:uid=1001 \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HF_HUB_OFFLINE=0" \
    --env "VLLM_CPU_KVCACHE_SPACE=4" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0 \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
    • --security-opt=label=disable: Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled. Without this option, the container might fail to start.
    • --shm-size=4g -p 8000:8000: Specifies the shared memory size and port mapping. Increase --shm-size to 8GB if you experience shared memory issues.
    • --userns=keep-id:uid=1001: Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass --user=0, but this is less secure because it runs vLLM as root inside the container.
    • --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN": Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
    • --env "VLLM_CPU_KVCACHE_SPACE=4": Allocates 4GB for the CPU key-value cache. Increase this value for larger models or longer context lengths. The default is 4GB.
    • -v ./rhaiis-cache:/opt/app-root/src/.cache:Z: Mounts the cache directory with SELinux context. The :Z suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
    • --model TinyLlama/TinyLlama-1.1B-Chat-v1.0: Specifies the Hugging Face model to serve. For CPU inference, use smaller models (under 3B parameters) for optimal performance.
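To estimate how far the 4GB VLLM_CPU_KVCACHE_SPACE budget stretches, you can compute the per-token key-value cache size as 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The figures below use TinyLlama-1.1B's published configuration (22 layers, 4 KV heads, head dimension 64, 16-bit cache values); treat this as an illustrative sketch and check your model's configuration for its actual values:

```shell
# Per-token KV cache bytes: 2 (K and V) x layers x kv_heads x head_dim x dtype_bytes.
# Model figures below are TinyLlama-1.1B's config -- substitute your model's values.
LAYERS=22
KV_HEADS=4
HEAD_DIM=64
DTYPE_BYTES=2
PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES ))
BUDGET=$(( 4 * 1024 * 1024 * 1024 ))   # VLLM_CPU_KVCACHE_SPACE=4 -> 4 GiB
echo "KV cache per token: ${PER_TOKEN} bytes"
echo "Approximate cached-token capacity: $(( BUDGET / PER_TOKEN ))"
```

For this small model the 4GB default comfortably covers long contexts and many concurrent requests; larger models with more layers and KV heads consume the budget far faster.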

Verification

  • In a separate terminal tab, make a request to the model by using the API.

    curl -X POST -H "Content-Type: application/json" -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://<your_server_ip>:8000/v1/completions | jq

    The model returns a valid JSON response answering your question.
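The completions endpoint returns an OpenAI-style JSON body. The following sketch pulls out just the generated text with jq, using an abbreviated, illustrative response; a real response includes additional fields such as usage and finish_reason, and the generated text varies by model:

```shell
# Abbreviated, illustrative response body -- real output varies by model and request.
RESPONSE='{"id":"cmpl-1","object":"text_completion","choices":[{"index":0,"text":"The capital of France is Paris."}]}'
echo "$RESPONSE" | jq -r '.choices[0].text'
```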
