
Chapter 10. Validated models for x86_64 CPU inference serving


The following large language models have been validated for use with Red Hat AI Inference Server on x86_64 CPUs with AVX2 instruction set support. CPU inference is optimized for smaller models that can run efficiently without GPU acceleration.

Note

x86_64 CPU inference is best suited for smaller models, typically under 3 billion parameters. Performance depends on your CPU specifications, available system RAM, and model size. For larger models or production workloads requiring high throughput, consider using GPU acceleration.
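As a minimal sketch of CPU serving, the following commands start a validated small model and query its OpenAI-compatible endpoint. This assumes a CPU-enabled vLLM build in which the CPU backend is selected automatically when no GPU is present; the port and context length shown are illustrative values, not product defaults.

```shell
# Serve a validated small model on CPU (assumes a CPU-enabled vLLM build).
# A short --max-model-len keeps the KV cache small on RAM-limited hosts.
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --port 8000 \
    --max-model-len 2048

# From another shell, query the OpenAI-compatible completions endpoint:
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "prompt": "Hello", "max_tokens": 16}'
```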

Important

x86_64 CPU inference serving is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Table 10.1. Validated models for inferencing with x86_64 CPU

Model                                      | Hugging Face model card                             | Number of parameters
TinyLlama-1.1B-Chat-v1.0                   | TinyLlama/TinyLlama-1.1B-Chat-v1.0                  | 1.1B
Llama-3.2-1B-Instruct                      | meta-llama/Llama-3.2-1B-Instruct                    | 1B
granite-3.2-2b-instruct                    | ibm-granite/granite-3.2-2b-instruct                 | 2B
TinyLlama-1.1B-Chat-v1.0-pruned2.4         | RedHatAI/TinyLlama-1.1B-Chat-v1.0-pruned2.4         | 1.1B (pruned)
TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds | RedHatAI/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds | 1.1B (pruned + quantized)
opt-125m                                   | facebook/opt-125m                                   | 125M
Qwen2-0.5B-Instruct-AWQ                    | Qwen/Qwen2-0.5B-Instruct-AWQ                        | 0.5B

Important

Quantization formats that require GPU-specific kernels, such as Marlin format, are not supported for CPU inference. Use AWQ or GPTQ quantization formats that are compatible with CPU execution.
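One way to check compatibility before serving is to inspect the checkpoint's config.json: quantized Hugging Face checkpoints conventionally record their format under quantization_config.quant_method. The following sketch assumes that convention; the helper function name is illustrative, not part of any product API.

```python
# Weight quantization formats that vLLM can execute on CPU.
CPU_COMPATIBLE = {"awq", "gptq"}

def cpu_compatible_quantization(config: dict) -> bool:
    """Return True if a checkpoint is unquantized or uses a
    CPU-compatible quantization format.

    `config` is the parsed contents of the model's config.json.
    Quantized checkpoints typically record their format under
    quantization_config.quant_method.
    """
    quant = config.get("quantization_config")
    if quant is None:
        return True  # unquantized weights run on CPU
    return quant.get("quant_method", "").lower() in CPU_COMPATIBLE

# An AWQ checkpoint, such as Qwen/Qwen2-0.5B-Instruct-AWQ:
print(cpu_compatible_quantization(
    {"quantization_config": {"quant_method": "awq", "bits": 4}}))   # True

# A Marlin-format checkpoint requires GPU-specific kernels:
print(cpu_compatible_quantization(
    {"quantization_config": {"quant_method": "marlin"}}))           # False
```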

The following table provides general guidance for approximate system RAM requirements based on model size:

Table 10.2. Memory requirements for inference serving with x86_64 CPU

Model size  | Minimum RAM | Recommended RAM
125M - 500M | 8 GB        | 16 GB
500M - 1B   | 16 GB       | 32 GB
1B - 3B     | 32 GB       | 64 GB
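The table values can be sanity-checked with a standard rule of thumb: unquantized fp16/bf16 weights occupy about 2 bytes per parameter, and the KV cache, activation buffers, and runtime overhead come on top of that. The helper below is a rough estimating sketch, not a product sizing tool.

```python
def approx_weight_ram_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate RAM needed just to hold the model weights, in GB.

    bytes_per_param: 2.0 for fp16/bf16, 4.0 for fp32, roughly
    0.5-1.0 for 4-bit or 8-bit quantized weights. Actual usage adds
    the KV cache and runtime overhead, which is why the recommended
    values in the table include substantial headroom.
    """
    return num_params * bytes_per_param / 1e9

# A 1.1B-parameter model in bf16 needs roughly 2.2 GB for weights alone:
print(round(approx_weight_ram_gb(1.1e9), 1))   # 2.2

# opt-125m in bf16 fits in about 0.25 GB of weight memory:
print(approx_weight_ram_gb(125e6))             # 0.25
```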

Note

Actual memory usage depends on the model architecture, context length, and batch size. Increase the VLLM_CPU_KVCACHE_SPACE environment variable to allocate more memory for the key-value cache when using longer context lengths.
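For example, the variable can be set in the serving environment before starting the server. The value of 8 shown here is illustrative; vLLM interprets VLLM_CPU_KVCACHE_SPACE as a size in GiB.

```shell
# Reserve 8 GiB of system RAM for the key-value cache (value is in GiB).
# Increase this when serving requests with longer context lengths.
export VLLM_CPU_KVCACHE_SPACE=8
```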
