Chapter 10. Validated models for x86_64 CPU inference serving

The following large language models have been validated for use with Red Hat AI Inference Server on x86_64 CPUs with AVX2 instruction set support. CPU inference is optimized for smaller models that can run efficiently without GPU acceleration.
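
As a quick pre-flight check for the AVX2 requirement, the following minimal sketch parses the `flags` line of `/proc/cpuinfo` on a Linux host. The helper names `cpu_flags` and `has_avx2` are illustrative and are not part of Red Hat AI Inference Server.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from /proc/cpuinfo-style text (first CPU entry)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx2(cpuinfo_text: str) -> bool:
    """True if the 'avx2' flag is present in the given cpuinfo text."""
    return "avx2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("AVX2 supported:", has_avx2(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; this check applies to Linux hosts only")
```

On hosts where `lscpu` is available, its `Flags` field reports the same information.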

Note

x86_64 CPU inference is best suited for smaller models, typically under 3 billion parameters. Performance depends on your CPU specifications, available system RAM, and model size. For larger models or production workloads requiring high throughput, consider using GPU acceleration.

Important

Inference serving on x86_64 CPUs is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Table 10.1. Validated models for inferencing with x86_64 CPU
Model	Hugging Face model card	Number of parameters
TinyLlama-1.1B-Chat-v1.0	TinyLlama/TinyLlama-1.1B-Chat-v1.0	1.1B
Llama-3.2-1B-Instruct	meta-llama/Llama-3.2-1B-Instruct	1B
granite-3.2-2b-instruct	ibm-granite/granite-3.2-2b-instruct	2B
TinyLlama-1.1B-Chat-v1.0-pruned2.4	RedHatAI/TinyLlama-1.1B-Chat-v1.0-pruned2.4	1.1B (pruned)
TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds	RedHatAI/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds	1.1B (pruned + quantized)
opt-125m	facebook/opt-125m	125M
Qwen2-0.5B-Instruct-AWQ	Qwen/Qwen2-0.5B-Instruct-AWQ	0.5B

Important

Quantization formats that require GPU-specific kernels, such as Marlin format, are not supported for CPU inference. Use AWQ or GPTQ quantization formats that are compatible with CPU execution.
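
One way to screen a checkpoint before deployment is to inspect the `quantization_config` block in its `config.json`. The sketch below is a heuristic only. It assumes the checkpoint records its format in a `quant_method` field, as Hugging Face Transformers does for AWQ and GPTQ models, and it assumes that Marlin-serialized checkpoints include "marlin" in that value. The function name and return labels are hypothetical.

```python
import json

# Formats the text above names as compatible with CPU execution.
CPU_COMPATIBLE = {"awq", "gptq"}


def classify_quantization(config: dict) -> str:
    """Classify a parsed config.json by its declared quantization method.

    Heuristic: relies on quantization_config.quant_method being present
    and on Marlin formats containing "marlin" in that value (assumption).
    """
    quant = config.get("quantization_config")
    if not quant:
        return "unquantized"
    method = str(quant.get("quant_method", "")).lower()
    if "marlin" in method:
        return "gpu-only"
    if method in CPU_COMPATIBLE:
        return "cpu-compatible"
    # Other formats are not covered by the guidance above.
    return "unknown"


example = json.loads('{"quantization_config": {"quant_method": "awq", "bits": 4}}')
print(classify_quantization(example))  # cpu-compatible
```

A result of `unknown` means only that the format is neither AWQ, GPTQ, nor recognizably Marlin. It does not mean the model is unsupported.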

The following table provides general guidance for approximate system RAM requirements based on model size:

Table 10.2. Memory requirements for inference serving with x86_64 CPU
Model size	Minimum RAM	Recommended RAM
125M - 500M	8GB	16GB
500M - 1B	16GB	32GB
1B - 3B	32GB	64GB
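
The guidance in Table 10.2 can be expressed as a small lookup, sketched below. The table's ranges share their boundary values (500M and 1B each appear in two rows), so this sketch assumes each upper bound is inclusive. That choice is an assumption, not something the table states. The function name `ram_guidance` is illustrative.

```python
from typing import Optional, Tuple

# (upper bound in parameters, minimum RAM in GB, recommended RAM in GB), per Table 10.2.
RAM_TIERS = [
    (500_000_000, 8, 16),
    (1_000_000_000, 16, 32),
    (3_000_000_000, 32, 64),
]
SMALLEST_COVERED = 125_000_000  # the table starts at 125M parameters


def ram_guidance(num_params: int) -> Optional[Tuple[int, int]]:
    """Return (minimum_gb, recommended_gb), or None if the table does not cover the size.

    Assumption: upper bounds are inclusive, so a 500M model maps to the first row.
    """
    if num_params < SMALLEST_COVERED:
        return None
    for upper_bound, minimum_gb, recommended_gb in RAM_TIERS:
        if num_params <= upper_bound:
            return (minimum_gb, recommended_gb)
    return None


print(ram_guidance(1_100_000_000))  # (32, 64)
```

Under this reading, a 1.1B model such as TinyLlama-1.1B-Chat-v1.0 falls in the 1B to 3B row.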

Note

Actual memory usage depends on the model architecture, context length, and batch size. Increase the VLLM_CPU_KVCACHE_SPACE environment variable to allocate more memory for the key-value cache when using longer context lengths.
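
As an illustration, the following sketch prepares a process environment with `VLLM_CPU_KVCACHE_SPACE` set. Upstream vLLM documents the value as a size in GiB; check the documentation for your release before relying on that. The helper name `env_with_kv_cache` is hypothetical, and how the environment reaches the server (container runtime option, systemd unit, or shell export) depends on your deployment.

```python
import os
from typing import Mapping, Optional


def env_with_kv_cache(kv_cache_gib: int,
                      base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with VLLM_CPU_KVCACHE_SPACE set.

    kv_cache_gib is interpreted as GiB, following upstream vLLM documentation
    (assumption; confirm for your version).
    """
    if kv_cache_gib <= 0:
        raise ValueError("KV cache size must be a positive number of GiB")
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_CPU_KVCACHE_SPACE"] = str(kv_cache_gib)
    return env


# Example: reserve 8 GiB for the key-value cache, then pass `env` to whatever
# launches the inference server process.
env = env_with_kv_cache(8, base_env={"PATH": "/usr/bin"})
print(env["VLLM_CPU_KVCACHE_SPACE"])  # 8
```

Memory reserved for the key-value cache comes out of system RAM in addition to the model weights, so size it against the guidance in Table 10.2.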