Chapter 10. Validated models for x86_64 CPU inference serving

The following large language models have been validated for use with Red Hat AI Inference on x86_64 CPUs.

Table 10.1. Validated models for inferencing with x86_64 CPU
Model	Hugging Face model card	Number of parameters
TinyLlama-1.1B-Chat-v1.0	TinyLlama/TinyLlama-1.1B-Chat-v1.0	1.1B
Llama-3.2-1B-Instruct	meta-llama/Llama-3.2-1B-Instruct	1B
granite-3.2-2b-instruct	ibm-granite/granite-3.2-2b-instruct	2B
TinyLlama-1.1B-Chat-v1.0-pruned2.4	RedHatAI/TinyLlama-1.1B-Chat-v1.0-pruned2.4	1.1B (pruned)

Important

Quantization formats that require GPU-specific kernels, such as the Marlin format, are not supported for CPU inference. Use AWQ or GPTQ quantization formats that are compatible with CPU execution.

The following table provides general guidance for approximate system RAM requirements based on model size:

Table 10.2. Memory requirements for inference serving with x86_64 CPU
Model size	Minimum RAM	Recommended RAM
125M - 500M	8 GB	16 GB
500M - 1B	16 GB	32 GB
1B - 3B	32 GB	64 GB
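
The tiers in Table 10.2 can be expressed as a small lookup. The following is a minimal sketch, not part of the product: the `ram_guidance` helper and the `RAM_GUIDANCE` list are hypothetical names, and because adjacent ranges in the table share their boundary values (500M and 1B), the sketch assumes that a boundary value belongs to the larger tier.

```python
from typing import Optional, Tuple

# Tiers copied from Table 10.2:
# (lower bound, upper bound, minimum RAM in GB, recommended RAM in GB)
RAM_GUIDANCE = [
    (125e6, 500e6, 8, 16),
    (500e6, 1e9, 16, 32),
    (1e9, 3e9, 32, 64),
]


def ram_guidance(num_params: float) -> Optional[Tuple[int, int]]:
    """Return (minimum_gb, recommended_gb) for a parameter count, or None.

    Assumption: a value on a shared boundary (500M, 1B) falls into the
    larger tier, so the tiers are checked from largest to smallest.
    """
    for low, high, minimum_gb, recommended_gb in reversed(RAM_GUIDANCE):
        if low <= num_params <= high:
            return minimum_gb, recommended_gb
    return None  # outside the ranges covered by the table


print(ram_guidance(1.1e9))  # a 1.1B-parameter model
print(ram_guidance(7e9))    # larger than any tier in the table
```

A `None` result means the table gives no guidance for that size; it does not mean that the model cannot be served.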

Note

Actual memory usage depends on the model architecture, context length, and batch size. When using longer context lengths, increase the value of the VLLM_CPU_KVCACHE_SPACE environment variable to allocate more memory for the key-value cache.
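
As a rough illustration of why longer contexts and larger batches need more key-value cache, the cache size can be estimated from the model shape. The sketch below assumes a standard transformer cache layout (one key vector and one value vector per layer, per KV head, per token), 2-byte cache values, and that VLLM_CPU_KVCACHE_SPACE takes a whole number of GiB, as described in the upstream vLLM documentation. The function names and the shape values in the example are illustrative placeholders; read the real values from the configuration of the model you serve.

```python
import math
import os


def kv_cache_gib(num_layers, num_kv_heads, head_dim,
                 context_len, batch_size, bytes_per_value=2):
    """Estimate KV cache size in GiB under the assumed cache layout."""
    # Factor of 2: one key vector and one value vector per token,
    # per layer, per KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_len * batch_size / 2**30


def kv_cache_env(estimate_gib, base_env=None):
    """Return a copy of the environment with VLLM_CPU_KVCACHE_SPACE set.

    Assumption: the variable takes a whole number of GiB, so the
    estimate is rounded up, with a floor of 1.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_CPU_KVCACHE_SPACE"] = str(max(1, math.ceil(estimate_gib)))
    return env


# Placeholder shape values, not taken from any validated model's configuration.
estimate = kv_cache_gib(num_layers=22, num_kv_heads=4, head_dim=64,
                        context_len=2048, batch_size=8)
env = kv_cache_env(estimate, base_env={})
print(f"{estimate:.3f} GiB -> VLLM_CPU_KVCACHE_SPACE={env['VLLM_CPU_KVCACHE_SPACE']}")
```

The estimate covers only the cache. Model weights and runtime overhead come on top of it, so treat the result as a lower bound when sizing a system against Table 10.2.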
