Chapter 10. Validated models for x86_64 CPU inference serving
The following large language models have been validated for use with Red Hat AI Inference on x86_64 CPUs.
| Model | Hugging Face model card | Number of parameters |
|---|---|---|
| TinyLlama-1.1B-Chat-v1.0 | | 1.1B |
| Llama-3.2-1B-Instruct | | 1B |
| granite-3.2-2b-instruct | | 2B |
| TinyLlama-1.1B-Chat-v1.0-pruned2.4 | | 1.1B (pruned) |
Quantization formats that require GPU-specific kernels, such as the Marlin format, are not supported for CPU inference. Instead, use AWQ or GPTQ quantization formats that are compatible with CPU execution.
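Before deploying a quantized checkpoint, you can inspect its `config.json` to see which quantization method it declares. The following Python sketch is illustrative only. The `describe_quantization` function is hypothetical, and the field names it reads (`quantization_config`, `quant_method`, `checkpoint_format`, `is_marlin_format`) are assumptions based on conventions commonly seen in Hugging Face model configurations; other export tools might record this information differently.

```python
import json
from typing import Optional, Tuple


def describe_quantization(config: dict) -> Tuple[Optional[str], bool]:
    """Return (quantization method, declares Marlin format) for a parsed config.json.

    Hypothetical helper: the field names below are assumed conventions,
    not a guaranteed schema.
    """
    quant_cfg = config.get("quantization_config") or {}
    method = quant_cfg.get("quant_method")
    method = method.lower() if isinstance(method, str) else None

    # Assumed markers for a Marlin-format checkpoint.
    is_marlin = (
        method == "marlin"
        or quant_cfg.get("checkpoint_format") == "marlin"
        or bool(quant_cfg.get("is_marlin_format", False))
    )
    return method, is_marlin


if __name__ == "__main__":
    # Inline example standing in for the contents of a model's config.json.
    example = json.loads('{"quantization_config": {"quant_method": "gptq", "bits": 4}}')
    method, is_marlin = describe_quantization(example)
    print(f"method={method}, marlin_format={is_marlin}")
```

A checkpoint for which the second value is `True` would be a candidate to replace with a non-Marlin AWQ or GPTQ variant before CPU deployment.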
The following table provides general guidance for approximate system RAM requirements based on model size:
| Model size | Minimum RAM | Recommended RAM |
|---|---|---|
| 125M - 500M | 8 GB | 16 GB |
| 500M - 1B | 16 GB | 32 GB |
| 1B - 3B | 32 GB | 64 GB |
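
The table above can be expressed as a small lookup, for example as part of a pre-deployment check. The following Python sketch is illustrative only. The `ram_guidance` function is hypothetical, and because the table does not state which tier a boundary value such as 1B belongs to, the sketch assumes that upper bounds are inclusive. Model sizes outside the ranges in the table return `None`.

```python
from typing import NamedTuple, Optional


class RamGuidance(NamedTuple):
    minimum_gb: int
    recommended_gb: int


# (upper bound in parameters, guidance) pairs taken from the table above.
# Treating each upper bound as inclusive is an assumption of this sketch.
_TIERS = [
    (500_000_000, RamGuidance(8, 16)),
    (1_000_000_000, RamGuidance(16, 32)),
    (3_000_000_000, RamGuidance(32, 64)),
]
_SMALLEST_COVERED = 125_000_000


def ram_guidance(num_parameters: int) -> Optional[RamGuidance]:
    """Return the table's RAM guidance, or None if the size is not covered."""
    if num_parameters < _SMALLEST_COVERED:
        return None
    for upper_bound, guidance in _TIERS:
        if num_parameters <= upper_bound:
            return guidance
    return None


if __name__ == "__main__":
    # A 1.1B-parameter model falls in the 1B - 3B tier.
    print(ram_guidance(1_100_000_000))
```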
Actual memory usage depends on the model architecture, context length, and batch size. When using longer context lengths, increase the value of the VLLM_CPU_KVCACHE_SPACE environment variable to allocate more memory for the key-value cache.
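
To choose a starting value for `VLLM_CPU_KVCACHE_SPACE`, you can estimate the key-value cache size from the model's attention configuration. The following Python sketch uses a common approximation: two tensors (keys and values) per layer, per KV head, per token. It rounds the result up to whole GiB, the unit that upstream vLLM documents for this variable. The function names and the example configuration values are hypothetical, and the estimate covers only the cache, not model weights or runtime overhead.

```python
import math
import os
from typing import Dict, Optional


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_length: int,
    batch_size: int,
    bytes_per_element: int = 2,  # assumes a 16-bit cache data type
) -> int:
    """Approximate KV cache size: keys and values for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_length * batch_size


def kvcache_space_gib(total_bytes: int) -> int:
    """Round up to a whole number of GiB, with a floor of 1."""
    return max(1, math.ceil(total_bytes / 2**30))


def serving_env(space_gib: int, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with VLLM_CPU_KVCACHE_SPACE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_CPU_KVCACHE_SPACE"] = str(space_gib)
    return env


if __name__ == "__main__":
    # Hypothetical model configuration, chosen for illustration only.
    size = kv_cache_bytes(
        num_layers=16, num_kv_heads=8, head_dim=64,
        context_length=8192, batch_size=4,
    )
    env = serving_env(kvcache_space_gib(size), base_env={})
    print(size, env)
```

The returned dictionary can be passed as the `env` argument of `subprocess.run`, or its value exported in the shell or container that starts the inference server. Treat the result as a lower bound and verify it against observed memory usage.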