Chapter 2. Version 3.3.1 release notes

Red Hat AI Inference Server 3.3.1 is a maintenance release containing security fixes, bug fixes, and minor enhancements.

The following container images are available from registry.redhat.io:

registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.1
registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.1
registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.1

2.1. Security fixes

Red Hat AI Inference Server 3.3.1 addresses the following CVEs:

2.2. Bug fixes

Streaming tool calls with Mistral models returned invalid JSON: Streaming tool calls using Mistral models through the Anthropic-compatible /v1/messages endpoint returned invalid JSON, preventing clients from parsing responses. The JSON serialization for streaming tool calls is corrected. Mistral models using OpenAI-compatible endpoints and non-Mistral models using the /v1/messages endpoint were not affected.
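For context on why invalid JSON blocks clients: a streaming client typically concatenates the argument fragments it receives for a tool call and parses the result once the stream ends. The following is a minimal client-side sketch under that assumption. The helper name assemble_tool_input and the sample fragments are hypothetical and are not part of Red Hat AI Inference Server or vLLM.

```python
import json


def assemble_tool_input(fragments):
    """Join streamed JSON fragments and parse them once the stream ends.

    Hypothetical client-side helper for illustration only.
    """
    raw = "".join(fragments)
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # This is the failure a client would see if the server emitted
        # fragments that do not add up to a valid JSON document.
        raise ValueError(f"tool call arguments are not valid JSON: {exc}") from exc


# Hypothetical fragments, as a client might collect them from streamed deltas.
fragments = ['{"city": "Par', 'is", "unit": "celsius"}']
print(assemble_tool_input(fragments))
```

If the joined fragments do not form valid JSON, the helper raises ValueError, which mirrors the parsing failure described in this fix.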

Quantized Llama-4 models produced incorrect results: Quantization scales were not permuted correctly for attention layers in Llama-4 models, causing accuracy collapse in models such as Llama-Guard-4-12B when attention layers were quantized. Quantization scales are permuted correctly for Llama-4 attention layers.

Encoder models failed on AMD ROCm AI accelerators: The Triton attention backend did not support encoder self-attention, causing encoder-only models, encoder-decoder models, and Whisper speech-to-text models to fail on AMD ROCm AI accelerators. The Triton attention backend supports encoder self-attention.

GPT-OSS models returned empty content in multi-turn conversations: The reasoning parser incorrectly matched markers from previous messages in multi-turn conversations, causing GPT-OSS models to return content: null when using json_object response format. The reasoning parser correctly handles multi-turn conversations.

Malformed json_schema requests returned HTTP 500 instead of HTTP 400: When response_format type was json_schema but the json_schema field was missing, an assertion error caused the server to return HTTP 500 Internal Server Error instead of HTTP 400 Bad Request. The server validates the json_schema field before processing and returns the correct HTTP 400 Bad Request error.
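The validate-before-processing behavior can be pictured with a short sketch. This is an illustration of the general pattern, assuming the request body has already been decoded into a Python dict. The function validate_response_format and its messages are hypothetical and do not reflect the server's actual implementation.

```python
def validate_response_format(response_format):
    """Return (status_code, message) for a decoded response_format value.

    Hypothetical helper: shows checking the json_schema field up front
    instead of relying on an assertion deeper in the request path.
    """
    if response_format is None:
        return 200, "ok"
    if not isinstance(response_format, dict):
        return 400, "response_format must be an object"
    if response_format.get("type") == "json_schema":
        # A missing or non-object json_schema field is a client error (400),
        # not a server fault (500).
        if not isinstance(response_format.get("json_schema"), dict):
            return 400, "json_schema is required when type is json_schema"
    return 200, "ok"


# Malformed request: the type is json_schema but the field is missing.
print(validate_response_format({"type": "json_schema"}))
# Well-formed request with a hypothetical schema.
print(validate_response_format(
    {"type": "json_schema", "json_schema": {"name": "answer", "schema": {"type": "object"}}}
))
```

Checking the field explicitly lets the server attribute the failure to the request, so clients can correct the payload instead of retrying against an apparent server fault.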

Large Mixture of Experts models crashed due to integer overflow: An int32 overflow in fused MoE stride computation caused large Mixture of Experts models to crash or produce silent data corruption. The stride computation uses overflow-safe arithmetic.
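To see how a 32-bit offset can go wrong for large models, the sketch below simulates signed 32-bit wraparound in plain Python. The tensor sizes and the helper wrap_int32 are hypothetical and chosen only to push the offset past the int32 range. The release note does not describe the kernel change in detail, so treat the 64-bit comparison as one common way to make such arithmetic overflow-safe, not as the actual fix.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Interpret an integer as a signed 32-bit value (two's complement wraparound)."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


# Hypothetical sizes for illustration; real model shapes differ.
hidden_size = 8192
intermediate_size = 4096
expert_index = 100

# Elements per expert weight matrix (the per-expert stride).
stride = hidden_size * intermediate_size

# Offset of the selected expert's weights.
offset_64 = expert_index * stride      # exact value (Python ints do not overflow)
offset_32 = wrap_int32(offset_64)      # what 32-bit arithmetic would yield

print("exact offset:", offset_64)
print("fits in int32:", offset_64 <= INT32_MAX)
print("int32 result:", offset_32)
```

With these sizes the exact offset exceeds the int32 range, so the wrapped value is negative and would point at the wrong memory. That is consistent with the crash or silent corruption described above.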

Cascade attention caused numerical instability: Cascade attention was enabled by default and caused numerical instability in some workloads, resulting in unreliable model outputs. Cascade attention is disabled by default to match the upstream vLLM configuration.

2.3. Enhancements

BART encoder-decoder model support for CUDA AI accelerators: The BART plugin enables inference serving for BART-based summarization and translation models on CUDA AI accelerators.

Llama-Nemotron embedding model support: The llama-nemotron-embed-1b-v2 embedding model is supported for inference serving.

Custom encoder support for classification models: Classification models can use custom encoders for improved inference performance.
