Chapter 1. Red Hat Enterprise Linux AI 1.3 release notes
RHEL AI provides organizations with a process to develop enterprise applications on open source Large Language Models (LLMs).
1.1. About this release
Red Hat Enterprise Linux AI version 1.3 includes various features for Large Language Model (LLM) fine-tuning on the Red Hat and IBM produced Granite model. Customizing a model with the RHEL AI workflow consists of the following steps:
- Install and launch a RHEL 9.4 instance with the InstructLab tooling.
- Host information in a Git repository and interact with a Git-based taxonomy of the knowledge you want a model to learn.
- Run the end-to-end workflow of synthetic data generation (SDG), multi-phase training, and benchmark evaluation.
- Serve and chat with the newly fine-tuned LLM.
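The workflow above maps roughly onto the `ilab` subcommands in the following dry-run sketch. It only prints the steps and does not execute them. The subcommands appear without flags, which is a simplification: real runs need options such as model paths, a training strategy, and a benchmark selection, so treat this as an outline rather than a working pipeline.

```shell
# Dry-run outline of the RHEL AI workflow: prints each step, runs nothing.
# Flags are intentionally omitted; consult the product documentation for
# the options each subcommand requires.
workflow_steps() {
    cat <<'EOF'
ilab config init
ilab model download
ilab data generate
ilab model train
ilab model evaluate
ilab model serve
ilab model chat
EOF
}

# Number the steps as they are printed.
workflow_steps | {
    n=0
    while IFS= read -r step; do
        n=$((n + 1))
        printf 'step %d: %s\n' "$n" "$step"
    done
}
```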
1.2. Features and Enhancements
Red Hat Enterprise Linux AI version 1.3 includes various features for Large Language Model (LLM) fine-tuning.
1.2.1. Installing
Red Hat Enterprise Linux AI is installable as a bootable image. This image contains various tooling for interacting with RHEL AI, including Red Hat Enterprise Linux 9.4, Python version 3.11, and InstructLab tools for model fine-tuning. For more information about installing Red Hat Enterprise Linux AI, see Installation overview and the "Installation feature tracker".
1.2.1.1. Installing RHEL AI on Google Cloud Platform (GCP) (Generally Available)
In RHEL AI version 1.3, installing and deploying Red Hat Enterprise Linux AI on Google Cloud Platform (GCP) instances is now generally available. For the documentation on installing Red Hat Enterprise Linux AI on GCP, see Installing on Google Cloud Platform (GCP).
RHEL AI currently supports 8xA100 and 8xH100 accelerators on GCP instances for the full end-to-end workflow. You can also serve LLMs provided by Red Hat for inferencing on GCP instances. For more details on the RHEL AI hardware requirements for GCP, see Red Hat Enterprise Linux AI hardware requirements.
1.2.1.2. Red Hat Enterprise Linux AI images on Azure and AWS marketplace
In RHEL AI 1.3, you can download the RHEL AI image from the Azure and AWS marketplaces. You can access the image and deploy a RHEL AI AWS AMI or a RHEL AI Azure VHD.
1.2.1.3. Installing RHEL AI on Intel Gaudi3 accelerators
Red Hat Enterprise Linux AI version 1.3 supports using RHEL AI on systems with Intel Gaudi3 accelerators. You can use the Intel ISO image and deploy Red Hat Enterprise Linux AI on a bare metal system with Gaudi3 accelerators. You can access the Intel 1.3 ISO image from the Red Hat downloads page. For more information on bare metal installations, see the Installing on bare metal documentation. For more information about RHEL AI hardware requirements for Intel, see Red Hat Enterprise Linux AI hardware requirements.
1.2.2. Building your RHEL AI environment
After installing Red Hat Enterprise Linux AI, you can set up your RHEL AI environment with the InstructLab tools.
1.2.2.1. Initializing InstructLab
You can initialize and set up your RHEL AI environment by running the `ilab config init` command. This command creates the necessary configurations for interacting with RHEL AI and fine-tuning models. It also creates proper directories for your data files. For more information about initializing InstructLab, see the Initialize InstructLab documentation.
1.2.2.1.1. InstructLab system profiles
Red Hat Enterprise Linux AI version 1.3 replaces the former training profiles with system profiles. This process adds the proper parameters to the `config.yaml` for your selected system. You can select your hardware vendor and accelerators directly in the CLI. For more information on system profiles, see the Initialize InstructLab documentation.
RHEL AI version 1.3 also continues the auto-detection feature, where the CLI can detect your machine's hardware and select the profile that matches your system.
1.2.2.2. Downloading Large Language Models
You can download various Large Language Models (LLMs) provided by Red Hat to your RHEL AI machine or instance. You can download these models from a Red Hat registry after creating and logging in to your Red Hat registry account. For more information about the supported RHEL AI LLMs, see the Downloading models documentation and the "Large Language Models (LLMs) technology preview status".
1.2.2.2.1. `granite-8b-lab-v1` and `granite-8b-starter-v1` LLMs for NVIDIA and AMD
Red Hat Enterprise Linux AI version 1.3 now supports the `granite-8b-redhat-lab` inference serving model and `granite-8b-starter` base model for systems with NVIDIA and AMD accelerators. For the full support matrix of RHEL AI LLMs, see the Downloading models documentation.
1.2.2.2.2. `granite-8b-lab-v2` LLM (Technology preview)
Red Hat Enterprise Linux AI version 1.3 offers version 2 of the `granite-8b-lab` LLM as a technology preview. This model is supported for inference-only use cases. For the full support matrix of RHEL AI LLMs, see the Downloading models documentation.
1.2.2.3. Serving and chatting with models
Red Hat Enterprise Linux AI version 1.3 allows you to run a vLLM inference server on various LLMs. The vLLM tool is a memory-efficient inference and serving engine library for LLMs that is included in the RHEL AI image. For more information about serving and chatting with models, see Serving and chatting with the models documentation.
1.2.3. Creating skills and knowledge YAML files
On Red Hat Enterprise Linux AI, you can customize your taxonomy tree using custom YAML files so a model can learn domain-specific information. You host your knowledge data in a Git repository and fine-tune a model with that data. For detailed documentation on how to create a knowledge markdown and YAML file, see Customizing your taxonomy tree.
1.2.3.1. PDF document support on Red Hat Enterprise Linux AI
Red Hat Enterprise Linux AI now supports various documentation types for SDG consumption. You can now specify PDF documents in your `qna.yaml` files, along with the already supported markdown format. For more information on creating custom `qna.yaml` files with PDF documents, see Customizing your taxonomy tree.
1.2.3.2. Creating custom skills on RHEL AI
Red Hat Enterprise Linux AI version 1.3 now supports the ability to generate a custom LLM with skills. You can now create a skill `qna.yaml` file and train the granite student LLM to learn the skills and knowledge. Red Hat Enterprise Linux AI version 1.3 does not currently support training skills exclusively. You can run multi-phase training for skills and knowledge on RHEL AI version 1.3. For more information about creating custom skills, see Adding skills to your taxonomy tree.
1.2.4. Generating a custom LLM using RHEL AI
You can use Red Hat Enterprise Linux AI to customize a granite starter LLM with your domain-specific skills and knowledge. RHEL AI includes the LAB enhanced method of Synthetic Data Generation (SDG) and multi-phase training.
1.2.4.1. Synthetic Data Generation (SDG)
Red Hat Enterprise Linux AI includes the LAB enhanced method of synthetic data generation (SDG). You can use the `qna.yaml` files with your own knowledge data to create hundreds of artificial datasets in the SDG process. For more information about running the SDG process, see Generating a new dataset with Synthetic data generation (SDG).
1.2.4.2. Training a model with your data
Red Hat Enterprise Linux AI includes the LAB enhanced method of multi-phase training, a fine-tuning strategy in which datasets are trained and evaluated in multiple phases to create the best possible model. For more details on multi-phase training, see Training your data on the model.
1.2.4.3. Benchmark evaluation
Red Hat Enterprise Linux AI includes the ability to run benchmark evaluations on the newly trained models. On your trained model, you can evaluate how well the model knows the knowledge or skills you added with the `MMLU_BRANCH` or `MT_BENCH_BRANCH` benchmark. For more details on benchmark evaluation, see Evaluating your new model.
1.3. Red Hat Enterprise Linux AI feature tracker
Some features in this release are currently in Technology Preview. These experimental features are not intended for production use. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
In the following tables, features are marked with the following statuses:
- Not Available
- Technology Preview
- General Availability
- Deprecated
- Removed
1.3.1. Installation feature tracker
Feature | 1.1 | 1.2 | 1.3 |
---|---|---|---|
Installing on bare metal | Generally available | Generally available | Generally available |
Installing on AWS | Generally available | Generally available | Generally available |
Installing on IBM Cloud | Generally available | Generally available | Generally available |
Installing on GCP | Not available | Technology preview | Generally available |
Installing on Azure | Not available | Generally available | Generally available |
1.3.2. Platform support feature tracker
Feature | 1.1 | 1.2 | 1.3 |
---|---|---|---|
Bare metal | Generally available | Generally available | Generally available |
AWS | Generally available | Generally available | Generally available |
IBM Cloud | Not available | Generally available | Generally available |
Google Cloud Platform | Not available | Technology preview | Generally available |
Azure | Not available | Generally available | Generally available |
Feature | 1.1 | 1.2 | 1.3 |
---|---|---|---|
Bare metal | Generally available | Generally available | Generally available |
AWS | Generally available | Generally available | Generally available |
IBM Cloud | Generally available | Generally available | Generally available |
Google Cloud Platform (GCP) | Not available | Technology preview | Generally available |
Azure | Not available | Generally available | Generally available |
Feature | 1.1 | 1.2 | 1.3 |
---|---|---|---|
AWS | Not available | Not available | Generally available |
Azure | Not available | Not available | Generally available |
1.4. Large Language Models feature status
1.4.1. RHEL AI version 1.3 hardware vendor LLM support
Feature | NVIDIA | AMD | Intel |
---|---|---|---|
| Deprecated | Deprecated | Technology preview |
| Deprecated | Deprecated | Technology preview |
| Generally available | Technology preview | Not available |
| Generally available | Technology preview | Not available |
| Technology preview | Technology preview | Not available |
| Technology preview | Technology preview | Technology preview |
| Technology preview | Technology preview | Technology preview |
| Generally available | Technology preview | Technology preview |
| Generally available | Technology preview | Technology preview |
1.5. Known Issues
Incorrect auto-detection on some machines with AMD accelerators
Fixed in RHEL AI 1.3.1
The 1.3 version of RHEL AI sometimes auto-detects the incorrect system profile on machines with AMD accelerators.
You can select the correct system profile on AMD with the following command:
$ ilab config init --profile ~/.local/share/instructlab/internal/system_profiles/amd/mi300x/<specified-profile>
AMD-smi is not usable upon installation
After installing Red Hat Enterprise Linux AI using the ISO image or upgrading to a system using the `bootc-amd-rhel9` container, the `amd-smi` tool does not work by default. To enable `amd-smi`, add the proper ROCm version to your user `PATH` variable with the following command:
$ export PATH="$PATH:/opt/rocm-6.1.2/bin"
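As a minimal sketch of the same workaround, the following snippet appends the ROCm directory only when it is not already on `PATH`, then reports whether `amd-smi` resolves. The helper name `add_to_path` and the `ROCM_BIN` variable are illustrative and not part of RHEL AI. The directory is the one given in the command above, so adjust it if your image ships a different ROCm version.

```shell
# Directory taken from the workaround above; change it if your ROCm version differs.
ROCM_BIN="/opt/rocm-6.1.2/bin"

# Append a directory to PATH unless it is already listed (hypothetical helper).
add_to_path() {
    case ":${PATH}:" in
        *":$1:"*) ;;                  # already present, nothing to do
        *) PATH="${PATH}:$1" ;;
    esac
    export PATH
}

add_to_path "${ROCM_BIN}"

# Report whether the tool can now be found.
if command -v amd-smi >/dev/null 2>&1; then
    echo "amd-smi found at $(command -v amd-smi)"
else
    echo "amd-smi not found; check that ${ROCM_BIN} exists on this system"
fi
```

To keep the change across sessions, the same `export` line can be added to a shell startup file such as `~/.bashrc`.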
Training fails with the default 4xL40s `max_batch_len` parameter
Fixed in RHEL AI 1.3.1
The default `max_batch_len` parameter in the 4xL40s configuration needs to be updated in order for training to run properly.
You can edit the `config.yaml` file with the following command:
$ ilab config edit
Edit the `config.yaml` file to match the following parameter:
train:
  max_batch_len: 15000
Incorrect auto-detection on some NVIDIA H100 or A100 systems
Fixed for H100 accelerators in RHEL AI 1.3.1
RHEL AI sometimes auto-detects the incorrect system profile on machines with H100 or A100 accelerators.
You can select the correct profile by re-initializing and passing the correct system profile.
$ ilab config init --profile <path-to-system-profile>
Example profile selection command
$ ilab config init --profile ~/.local/share/instructlab/internal/system_profiles/nvidia/h100/h100.yaml
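Before re-initializing, it can help to see which profile files are present on the machine. The sketch below lists them under the directory used in the commands above. It assumes profiles are stored as `.yaml` files, as in the H100 example. The function name `list_system_profiles` and the `PROFILE_ROOT` variable are illustrative only.

```shell
# Root directory taken from the profile paths shown above.
PROFILE_ROOT="${PROFILE_ROOT:-$HOME/.local/share/instructlab/internal/system_profiles}"

# Print every .yaml file under the given directory, sorted (hypothetical helper).
list_system_profiles() {
    if [ ! -d "$1" ]; then
        echo "No profile directory at $1" >&2
        return 1
    fi
    find "$1" -type f -name '*.yaml' | sort
}

# Tolerate a missing directory so the sketch also runs on unconfigured machines.
list_system_profiles "$PROFILE_ROOT" || true
```

Any path printed by the listing can then be passed to `ilab config init --profile`.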
Upgrading to a z-stream on AMD Bare metal and NVIDIA AWS systems
On RHEL AI, there is an issue in the upgrade process if you are upgrading to an AMD bare metal or NVIDIA AWS system. To successfully update to a RHEL AI z-stream on these systems, run the command for your system.
Bare metal with AMD accelerators
$ sudo bootc switch registry.redhat.io/rhelai1/bootc-amd-rhel9:1.3
AWS with NVIDIA accelerators
$ sudo bootc switch registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
Training fails with the default 8x A100 `max_batch_len` parameter
The default `max_batch_len` parameter in the 8x A100 configuration needs to be updated in order for training to run properly.
You can edit the `config.yaml` file with the following command:
$ ilab config edit
Edit the `config.yaml` file to match the following parameter:
train:
  max_batch_len: 10000
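After editing, you may want to confirm the value that was saved. The sketch below reads the first `max_batch_len` entry from a YAML file with `awk`. The default location `~/.config/instructlab/config.yaml` is an assumption and may differ on your system. The helper name `get_max_batch_len` is illustrative, and the text match is deliberately simple and is not a full YAML parser.

```shell
# Assumed default location of the InstructLab config; override CONFIG_FILE if yours differs.
CONFIG_FILE="${CONFIG_FILE:-$HOME/.config/instructlab/config.yaml}"

# Print the value of the first max_batch_len key found in the file (hypothetical helper).
get_max_batch_len() {
    awk -F': *' '/^[[:space:]]*max_batch_len:/ { print $2; exit }' "$1"
}

if [ -f "$CONFIG_FILE" ]; then
    echo "max_batch_len = $(get_max_batch_len "$CONFIG_FILE")"
else
    echo "No config file at $CONFIG_FILE"
fi
```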
Fabric manager does not always start with NVIDIA accelerators
After installing Red Hat Enterprise Linux AI on NVIDIA systems, you may see the following error when serving or training a model.
INFO 2024-11-26 22:18:04,244 instructlab.model.serve_backend:56: Using model '/var/home/cloud-user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_29117' with -1 gpu-layers and 4096 max context size.
INFO 2024-11-26 22:18:04,244 instructlab.model.serve_backend:88: '--gpus' flag used alongside '--tensor-parallel-size' in the vllm_args section of the config file. Using value of the --gpus
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
To resolve this issue, you need to run the following commands:
$ sudo systemctl stop nvidia-persistenced.service
$ sudo systemctl start nvidia-fabricmanager.service
$ sudo systemctl start nvidia-persistenced.service
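To check the state of the two services named in the workaround, a small status report such as the following sketch can be used. It relies on `systemctl is-active`, which prints the unit state. The `SYSTEMCTL` override and the function names are illustrative additions so the logic can be exercised on machines without these units.

```shell
# Command used to query unit state; defaults to the real systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print the state of one unit. is-active exits non-zero when the unit is
# not active, so the exit status is ignored and only the text is kept.
unit_state() {
    "$SYSTEMCTL" is-active "$1" 2>/dev/null || true
}

# Report both NVIDIA services mentioned in the workaround.
report_nvidia_units() {
    for unit in nvidia-fabricmanager.service nvidia-persistenced.service; do
        printf '%s: %s\n' "$unit" "$(unit_state "$unit")"
    done
}

# Only query when the command is available on this machine.
if command -v "$SYSTEMCTL" >/dev/null 2>&1; then
    report_nvidia_units
fi
```

If `nvidia-fabricmanager.service` does not report `active`, the restart sequence above applies.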
UI AMD technology preview installations
Red Hat Enterprise Linux AI version 1.3 currently does not support graphical installation with the technology preview AMD ISOs. Ensure that the `text` parameter in your `kickstart` file is configured for non-interactive installs. You can also pass `inst.text` in your shell during interactive installation to avoid an install-time crash.
SDG can fail on 4xL40s
For SDG to run on 4xL40s, you need to run SDG with the `--num-cpus` flag set to `4`.
$ ilab data generate --num-cpus 4
MMLU and MMLU_BRANCH on the `granite-8b-starter-v1` model
When evaluating a model built from the `granite-8b-starter-v1` LLM, there might be an error where vLLM does not start when running the MMLU and MMLU_BRANCH benchmarks.
If vLLM does not start, add the following parameter to the `serve` section of your `config.yaml` file:
serve:
  vllm:
    vllm_args: [--dtype bfloat16]
Kdump over nfs
Red Hat Enterprise Linux AI version 1.3 does not support kdump over nfs without configuration. To use this feature, run the following commands:
mkdir -p /var/lib/kdump/dracut.conf.d
echo "dracutmodules=''" > /var/lib/kdump/dracut.conf.d/99-kdump.conf
echo "omit_dracutmodules=''" >> /var/lib/kdump/dracut.conf.d/99-kdump.conf
echo "dracut_args --confdir /var/lib/kdump/dracut.conf.d --install /usr/lib/passwd --install /usr/lib/group" >> /etc/kdump.conf
systemctl restart kdump
1.6. Asynchronous z-stream updates
Security, bug fix, and enhancement updates for RHEL AI 1.3 are released as asynchronous z-stream updates.
This section will continue to be updated over time to provide notes on enhancements and bug fixes for future asynchronous z-stream releases of RHEL AI 1.3. Versioned asynchronous releases, for example with the form RHEL AI 1.3.z, will be detailed in subsections.
1.6.1. Red Hat Enterprise Linux AI 1.3.1 bug fixes
Issued: 18 December 2024
Red Hat Enterprise Linux AI release 1.3.1 is now available. This release includes bug fixes and product enhancements.
1.6.1.1. Bug fixes
- Previously, systems with x4 L40S accelerators contained the incorrect `max_batch_len` value in their configuration. As a result, running RHEL AI multi-phase training would fail on these systems. With this release, the correct `max_batch_len` value is in the configuration and training runs with no errors. (RHELAI-2398)
- Previously, systems with NVIDIA H100 accelerators were not automatically detected. As a result, the CLI prompted for a manual system selection instead of using hardware auto-detection. With this release, H100 accelerators are properly detected without manual confirmation. (RHELAI-2387)
- Previously, some machines with AMD accelerators were incorrectly auto-detected. As a result, a manual initialization was required to select the correct system profile with AMD accelerators. With this release, the systems with AMD hardware are now properly auto-detected. (RHELAI-2369)
1.6.1.2. Upgrade
To update your RHEL AI system to the most recent z-stream version, you must be logged in to the Red Hat registry and run the following command:
$ sudo bootc upgrade --apply
For more information on upgrading your RHEL AI system, see the Updating Red Hat Enterprise Linux AI documentation.
There is a known issue on 1.3.1 when upgrading to bare metal machines with AMD accelerators or AWS machines with NVIDIA accelerators. To upgrade to these systems, you need to run the following command instead of the standard upgrade process.
Bare metal with AMD accelerators
$ sudo bootc switch registry.redhat.io/rhelai1/bootc-amd-rhel9:1.3
AWS with NVIDIA accelerators
$ sudo bootc switch registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
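The two image references above follow the same naming pattern, which the following sketch captures in a small helper. It only prints the `bootc switch` command for review and does not run it. The function name `rhelai_image_ref` is illustrative, and only the `amd` and `nvidia` variants shown above are accepted.

```shell
# Build the image reference for a vendor and version, following the pattern
# of the two commands above (hypothetical helper).
rhelai_image_ref() {
    case "$1" in
        amd|nvidia)
            printf 'registry.redhat.io/rhelai1/bootc-%s-rhel9:%s\n' "$1" "$2"
            ;;
        *)
            echo "unsupported vendor: $1" >&2
            return 1
            ;;
    esac
}

# Print the commands instead of executing them.
echo "sudo bootc switch $(rhelai_image_ref amd 1.3)"
echo "sudo bootc switch $(rhelai_image_ref nvidia 1.3)"
```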