Chapter 4. Evaluating the model
To measure the improvements of your new model, you can compare its performance to the base model with the evaluation process. You can chat with the model directly for a qualitative check, or run the evaluation process in the RHEL AI CLI for quantitative results.
4.1. Evaluating your new model
To measure the improvements of your new model, you can compare its performance to the base model with the evaluation process. You can also chat with the model directly to qualitatively check whether the new model has learned the knowledge you created. For more quantitative results, run the evaluation process in the RHEL AI CLI with the following procedure.
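For example, a quick qualitative check might look like the following. This is only a sketch: it assumes the usual serve-then-chat workflow and the default phased training checkpoint path used later in this procedure, and the exact flags can vary between releases.

# In one terminal, serve the trained checkpoint
$ ilab model serve --model-path ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint>

# In a second terminal, start an interactive chat session and ask questions
# that target the knowledge or skills you added to your taxonomy
$ ilab model chat

If the new model answers your taxonomy questions noticeably better than the base model does, that is a useful early signal before you run the benchmarks below.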
Prerequisites
- You installed RHEL AI with the bootable container image.
- You created a custom qna.yaml file with skills or knowledge.
- You ran the synthetic data generation process.
- You trained the model using the RHEL AI training process.
- You downloaded the prometheus-8x7b-v2-0 judge model.
- You have root user access on your machine.
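Before you start, you can confirm that the judge model and other downloaded models are in place. This is a sketch: ilab model list prints the models that RHEL AI knows about, but the exact output layout can differ between releases.

$ ilab model list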
Procedure
- Navigate to your working Git branch where you created your qna.yaml file. You can now run the evaluation process on different benchmarks. Each command needs the path to the trained samples model to evaluate. You can access these checkpoints in your ~/.local/share/instructlab/checkpoints folder.

MMLU_BRANCH benchmark - If you want to measure how your knowledge contributions have impacted your model, run the mmlu_branch benchmark by executing the following command:

$ ilab model evaluate --benchmark mmlu_branch \
    --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint> \
    --tasks-dir ~/.local/share/instructlab/datasets/<node-dataset> \
    --base-model ~/.cache/instructlab/models/granite-7b-starter
where
- <checkpoint> - Specify the best scored checkpoint file generated during multi-phase training.
- <node-dataset> - Specify the node_datasets directory, in the ~/.local/share/instructlab/datasets/ directory, with the same timestamps as the .jsonl files used for training the model.

Example output
# KNOWLEDGE EVALUATION REPORT

## BASE MODEL (SCORE)
/home/user/.cache/instructlab/models/instructlab/granite-7b-lab/ (0.74/1.0)

## MODEL (SCORE)
/home/user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665 (0.78/1.0)

### IMPROVEMENTS (0.0 to 1.0):
1. tonsils: 0.74 -> 0.78 (+0.04)
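If you are not sure which values to substitute for <checkpoint> and <node-dataset>, you can list the candidates on disk. This is a sketch that assumes the default RHEL AI storage locations shown in the command above; the directory names on your system depend on when you ran training and data generation.

# Checkpoints produced by multi-phase training
$ ls ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/

# Generated datasets; look for the node_datasets directory whose timestamp
# matches the .jsonl files you trained on
$ ls ~/.local/share/instructlab/datasets/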
Optional: MT_BENCH_BRANCH benchmark - If you want to measure how your skills contributions have impacted your model, run the mt_bench_branch benchmark by executing the following command:

$ ilab model evaluate \
    --benchmark mt_bench_branch \
    --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint> \
    --judge-model ~/.cache/instructlab/models/prometheus-8x7b-v2-0 \
    --branch <worker-branch> \
    --base-branch <worker-branch> \
    --gpus <num-gpus>
where
- <checkpoint> - Specify the best scored checkpoint file generated during multi-phase training.
- <worker-branch> - Specify the branch you used when adding data to your taxonomy tree.
- <num-gpus> - Specify the number of GPUs you want to use for evaluation.

Note: Customizing skills is not currently supported on Red Hat Enterprise Linux AI version 1.2.
Example output
# SKILL EVALUATION REPORT

## BASE MODEL (SCORE)
/home/user/.cache/instructlab/models/instructlab/granite-7b-lab (5.78/10.0)

## MODEL (SCORE)
/home/user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665 (6.00/10.0)

### IMPROVEMENTS (0.0 to 10.0):
1. foundational_skills/reasoning/linguistics_reasoning/object_identification/qna.yaml: 4.0 -> 6.67 (+2.67)
2. foundational_skills/reasoning/theory_of_mind/qna.yaml: 3.12 -> 4.0 (+0.88)
3. foundational_skills/reasoning/linguistics_reasoning/logical_sequence_of_words/qna.yaml: 9.33 -> 10.0 (+0.67)
4. foundational_skills/reasoning/logical_reasoning/tabular/qna.yaml: 5.67 -> 6.33 (+0.67)
5. foundational_skills/reasoning/common_sense_reasoning/qna.yaml: 1.67 -> 2.33 (+0.67)
6. foundational_skills/reasoning/logical_reasoning/causal/qna.yaml: 5.67 -> 6.0 (+0.33)
7. foundational_skills/reasoning/logical_reasoning/general/qna.yaml: 6.6 -> 6.8 (+0.2)
8. compositional_skills/writing/grounded/editing/content/qna.yaml: 6.8 -> 7.0 (+0.2)
9. compositional_skills/general/synonyms/qna.yaml: 4.5 -> 4.67 (+0.17)

### REGRESSIONS (0.0 to 10.0):
1. foundational_skills/reasoning/unconventional_reasoning/lower_score_wins/qna.yaml: 5.67 -> 4.0 (-1.67)
2. foundational_skills/reasoning/mathematical_reasoning/qna.yaml: 7.33 -> 6.0 (-1.33)
3. foundational_skills/reasoning/temporal_reasoning/qna.yaml: 5.67 -> 4.67 (-1.0)

### NO CHANGE (0.0 to 10.0):
1. foundational_skills/reasoning/linguistics_reasoning/odd_one_out/qna.yaml (9.33)
2. compositional_skills/grounded/linguistics/inclusion/qna.yaml (6.5)
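If you do not remember which branch name to pass for <worker-branch>, you can check it with standard Git commands. This sketch assumes your taxonomy clone lives in the default ~/.local/share/instructlab/taxonomy location; adjust the path if you cloned it elsewhere.

# Show the branch that contains your qna.yaml changes
$ cd ~/.local/share/instructlab/taxonomy
$ git branch --show-current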
Optional: You can manually evaluate each checkpoint using the MMLU and MT_BENCH benchmarks. You can evaluate any model against the standardized set of knowledge or skills, allowing you to compare the scores of your own model against other LLMs. If you did not run multi-phase training, you can use this process to evaluate a model produced by single-phase training.
MMLU - If you want to see the evaluation score of your new model against a standardized set of knowledge data, run the mmlu benchmark by executing the following command:

$ ilab model evaluate --benchmark mmlu \
    --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint>
where
- <checkpoint> - Specify one of the checkpoint files generated during multi-phase training.
Example output
# KNOWLEDGE EVALUATION REPORT

## MODEL (SCORE)
/home/user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665

### SCORES (0.0 to 1.0):
mmlu_abstract_algebra - 0.31
mmlu_anatomy - 0.46
mmlu_astronomy - 0.52
mmlu_business_ethics - 0.55
mmlu_clinical_knowledge - 0.57
mmlu_college_biology - 0.56
mmlu_college_chemistry - 0.38
mmlu_college_computer_science - 0.46
...
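If you want to score every checkpoint rather than a single one, a simple shell loop works. This is only a sketch: it assumes the default hf_format checkpoint directory shown above and that each samples_* entry in it is a complete checkpoint.

# Run the MMLU benchmark against every checkpoint from multi-phase training
$ for ckpt in ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_*; do ilab model evaluate --benchmark mmlu --model "$ckpt"; done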
MT_BENCH - If you want to see the evaluation score of your new model against a standardized set of skills, run the mt_bench benchmark by executing the following command:

$ ilab model evaluate --benchmark mt_bench \
    --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint>
where
- <checkpoint> - Specify one of the checkpoint files generated during multi-phase training.
Example output
# SKILL EVALUATION REPORT

## MODEL (SCORE)
/home/user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665 (7.27/10.0)

### TURN ONE (0.0 to 10.0):
7.48

### TURN TWO (0.0 to 10.0):
7.05
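Because MMLU and MT_BENCH score a model against standardized data, you can run the same command against another model to get a comparison point. The sketch below assumes the granite-7b-lab base model is present in the default ~/.cache/instructlab/models directory; substitute the path of whichever downloaded model you want to compare against.

$ ilab model evaluate --benchmark mt_bench \
    --model ~/.cache/instructlab/models/granite-7b-lab

You can then compare the TURN ONE and TURN TWO scores in the two reports to see how your trained checkpoint stacks up against the base model.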