Chapter 3. Evaluating the model


To measure the improvements in your new model, you can compare its performance to the base model with the evaluation process. You can also chat with the model directly to qualitatively check whether it has learned the knowledge or skills you created. For more quantitative results, run the evaluation process in the RHEL AI CLI.
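
If you want a quick qualitative check before running the benchmarks, you can serve a trained checkpoint and chat with it. The following commands are a minimal sketch; they assume the default RHEL AI checkpoint location used in the procedure below, and <checkpoint> is a placeholder for one of your trained checkpoint directories:

  $ ilab model serve --model-path ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint>

Then, in a separate terminal, start a chat session against the served model:

  $ ilab model chat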

3.1. Evaluating your new model

You can run the evaluation process in the RHEL AI CLI with the following procedure.

Prerequisites

  • You installed RHEL AI with the bootable container image.
  • You created a custom qna.yaml file with skills or knowledge.
  • You ran the synthetic data generation process.
  • You trained the model using the RHEL AI training process.
  • You downloaded the prometheus-8x7b-v2-0 judge model.
  • You have root user access on your machine.

Procedure

  1. Navigate to your working Git branch where you created your qna.yaml file.
  2. You can now run the evaluation process on different benchmarks. Each command needs the path to the trained model checkpoint that you want to evaluate. You can access these checkpoints in your ~/.local/share/instructlab/checkpoints folder. A short example of listing the available checkpoints follows this procedure.

    1. MMLU_BRANCH benchmark - If you want to measure how your knowledge contributions have impacted your model, run the mmlu_branch benchmark by executing the following command:

      $ ilab model evaluate --benchmark mmlu_branch \
          --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint> \
          --tasks-dir ~/.local/share/instructlab/datasets/<generation-date>/<node-dataset> \
          --base-model ~/.cache/instructlab/models/granite-7b-starter

      where

      <checkpoint>
      Specify the best scored checkpoint file generated during multi-phase training.
      <node-dataset>
      Specify the node_datasets directory that was generated during SDG, in the ~/.local/share/instructlab/datasets/<generation-date> directory, with the same timestamp as the .jsonl files used for training the model.

      Example output

      # KNOWLEDGE EVALUATION REPORT
      
      ## BASE MODEL (SCORE)
      /home/user/.cache/instructlab/models/instructlab/granite-7b-lab/ (0.74/1.0)
      
      ## MODEL (SCORE)
      /home/user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665 (0.78/1.0)
      
      ### IMPROVEMENTS (0.0 to 1.0):
      1. tonsils: 0.74 -> 0.78 (+0.04)

    2. MT_BENCH_BRANCH benchmark - If you want to measure how your skills contributions have impacted your model, run the mt_bench_branch benchmark by executing the following command:

      $ ilab model evaluate \
          --benchmark mt_bench_branch \
          --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint> \
          --judge-model ~/.cache/instructlab/models/prometheus-8x7b-v2-0 \
          --branch <worker-branch> \
          --base-branch <worker-branch> \
          --gpus <num-gpus>

      where

      <checkpoint>
      Specify the best scored checkpoint file generated during multi-phase training.
      <worker-branch>
      Specify the branch you used when adding data to your taxonomy tree.
      <num-gpus>
      Specify the number of GPUs you want to use for evaluation.

      Example output

      # SKILL EVALUATION REPORT
      
      ## BASE MODEL (SCORE)
      /home/user/.cache/instructlab/models/instructlab/granite-7b-lab (5.78/10.0)
      
      ## MODEL (SCORE)
      /home/user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665 (6.00/10.0)
      
      ### IMPROVEMENTS (0.0 to 10.0):
      1. foundational_skills/reasoning/linguistics_reasoning/object_identification/qna.yaml: 4.0 -> 6.67 (+2.67)
      2. foundational_skills/reasoning/theory_of_mind/qna.yaml: 3.12 -> 4.0 (+0.88)
      3. foundational_skills/reasoning/linguistics_reasoning/logical_sequence_of_words/qna.yaml: 9.33 -> 10.0 (+0.67)
      4. foundational_skills/reasoning/logical_reasoning/tabular/qna.yaml: 5.67 -> 6.33 (+0.67)
      5. foundational_skills/reasoning/common_sense_reasoning/qna.yaml: 1.67 -> 2.33 (+0.67)
      6. foundational_skills/reasoning/logical_reasoning/causal/qna.yaml: 5.67 -> 6.0 (+0.33)
      7. foundational_skills/reasoning/logical_reasoning/general/qna.yaml: 6.6 -> 6.8 (+0.2)
      8. compositional_skills/writing/grounded/editing/content/qna.yaml: 6.8 -> 7.0 (+0.2)
      9. compositional_skills/general/synonyms/qna.yaml: 4.5 -> 4.67 (+0.17)
      
      ### REGRESSIONS (0.0 to 10.0):
      1. foundational_skills/reasoning/unconventional_reasoning/lower_score_wins/qna.yaml: 5.67 -> 4.0 (-1.67)
      2. foundational_skills/reasoning/mathematical_reasoning/qna.yaml: 7.33 -> 6.0 (-1.33)
      3. foundational_skills/reasoning/temporal_reasoning/qna.yaml: 5.67 -> 4.67 (-1.0)
      
      ### NO CHANGE (0.0 to 10.0):
      1. foundational_skills/reasoning/linguistics_reasoning/odd_one_out/qna.yaml (9.33)
      2. compositional_skills/grounded/linguistics/inclusion/qna.yaml (6.5)

  3. Optional: You can manually evaluate each checkpoint using the MMLU and MT_BENCH benchmarks. These benchmarks evaluate any model against a standardized set of knowledge or skills, so you can compare the scores of your own model against other LLMs.

    1. MMLU - If you want to see the evaluation score of your new model against a standardized set of knowledge data, set the mmlu benchmark by running the following command:

      $ ilab model evaluate --benchmark mmlu --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint> --skip-server

      where

      <checkpoint>
      Specify one of the checkpoint files generated during multi-phase training.

      Example output

      # KNOWLEDGE EVALUATION REPORT
      
      ## MODEL (SCORE)
      /home/user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665
      
      ### SCORES (0.0 to 1.0):
      mmlu_abstract_algebra - 0.31
      mmlu_anatomy - 0.46
      mmlu_astronomy - 0.52
      mmlu_business_ethics - 0.55
      mmlu_clinical_knowledge - 0.57
      mmlu_college_biology - 0.56
      mmlu_college_chemistry - 0.38
      mmlu_college_computer_science - 0.46
      ...

    2. MT_BENCH - If you want to see the evaluation score of your new model against a standardized set of skills, set the mt_bench benchmark by running the following command:

      $ ilab model evaluate --benchmark mt_bench --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint>

      where

      <checkpoint>
      Specify one of the checkpoint files generated during multi-phase training.

      Example output

      # SKILL EVALUATION REPORT
      
      ## MODEL (SCORE)
      /home/user/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665 (7.27/10.0)
      
      ### TURN ONE (0.0 to 10.0):
      7.48
      
      ### TURN TWO (0.0 to 10.0):
      7.05

3.1.1. Domain-Knowledge benchmark evaluation

Important

Domain-Knowledge benchmark evaluation is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

The current knowledge evaluation benchmarks in RHEL AI, MMLU and MMLU_BRANCH, evaluate models on their ability to answer multiple-choice questions. These benchmarks cannot give a model partial credit for answers that are only moderately correct or incorrect.

The Domain-Knowledge benchmark (DK-bench) evaluation lets you bring your own evaluation questions and score the model's answers on a graded scale.

Each response given is compared to the reference answer and graded on the following scale by the judge model:

Table 3.1. Domain-Knowledge benchmark rubric

Score  Criteria
1      The response is entirely incorrect, irrelevant, or does not align with the reference in any meaningful way.
2      The response partially matches the reference but contains major errors, significant omissions, or irrelevant information.
3      The response aligns with the reference overall but lacks sufficient detail, clarity, or contains minor inaccuracies.
4      The response is mostly accurate, aligns closely with the reference, and contains only minor issues or omissions.
5      The response is fully accurate, completely aligns with the reference, and is clear, thorough, and detailed.

Prerequisites

  • You installed RHEL AI with the bootable container image.
  • You trained the model using the RHEL AI training process.
  • You downloaded the prometheus-8x7b-v2-0 judge model.
  • You have root user access on your machine.

Procedure

  1. To use custom evaluation, create a .jsonl file that contains every question you want the model to answer and be evaluated on. A short example of creating this file from the shell follows this procedure.

    Example DK-bench jsonl file

    {"user_input":"What is the capital of Canada?","reference":"The capital of Canada is Ottawa."}

    where

    user_input
    Contains the question for the model.
    reference
    Contains the answer to the question.
  2. To run the DK-bench benchmark with your custom evaluation questions, run the following command:

    $ ilab model evaluate --benchmark dk_bench --input-questions <path-to-jsonl-file> --model <path-to-model>

    where

    <path-to-jsonl-file>
    Specify the path to your .jsonl file that contains your questions and answers.
    <path-to-model>
    Specify the path to the model you want to evaluate.

    Example command

    $ ilab model evaluate --benchmark dk_bench --input-questions /home/user/path/to/questions.jsonl --model ~/.cache/instructlab/models/instructlab/granite-7b-lab

    Example output of Domain-Knowledge benchmark evaluation

    # DK-BENCH REPORT
    
    ## MODEL: granite-7b-lab
    
    Question #1:     5/5
    Question #2:     5/5
    Question #3:     5/5
    Question #4:     5/5
    Question #5:     2/5
    Question #6:     3/5
    Question #7:     2/5
    Question #8:     3/5
    Question #9:     5/5
    Question #10:     5/5
    ----------------------------
    Average Score:   4.00/5
    Total Score:     40/50
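
For reference, one way to create the questions file is directly from the shell. The following commands are a minimal sketch; the file name ~/questions.jsonl and the second question are illustrative only, and each line must be a single JSON object with the user_input and reference keys described in step 1:

  $ echo '{"user_input":"What is the capital of Canada?","reference":"The capital of Canada is Ottawa."}' > ~/questions.jsonl
  $ echo '{"user_input":"What is the capital of France?","reference":"The capital of France is Paris."}' >> ~/questions.jsonl

You can then pass ~/questions.jsonl to the --input-questions option, as shown in the example command in step 2.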
