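6.6. LM-Eval scenarios

The following procedures outline example scenarios that can be useful for an LM-Eval setup.

6.6.1. Accessing Hugging Face models with an environment variable token

If the LMEvalJob needs to access a model on Hugging Face with an access token, you can set HF_TOKEN as one of the environment variables for the lm-eval container.

Prerequisites

You have logged in to Red Hat OpenShift AI.
Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.

Procedure

To start an evaluation job for a Hugging Face model, apply the following YAML file to your data science project through the CLI: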

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: huggingfacespace/model
  taskList:
    taskNames:
    - unfair_tos/
  logSamples: true
  pod:
    container:
      env:
      - name: HF_TOKEN
        value: "My HuggingFace token"

For example:

oc apply -f <yaml_file> -n <project_name>


(Optional) You can also create a secret to store the token, and then reference the key from a secretKeyRef object by using the following reference syntax:

env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: my-secret
        key: hf-token
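
One way to create the secret that the snippet above refers to is to write a Secret manifest and apply it. The following is a minimal sketch, assuming the secret name my-secret and the key hf-token from the snippet; the file name hf-token-secret.yaml and the token placeholder are hypothetical. It relies on the Kubernetes stringData field, so the token does not have to be base64-encoded by hand.

```shell
# Write a Secret manifest that matches the secretKeyRef above
# (name: my-secret, key: hf-token). The file name is a hypothetical choice.
cat > hf-token-secret.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: my-secret
type: Opaque
stringData:
  # Placeholder: replace with your real Hugging Face access token.
  hf-token: "<your_hugging_face_token>"
EOF

# Show the manifest that was written.
cat hf-token-secret.yaml

# Apply it to the project that runs the LMEvalJob (requires a cluster login):
#   oc apply -f hf-token-secret.yaml -n <project_name>
```

Keeping the token in a secret avoids storing it in plain text inside the LMEvalJob manifest.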


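6.6.2. Using a custom Unitxt card

You can run evaluations with a custom Unitxt card. To do so, include the custom Unitxt card in JSON format in the LMEvalJob YAML.

Prerequisites

You have logged in to Red Hat OpenShift AI.
Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.

Procedure

Pass the custom Unitxt card in JSON format: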

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - template: "templates.classification.multi_class.relation.default"
      card:
        custom: |
          {
            "__type__": "task_card",
            "loader": {
              "__type__": "load_hf",
              "path": "glue",
              "name": "wnli"
            },
            "preprocess_steps": [
              {
                "__type__": "split_random_mix",
                "mix": {
                  "train": "train[95%]",
                  "validation": "train[5%]",
                  "test": "validation"
                }
              },
              {
                "__type__": "rename",
                "field": "sentence1",
                "to_field": "text_a"
              },
              {
                "__type__": "rename",
                "field": "sentence2",
                "to_field": "text_b"
              },
              {
                "__type__": "map_instance_values",
                "mappers": {
                  "label": {
                    "0": "entailment",
                    "1": "not entailment"
                  }
                }
              },
              {
                "__type__": "set",
                "fields": {
                  "classes": [
                    "entailment",
                    "not entailment"
                  ]
                }
              },
              {
                "__type__": "set",
                "fields": {
                  "type_of_relation": "entailment"
                }
              },
              {
                "__type__": "set",
                "fields": {
                  "text_a_type": "premise"
                }
              },
              {
                "__type__": "set",
                "fields": {
                  "text_b_type": "hypothesis"
                }
              }
            ],
            "task": "tasks.classification.multi_class.relation",
            "templates": "templates.classification.multi_class.relation.all"
          }
  logSamples: true


In the custom card, specify the Hugging Face dataset loader:

"loader": {
  "__type__": "load_hf",
  "path": "glue",
  "name": "wnli"
},

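(Optional) You can use other Unitxt loaders (listed on the Unitxt website) together with the volumes and volumeMounts parameters to mount a dataset from a persistent volume. For example, if you use the LoadCSV Unitxt loader, mount the files into the container so that the dataset is accessible to the evaluation process.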

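6.6.3. Using PVCs as storage

To use a PVC as storage for LMEvalJob results, you can use either a managed PVC or an existing PVC. Managed PVCs are managed by the TrustyAI operator. Existing PVCs are created by the end user before the LMEvalJob is created.

Note

If both a managed PVC and an existing PVC are referenced in outputs, the TrustyAI operator defaults to the managed PVC.

Prerequisites

You have logged in to Red Hat OpenShift AI.
Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.

6.6.3.1. Managed PVCs

To create a managed PVC, specify its size. The managed PVC is named <job-name>-pvc and is available after the job finishes. When the LMEvalJob is deleted, the managed PVC is also deleted.

Procedure

Enter the following code: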

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  # other fields omitted ...
  outputs:
    pvcManaged:
      size: 5Gi


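Notes on the code

outputs is the section for specifying custom storage locations.
pvcManaged creates an operator-managed PVC.
size (compatible with standard PVC syntax) is the only supported value.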
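
As a small illustration of the naming rule for managed PVCs, the following sketch derives the claim name from the job name. The job name evaljob-sample comes from the example manifest; the commented oc command is one possible way to inspect the claim on a live cluster and is an assumption, not a step from the original procedure.

```shell
# Job name taken from metadata.name in the example LMEvalJob.
JOB_NAME="evaljob-sample"

# The managed PVC is named <job-name>-pvc.
PVC_NAME="${JOB_NAME}-pvc"
echo "$PVC_NAME"

# On a live cluster, the claim could then be inspected with:
#   oc get pvc "$PVC_NAME" -n <project_name>
```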

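6.6.3.2. Existing PVCs

To use an existing PVC, pass its name as a reference. The PVC must exist when you create the LMEvalJob. The PVC is not managed by the TrustyAI operator, so it remains available after you delete the LMEvalJob.

Procedure

Create a PVC. For example: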

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "my-pvc"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi


Reference the new PVC from the LMEvalJob:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  # other fields omitted ...
  outputs:
    pvcName: "my-pvc"


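6.6.4. Using a KServe InferenceService

To run an evaluation job on an InferenceService that is already deployed and running in your namespace, define your LMEvalJob CR, and then apply this CR in the same namespace as your model.

Note

The following example applies only to Hugging Face or vLLM model runtimes.

Prerequisites

You have logged in to Red Hat OpenShift AI.
Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.
You have a namespace that contains an InferenceService with a vLLM model. This example assumes that a vLLM model is already deployed in your cluster.
Your cluster has Domain Name System (DNS) configured.

Procedure

Define the LMEvalJob CR: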

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob
spec:
  model: local-completions
  taskList:
    taskNames:
      - mmlu
  logSamples: true
  batchSize: 1
  modelArgs:
    - name: model
      value: granite
    - name: base_url
      value: $ROUTE_TO_MODEL/v1/completions  # Route or service URL of the model
    - name: num_concurrent
      value: "1"
    - name: max_retries
      value: "3"
    - name: tokenized_requests
      value: "False"  # modelArgs values are strings, so the flag is quoted
    - name: tokenizer
      value: huggingfacespace/model
  pod:
    container:
      env:  # Same location as the HF_TOKEN example in section 6.6.1
        - name: OPENAI_TOKEN
          valueFrom:
            secretKeyRef:
              name: <secret-name>
              key: token


Apply this CR in the same namespace as your model.

Verification

A pod named evaljob starts in the model namespace. In the pod terminal, you can view the output by running tail -f output/stderr.log.

Notes on the code

base_url should be set to the route or service URL of your model. Make sure to include the /v1/completions endpoint in the URL.
env.valueFrom.secretKeyRef.name should point to a secret that contains a token that can authenticate to your model. secretKeyRef.name should be the name of the secret in the namespace, and secretKeyRef.key should point to the key of the token within the secret.

secretKeyRef.name can be set to the output of the following command:

oc get secrets -o custom-columns=SECRET:.metadata.name --no-headers | grep user-one-token

secretKeyRef.key is set to token.
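
The base_url note can be illustrated with a short sketch. The route host below is a hypothetical value. On a live cluster, the URL of a KServe InferenceService is usually reported in its status, which is mentioned here as an assumption and not as a step from the original procedure.

```shell
# Hypothetical route to the deployed model; replace with your own URL.
# On a live cluster it might be read with, for example:
#   oc get inferenceservice <name> -n <namespace> -o jsonpath='{.status.url}'
ROUTE_TO_MODEL="https://granite-test.apps.example.com/"

# Drop a trailing slash, if any, then append the completions endpoint.
BASE_URL="${ROUTE_TO_MODEL%/}/v1/completions"
echo "$BASE_URL"
```

The resulting value is what the base_url entry in modelArgs expects in place of $ROUTE_TO_MODEL/v1/completions.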

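6.6.5. Setting up LM-Eval S3 support

Learn how to set up S3 support for your LM-Eval service.

Prerequisites

You have logged in to Red Hat OpenShift AI.
Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.
You have a namespace that contains an S3-compatible storage service and bucket.
You have created an LMEvalJob that references the S3 bucket containing your model and dataset.
You have an S3 bucket that contains the model files and the dataset to evaluate.

Procedure

Create a Kubernetes Secret that contains your S3 connection details: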

apiVersion: v1
kind: Secret
metadata:
    name: "s3-secret"
    namespace: test
    labels:
        opendatahub.io/dashboard: "true"
        opendatahub.io/managed: "true"
    annotations:
        opendatahub.io/connection-type: s3
        openshift.io/display-name: "S3 Data Connection - LMEval"
data:
    AWS_ACCESS_KEY_ID: BASE64_ENCODED_ACCESS_KEY  # Replace with your key
    AWS_SECRET_ACCESS_KEY: BASE64_ENCODED_SECRET_KEY  # Replace with your key
    AWS_S3_BUCKET: BASE64_ENCODED_BUCKET_NAME  # Replace with your bucket name
    AWS_S3_ENDPOINT: BASE64_ENCODED_ENDPOINT  # Replace with your endpoint URL (for example,  https://s3.amazonaws.com)
    AWS_DEFAULT_REGION: BASE64_ENCODED_REGION  # Replace with your region
type: Opaque


Note

All values must be base64-encoded. For example: echo -n "my-bucket" | base64
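
To prepare several Secret values at once, a small helper such as the following could be used. This is a minimal sketch: the bucket, endpoint, and region values are hypothetical examples, and the encode function is introduced only for illustration. printf is used so that no trailing newline ends up in the encoded value.

```shell
# Hypothetical connection details; replace with your own values.
BUCKET="my-bucket"
ENDPOINT="https://s3.amazonaws.com"
REGION="us-east-1"

# Illustrative helper: base64-encode a string without a trailing newline
# in the input and without line wrapping in the output.
encode() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

ENC_BUCKET=$(encode "$BUCKET")
ENC_ENDPOINT=$(encode "$ENDPOINT")
ENC_REGION=$(encode "$REGION")

# Print lines that can be pasted under data: in the Secret manifest.
echo "AWS_S3_BUCKET: $ENC_BUCKET"
echo "AWS_S3_ENDPOINT: $ENC_ENDPOINT"
echo "AWS_DEFAULT_REGION: $ENC_REGION"
```

The access key ID and secret access key can be encoded the same way.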

Deploy the LMEvalJob CR that references the S3 bucket containing your model and dataset:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
    name: evaljob-sample
spec:
    allowOnline: false
    model: hf  # Model type (HuggingFace in this example)
    modelArgs:
        - name: pretrained
          value: /opt/app-root/src/hf_home/flan  # Path where model is mounted in container
    taskList:
        taskNames:
            - arc_easy  # The evaluation task to run
    logSamples: true
    offline:
        storage:
            s3:
                accessKeyId:
                    name: s3-secret
                    key: AWS_ACCESS_KEY_ID
                secretAccessKey:
                    name: s3-secret
                    key: AWS_SECRET_ACCESS_KEY
                bucket:
                    name: s3-secret
                    key: AWS_S3_BUCKET
                endpoint:
                    name: s3-secret
                    key: AWS_S3_ENDPOINT
                region:
                    name: s3-secret
                    key: AWS_DEFAULT_REGION
                path: ""  # Optional subfolder within bucket
                verifySSL: false


Important

The `LMEvalJob` copies all the files from the specified bucket and path. If your bucket contains many files and you only want to use a subset, set the `path` field to the specific sub-folder that contains the files that you require. For example, use `path: "my-models/"`.

Set up a secure connection by using SSL.

Create a ConfigMap object with your CA certificate:

apiVersion: v1
kind: ConfigMap
metadata:
  name: s3-ca-cert
  namespace: test
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"  # For injection
data: {}  # OpenShift will inject the service CA bundle
# Or add your custom CA:
# data:
#   ca.crt: |-
#     -----BEGIN CERTIFICATE-----
#     ...your CA certificate content...
#     -----END CERTIFICATE-----


Update the LMEvalJob to use SSL verification:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
    name: evaljob-sample
spec:
    # ... same as above ...
    offline:
        storage:
            s3:
                # ... same as above ...
                verifySSL: true  # Enable SSL verification
                caBundle:
                    name: s3-ca-cert  # ConfigMap name containing your CA
                    key: service-ca.crt  # Key in ConfigMap containing the certificate


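Verification

After you deploy the LMEvalJob, open a kubectl command line and enter the following command to check its status: kubectl logs -n test job/evaljob-sample
Use the command kubectl logs -n test job/<job-name> to view the logs and confirm that the job is working correctly.
Results are displayed in the logs after the evaluation completes.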

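6.6.6. Using LLM-as-a-Judge metrics with LM-Eval

You can use a large language model (LLM) to assess the quality of outputs from another LLM, an approach known as LLM-as-a-Judge (LLMaaJ).

You can use LLMaaJ to:

Evaluate work that has no clearly correct answer, such as creative writing.
Judge quality characteristics such as helpfulness, safety, and depth.
Augment traditional quantitative measures that are used to evaluate model performance (for example, ROUGE metrics).
Test specific quality aspects of your model output.

Follow the custom quality assessment example below to learn more about using your own metric criteria with LM-Eval to evaluate model responses.

This example uses Unitxt to define custom metrics and to see how a model (flan-t5-small) answers questions from MT-Bench, a standard benchmark. The Mistral-7B model applies custom evaluation criteria and instructions to rate the answers from 1 to 10 based on helpfulness, accuracy, and detail.

Prerequisites

You have logged in to Red Hat OpenShift AI.
You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.
You are familiar with how to use Unitxt.

You have set the following parameters:

Table 6.8. Parameters
Parameter	Description
Custom template	Tells the judge to assign a score between 1 and 10 in a standardized format, based on specific criteria.
`processors.extract_mt_bench_rating_judgment`	Pulls the numeric rating from the judge's response.
`formats.models.mistral.instruction`	Formats the prompts for the Mistral model.
Custom LLM-as-judge metric	Uses Mistral-7B with your custom instructions.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example: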
```
oc login <openshift_cluster_url> -u <admin_username> -p <password>
```

Update the TrustyAI configuration to allow online model access and code execution by running the following commands against the redhat-ods-applications namespace:

oc patch configmap trustyai-service-operator-config -n redhat-ods-applications  \
--type merge -p '{"metadata": {"annotations": {"opendatahub.io/managed": "false"}}}'

oc patch configmap trustyai-service-operator-config -n redhat-ods-applications \
--type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'


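Apply the following manifest by using the oc apply -f - command. The YAML content defines a custom evaluation job (LMEvalJob), the namespace, and the location of the model that you want to evaluate. The YAML contains the following instructions:
1. Which model to evaluate.
2. What data to use.
3. How to format inputs and outputs.
4. Which judge model to use.
5. How to extract and log results.
  Note
  You can also put the YAML manifest into a file by using a text editor and then apply it by using the oc apply -f file.yaml command.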

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
 name: custom-eval
 namespace: test
spec:
 allowOnline: true
 allowCodeExecution: true
 model: hf
 modelArgs:
   - name: pretrained
     value: google/flan-t5-small
 taskList:
   taskRecipes:
     - card:
         custom: |
           {
               "__type__": "task_card",
               "loader": {
                   "__type__": "load_hf",
                   "path": "OfirArviv/mt_bench_single_score_gpt4_judgement",
                   "split": "train"
               },
               "preprocess_steps": [
                   {
                       "__type__": "rename_splits",
                       "mapper": {
                           "train": "test"
                       }
                   },
                   {
                       "__type__": "filter_by_condition",
                       "values": {
                           "turn": 1
                       },
                       "condition": "eq"
                   },
                   {
                       "__type__": "filter_by_condition",
                       "values": {
                           "reference": "[]"
                       },
                       "condition": "eq"
                   },
                   {
                       "__type__": "rename",
                       "field_to_field": {
                           "model_input": "question",
                           "score": "rating",
                           "category": "group",
                           "model_output": "answer"
                       }
                   },
                   {
                       "__type__": "literal_eval",
                       "field": "question"
                   },
                   {
                       "__type__": "copy",
                       "field": "question/0",
                       "to_field": "question"
                   },
                   {
                       "__type__": "literal_eval",
                       "field": "answer"
                   },
                   {
                       "__type__": "copy",
                       "field": "answer/0",
                       "to_field": "answer"
                   }
               ],
               "task": "tasks.response_assessment.rating.single_turn",
               "templates": [
                   "templates.response_assessment.rating.mt_bench_single_turn"
               ]
           }
       template:
         ref: response_assessment.rating.mt_bench_single_turn
       format: formats.models.mistral.instruction
       metrics:
       - ref: llmaaj_metric
   custom:
     templates:
       - name: response_assessment.rating.mt_bench_single_turn
         value: |
           {
               "__type__": "input_output_template",
               "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n",
               "input_format": "[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]",
               "output_format": "[[{rating}]]",
               "postprocessors": [
                   "processors.extract_mt_bench_rating_judgment"
               ]
           }
     tasks:
       - name: response_assessment.rating.single_turn
         value: |
           {
               "__type__": "task",
               "input_fields": {
                   "question": "str",
                   "answer": "str"
               },
               "outputs": {
                   "rating": "float"
               },
               "metrics": [
                   "metrics.spearman"
               ]
           }
     metrics:
       - name: llmaaj_metric
         value: |
           {
               "__type__": "llm_as_judge",
               "inference_model": {
                   "__type__": "hf_pipeline_based_inference_engine",
                   "model_name": "mistralai/Mistral-7B-Instruct-v0.2",
                   "max_new_tokens": 256,
                   "use_fp16": true
               },
               "template": "templates.response_assessment.rating.mt_bench_single_turn",
               "task": "rating.single_turn",
               "format": "formats.models.mistral.instruction",
               "main_score": "mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn"
           }
 logSamples: true
 pod:
   container:
     env:
       - name: HF_TOKEN
         valueFrom:
           secretKeyRef:
             name: hf-token-secret
             key: token
     resources:
       limits:
         cpu: '2'
         memory: 16Gi


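Verification

The processor extracts the numeric rating from the judge's natural-language response. The final result is available as part of the LMEvalJob custom resource (CR).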
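
To illustrate what extracting the rating means, the following sketch pulls the number out of a judge response that follows the "[[rating]]" format required by the template instruction. It is not the implementation of processors.extract_mt_bench_rating_judgment; the sample response and the sed expression are assumptions made for illustration only.

```shell
# Hypothetical judge response that follows the "Rating: [[N]]" format.
JUDGE_RESPONSE='The answer is relevant and mostly accurate. Rating: [[7]]'

# Capture the digits between the double square brackets.
RATING=$(printf '%s\n' "$JUDGE_RESPONSE" | sed -n 's/.*\[\[\([0-9][0-9]*\)]].*/\1/p')
echo "$RATING"
```

A response that does not contain the bracketed rating produces an empty result here, which is why the template asks the judge to follow the format strictly.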
