Chapter 2. Evaluating large language models


A large language model (LLM) is a type of artificial intelligence (AI) program that is designed for natural language processing tasks, such as recognizing and generating text.

As a data scientist, you might want to monitor your large language models against a range of metrics to ensure the accuracy and quality of their output. You can assess capabilities such as summarization, language toxicity, and question-answering accuracy to inform and improve your model parameters.

Red Hat OpenShift AI now offers Language Model Evaluation as a Service (LM-Eval-aaS), in a feature called LM-Eval. LM-Eval provides a unified framework to test generative language models on a wide range of evaluation tasks.

The following sections show you how to create an LMEvalJob custom resource (CR), which starts an evaluation job and generates an analysis of your model’s performance.

2.1. Setting up LM-Eval

LM-Eval is a service for evaluating large language models that is integrated into the TrustyAI Operator.

The service is built on top of two open-source projects:

  • LM Evaluation Harness, developed by EleutherAI, which provides a comprehensive framework for evaluating language models
  • Unitxt, a tool that enhances the evaluation process with additional functionalities

The following information explains how to create an LMEvalJob custom resource (CR) to initiate an evaluation job and get the results.

Global settings for LM-Eval

Configurable global settings for LM-Eval services are stored in the TrustyAI operator global ConfigMap, named trustyai-service-operator-config, which is located in the same namespace as the operator.

You can configure the following properties for LM-Eval:

Table 2.1. LM-Eval properties

  • lmes-detect-device
    Default: true/false
    Detects whether GPUs are available and assigns a value for the --device argument of LM Evaluation Harness. If GPUs are available, the value is cuda. If there are no GPUs available, the value is cpu.

  • lmes-pod-image
    Default: quay.io/trustyai/ta-lmes-job:latest
    The image for the LM-Eval job. The image contains the Python packages for LM Evaluation Harness and Unitxt.

  • lmes-driver-image
    Default: quay.io/trustyai/ta-lmes-driver:latest
    The image for the LM-Eval driver. For detailed information about the driver, see the cmd/lmes_driver directory.

  • lmes-image-pull-policy
    Default: Always
    The image-pulling policy when running the evaluation job.

  • lmes-default-batch-size
    Default: 8
    The default batch size when invoking the model inference API. The default batch size is only available for local models.

  • lmes-max-batch-size
    Default: 24
    The maximum batch size that users can specify in an evaluation job.

  • lmes-pod-checking-interval
    Default: 10s
    The interval at which the operator checks the job pod for an evaluation job.

After updating the settings in the ConfigMap, restart the operator to apply the new values.
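
For example, the following is a minimal sketch of updating one property and restarting the operator. It assumes that the operator runs in the redhat-ods-applications namespace and that the operator Deployment is named trustyai-service-operator-controller-manager; verify both names in your cluster before running the commands:

$ oc patch configmap trustyai-service-operator-config -n redhat-ods-applications \
  --type merge -p '{"data":{"lmes-default-batch-size":"4"}}'

$ oc rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications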

2.2. Enabling online access and remote code execution for LMEval jobs

By default, LMEval jobs do not allow internet access or remote code execution. An LMEvalJob might require access to external resources, for example task datasets and model tokenizers, which are usually hosted on Hugging Face. If you trust the source and have reviewed the content of these artifacts, you can configure an LMEvalJob to download them automatically.

Follow the steps in this section to enable online access and remote code execution for LMEval jobs, by using either the CLI or the web console. Enable one or both settings according to your needs.

To enable online access by using the CLI, set the allowOnline specification to true in the LMEvalJob custom resource (CR). To enable remote code execution, set the allowCodeExecution specification to true. You can use both modes at the same time.

Important

Enabling online access or code execution involves a security risk. Only use these configurations if you trust the source(s).

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.

Procedure

  1. Get the current DataScienceCluster resource, which is located in the redhat-ods-operator namespace:

    $ oc get datasciencecluster -n redhat-ods-operator

    Example output

    NAME                 AGE
    default-dsc          10d

  2. Enable online access and code execution for the cluster in the DataScienceCluster resource with the permitOnline and permitCodeExecution specifications. For example, create a file named allow-online-code-exec-dsc.yaml with the following contents:

    Example allow-online-code-exec-dsc.yaml resource enabling online access and remote code execution

    apiVersion: datasciencecluster.opendatahub.io/v1
    kind: DataScienceCluster
    metadata:
      name: default-dsc
    spec:
    # ...
      components:
        trustyai:
          managementState: Managed
          eval:
            lmeval:
               permitOnline: allow
               permitCodeExecution: allow
    # ...

    The permitCodeExecution and permitOnline settings are disabled by default with a value of deny. You must explicitly enable these settings in the DataScienceCluster resource before an LMEvalJob instance can enable internet access or run externally downloaded code.

  3. Apply the updated DataScienceCluster:

    $ oc apply -f allow-online-code-exec-dsc.yaml -n redhat-ods-operator
    1. Optional: Run the following command to check that the DataScienceCluster is in a healthy state:

      $ oc get datasciencecluster default-dsc

      Example output

      NAME          READY   REASON
      default-dsc   True

  4. For new LMEval jobs, define the job in a YAML file as shown in the following example. This configuration requests both internet access, with allowOnline: true, and permission for remote code execution, with allowCodeExecution: true:

    Example lmevaljob-with-online-code-exec.yaml

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: lmevaljob-with-online-code-exec
      namespace: <your_namespace>
    spec:
    # ...
      allowOnline: true
      allowCodeExecution: true
    # ...

    The allowOnline and allowCodeExecution settings are disabled by default with a value of false in the LMEvalJob CR.

  5. Deploy the LMEval Job:

    $ oc apply -f lmevaljob-with-online-code-exec.yaml -n <your_namespace>
Important

If you upgrade to version 2.25, some TrustyAI LMEvalJob CR configuration values might be overwritten. The new deployment prioritizes the values set in the version 2.25 DataScienceCluster resource. Existing LMEval jobs are unaffected. Verify that all DataScienceCluster values are explicitly defined and validated during installation.

Verification

  1. Run the following command to verify that the DataScienceCluster has the updated fields:

    $ oc get datasciencecluster default-dsc -n redhat-ods-operator -o "jsonpath={.spec.components.trustyai.eval.lmeval}"
  2. Run the following command to verify that the trustyai-dsc-config ConfigMap has the same flag values set in the DataScienceCluster.

    $ oc get configmaps trustyai-dsc-config -n redhat-ods-applications -o "jsonpath={.data}"

    Example output

    {"eval.lmeval.permitCodeExecution":"true","eval.lmeval.permitOnline":"true"}
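  3. Optional: Confirm the per-job settings on an existing LMEvalJob. For example, assuming a job named evaljob-sample in your project namespace:

    $ oc get lmevaljob evaljob-sample -n <your_namespace> \
      -o jsonpath='{.spec.allowOnline}{" "}{.spec.allowCodeExecution}'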

Follow these steps to enable online access (allowOnline) and remote code execution (allowCodeExecution) for LMEval jobs through the OpenShift AI web console.

Important

Enabling online access or code execution involves a security risk. Only use these configurations if you trust the source(s).

Prerequisites

  • You have cluster administrator privileges for your Red Hat OpenShift AI cluster.

Procedure

  1. In the OpenShift console, click Operators → Installed Operators.
  2. Search for the Red Hat OpenShift AI Operator, and then click the Operator name to open the Operator details page.
  3. Click the Data Science Cluster tab.
  4. Click the default instance name (for example, default-dsc) to open the instance details page.
  5. Click the YAML tab to show the instance specifications.
  6. In the spec:components:trustyai:eval:lmeval section, set the permitCodeExecution and permitOnline fields to a value of allow:

    spec:
      components:
        trustyai:
          managementState: Managed
          eval:
            lmeval:
               permitOnline: allow
               permitCodeExecution: allow
  7. Click Save.
  8. From the Project drop-down list, select the project that contains the LMEval job you are working with.
  9. From the Resources drop-down list, select the LMEvalJob instance that you are working with.
  10. Click Actions → Edit YAML.
  11. Ensure that allowOnline and allowCodeExecution are set to true to enable online access and code execution for this job in your LMEvalJob custom resource:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: example-lmeval
    spec:
      allowOnline: true
      allowCodeExecution: true
  12. Click Save.
Table 2.2. Configuration keys for LMEvalJob custom resource

  • spec.allowOnline
    Default: false
    Enables this job to access the internet, for example to download datasets or tokenizers.

  • spec.allowCodeExecution
    Default: false
    Allows this job to run code included with downloaded resources.

2.3. LM-Eval evaluation job

The LM-Eval service defines a new Custom Resource Definition (CRD) called LMEvalJob. An LMEvalJob object represents an evaluation job. LMEvalJob objects are monitored by the TrustyAI Kubernetes operator.

To run an evaluation job, create an LMEvalJob object with the following information: model, model arguments, task, and secret.

Note

For a list of TrustyAI-supported tasks, see LMEval task support.

After the LMEvalJob is created, the LM-Eval service runs the evaluation job. The status and results of the LMEvalJob object are updated as the information becomes available.

Note

Other TrustyAI features (such as bias and drift metrics) cannot be used with non-tabular models (including LLMs). Deploying the TrustyAIService custom resource (CR) in a namespace that contains non-tabular models (such as the namespace where an evaluation job is being executed) can cause errors within the TrustyAI service.

Sample LMEvalJob object

The sample LMEvalJob object contains the following features:

  • The google/flan-t5-base model from Hugging Face.
  • The dataset from the wnli card, a subset of the GLUE (General Language Understanding Evaluation) benchmark evaluation framework from Hugging Face. For more information about the wnli Unitxt card, see the Unitxt website.
  • The following default parameters for the multi_class.relation Unitxt task: f1_micro, f1_macro, and accuracy. This template can be found on the Unitxt website: click Catalog, then click Tasks and select Classification from the menu.

The following is an example of an LMEvalJob object:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true

After you apply the sample LMEvalJob, check its state by using the following command:

oc get lmevaljob evaljob-sample

Output similar to the following appears:

NAME             STATE
evaljob-sample   Running

Evaluation results are available when the state of the object changes to Complete. Both the model and dataset in this example are small. The evaluation job should finish within 10 minutes on a CPU-only node.

Use the following command to get the results:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results'

The command returns results similar to the following example:

{
  "tr_0": {
    "alias": "tr_0",
    "f1_micro,none": 0.5633802816901409,
    "f1_micro_stderr,none": "N/A",
    "accuracy,none": 0.5633802816901409,
    "accuracy_stderr,none": "N/A",
    "f1_macro,none": 0.36036036036036034,
    "f1_macro_stderr,none": "N/A"
  }
}

Notes on the results

  • The f1_micro, f1_macro, and accuracy scores are 0.56, 0.36, and 0.56.
  • The full results are stored in the .status.results of the LMEvalJob object as a JSON document.
  • The command above only retrieves the results field of the JSON document.
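  • If you only need a single metric, you can filter the same JSON with jq. For example, the following command prints only the micro-averaged F1 score from the results shown above:

    oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
      -o template --template={{.status.results}} | jq '.results.tr_0."f1_micro,none"'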
Note

The provided LMEvalJob uses a dataset from the wnli card, which is in Parquet format and not supported on s390x. To run on s390x, choose a task that uses a non-Parquet dataset.

2.4. LM-Eval evaluation job properties

The sample LMEvalJob object described in the previous section uses the following:

  • The google/flan-t5-base model.
  • The dataset from the wnli card, from the GLUE (General Language Understanding Evaluation) benchmark evaluation framework.
  • The multi_class.relation Unitxt task default parameters.

The following table lists each property in the LMEvalJob and its usage:

Table 2.3. LMEvalJob properties

Each entry lists the parameter name, followed by its description.

model

Specifies which model type or provider is evaluated. This field directly maps to the --model argument of the lm-evaluation-harness. The model types and providers that you can use include:

  • hf: HuggingFace models
  • openai-completions: OpenAI Completions API models
  • openai-chat-completions: OpenAI Chat Completions API models
  • local-completions and local-chat-completions: OpenAI API-compatible servers
  • textsynth: TextSynth APIs

modelArgs

A list of paired name and value arguments for the model type. Arguments vary by model provider. You can find further details in the models section of the LM Evaluation Harness library on GitHub. Below are examples for some providers:

  • hf: The model designation for the HuggingFace provider
  • local-completions: An OpenAI API-compatible server
  • local-chat-completions: An OpenAI API-compatible server
  • openai-completions: OpenAI Completions API models
  • openai-chat-completions: ChatCompletions API models
  • textsynth: TextSynth APIs

taskList.taskNames

Specifies a list of tasks supported by lm-evaluation-harness.

taskList.taskRecipes

Specifies the task using the Unitxt recipe format:

  • card: Use the name to specify a Unitxt card or ref to refer to a custom card:

    • name: Specifies a Unitxt card from the catalog section of the Unitxt website. Use the card ID as the value. For example, the ID of the wnli card is cards.wnli.
    • ref: Specifies the reference name of a custom card as defined in the custom section. If the dataset used by the custom card requires an API key from an environment variable or a persistent volume, configure the necessary resources in the pod field.
  • template: Specifies a Unitxt template from the Unitxt catalog. Use name to specify a Unitxt catalog template or ref to refer to a custom template:

    • name: Specifies a Unitxt template from the catalog of cards on the Unitxt website. Use the template’s ID as the value.
    • ref: Specifies the reference name of a custom template as defined in the custom section.
  • systemPrompt: Use name to specify a Unitxt catalog system prompt or ref to refer to a custom prompt:

    • name: Specifies a Unitxt system prompt from the catalog on the Unitxt website. Use the system prompt’s ID as the value.
    • ref: Specifies the reference name of a custom system prompt as defined in the custom section.
  • task (optional): Specifies a Unitxt task from the Unitxt catalog. Use the task ID as the value. A Unitxt card has a predefined task. Only specify a value for this if you want to run a different task.
  • metrics (optional): Specifies Unitxt metrics from the Unitxt catalog. Use the metric ID as the value. A Unitxt task has a set of predefined metrics. Only specify a set of metrics if you need different metrics.
  • format (optional): Specifies a Unitxt format from the Unitxt catalog. Use the format ID as the value.
  • loaderLimit (optional): Specifies the maximum number of instances per stream to be returned from the loader. You can use this parameter to reduce loading time in large datasets.
  • numDemos (optional): The number of few-shot examples to use.
  • demosPoolSize (optional): Size of the few-shot pool.

numFewShot

Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, do not use this field. Use numDemos under the taskRecipes instead.

limit

Sets a limit on the number of examples to evaluate instead of running the entire dataset. Accepts either an integer or a float between 0.0 and 1.0.

genArgs

Maps to the --gen_kwargs parameter for the lm-evaluation-harness. For more information, see the LM Evaluation Harness documentation on GitHub.

logSamples

If this flag is set, the model outputs and the text fed into the model are saved at the per-prompt level.

batchSize

Specifies the batch size for the evaluation as an integer. The auto:N batch size is not supported for API-based models; numeric batch sizes are used instead.

pod

Specifies extra information for the lm-eval job pod:

  • container: Specifies additional container settings for the lm-eval container.

    • env: Specifies environment variables. This parameter uses the EnvVar data structure of Kubernetes.
    • volumeMounts: Mounts the volumes into the lm-eval container.
    • resources: Specifies the resources for the lm-eval container.
  • volumes: Specifies the volume information for the lm-eval and other containers. This parameter uses the Volume data structure of Kubernetes.
  • sideCars: A list of containers that run along with the lm-eval container. This parameter uses the Container data structure of Kubernetes.

outputs

This parameter defines a custom output location to store the evaluation results. Only Persistent Volume Claims (PVCs) are supported.

outputs.pvcManaged

Creates an operator-managed PVC to store the job results. The PVC is named <job-name>-pvc and is owned by the LMEvalJob. The PVC remains available after the job finishes, but it is deleted when the LMEvalJob is deleted. Supports the following field:

  • size: The PVC size, compatible with standard PVC syntax (for example, 5Gi).

outputs.pvcName

Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job.

allowOnline

If this parameter is set to true, the LMEval job downloads artifacts as needed (for example, models, datasets or tokenizers). If set to false, artifacts are not downloaded and are pulled from local storage instead. This setting is disabled by default. If you want to enable allowOnline mode, you can deploy a new LMEvalJob CR with allowOnline set to true as long as the DataScienceCluster resource specification permitOnline is also set to true.

allowCodeExecution

If this parameter is set to true, the LMEval job runs the necessary code for preparing models or datasets. If set to false it does not run downloaded code. The default setting for this parameter is false. If you want to enable allowCodeExecution mode, you can deploy a new LMEvalJob CR with allowCodeExecution set to true as long as the DataScienceCluster resource specification permitCodeExecution is also set to true.

offline

Mounts a PVC as the local storage for models and datasets.

systemInstruction

(Optional) Sets the system instruction for all prompts passed to the evaluated model.

chatTemplate

Applies the specified chat template to prompts. Contains two fields:

  • enabled: If set to true, a chat template is used. If set to false, no template is used.
  • name: Uses the named template, if provided. If no name argument is provided, the default template for the model is used.
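
The following LMEvalJob is a minimal sketch that combines several of the optional fields described in the preceding table. It reuses the wnli card and template from the earlier sample; all other values are illustrative:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-options-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
      numDemos: 3          # number of few-shot examples
      demosPoolSize: 10    # size of the few-shot pool
      loaderLimit: 500     # maximum number of instances returned by the loader
  batchSize: 4
  logSamples: true
  systemInstruction: "You are a careful, concise assistant."
  outputs:
    pvcManaged:
      size: 1Gi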

You can choose to set up custom Unitxt cards, templates, or system prompts. Use the parameters set out in the Custom Unitxt parameters table in addition to the preceding table parameters to set customized Unitxt items:

Table 2.4. Custom Unitxt parameters

Each entry lists the parameter name, followed by its description.

taskList.custom

Defines one or more custom resources that are referenced in a task recipe. The following custom cards, templates, and system prompts are supported:

  • cards: Defines custom cards to use, each with a name and value field:

    • name: The name of this custom card that is referenced in the card.ref field of a task recipe.
    • value: A JSON string for a custom Unitxt card that contains the custom dataset. To compose a custom card, store it as a JSON file, and use the JSON content as the value. If the dataset used by the custom card needs an API key from an environment variable or a persistent volume, set up the corresponding resources under the pod field, as described in the LMEvalJob properties table.
  • templates: Defines custom templates to use, each with a name and value field:

    • name: The name of this custom template that is referenced in the template.ref field of a task recipe.
    • value: A JSON string for a custom Unitxt template. Store value as a JSON file and use the JSON content as the value of this field.
  • systemPrompts: Defines custom system prompts to use, each with a name and value field:

    • name: The name of this custom system prompt that is referenced in the systemPrompt.ref field of a task recipe.
    • value: A string for a custom Unitxt system prompt. You can see an overview of the different components that make up a prompt format, including the system prompt, on the Unitxt website.
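
The following partial taskList is a minimal sketch of referencing a custom system prompt from a task recipe. The prompt name and text are illustrative:

taskList:
  taskRecipes:
  - card:
      name: "cards.wnli"
    template: "templates.classification.multi_class.relation.default"
    systemPrompt:
      ref: my_system_prompt
  custom:
    systemPrompts:
    - name: my_system_prompt
      value: "You are a careful assistant. Answer as accurately and concisely as possible."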

2.5. Performing model evaluations in the dashboard

LM-Eval is a Language Model Evaluation as a Service (LM-Eval-aaS) feature integrated into the TrustyAI Operator. It offers a unified framework for testing generative language models across a wide variety of evaluation tasks. You can use LM-Eval through the Red Hat OpenShift AI dashboard or the OpenShift CLI (oc). These instructions are for using the dashboard.

Important

Model evaluation through the dashboard is currently available in Red Hat OpenShift AI 3.0 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

  • You have logged in to Red Hat OpenShift AI with administrator privileges.
  • You have enabled the TrustyAI component, as described in Enabling the TrustyAI component.
  • You have created a project in OpenShift AI.
  • You have deployed an LLM model in your project.
Note

By default, the Develop & train → Evaluations page is hidden from the dashboard navigation menu. To show the Develop & train → Evaluations page in the dashboard, go to the OdhDashboardConfig custom resource (CR) in Red Hat OpenShift AI and set the disableLMEval value to false. For more information about enabling dashboard configuration options, see Dashboard configuration options.
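
For example, the following is a minimal sketch of enabling the page from the CLI. It assumes the default dashboard configuration CR name odh-dashboard-config in the redhat-ods-applications namespace; verify both values in your cluster:

$ oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \
  --type merge -p '{"spec":{"dashboardConfig":{"disableLMEval":false}}}'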

Procedure

  1. In the dashboard, click Develop & train → Evaluations. The Evaluations page opens. It contains the following:

    • A Start evaluation run button. If you have not run any previous evaluations, only this button is displayed.
    • A list of evaluations you have previously run, if any exist.
    • A Project dropdown option that you can click to show the evaluations for one project instead of all projects.
    • A filter to sort your evaluations by model or evaluation name.

    The following table outlines the elements and functions of the evaluations list:

Table 2.5. Evaluations list components

  • Evaluation: The name of the evaluation.
  • Model: The model that was used in the evaluation.
  • Evaluated: The date and time when the evaluation was created.
  • Status: The status of your evaluation: running, completed, or failed.
  • More options icon: Click this icon to access the options to delete the evaluation or download the evaluation log in JSON format.

  2. From the Project dropdown menu, select the namespace of the project where you want to evaluate the model.
  3. Click the Start evaluation run button. The Model evaluation form is displayed.
  4. Fill in the details of the form. The model argument summary is displayed after you complete the form details:

    1. Model name: Select a model from all the deployed LLMs in your project.
    2. Evaluation name: Give your evaluation a unique name.
    3. Tasks: Choose one or more evaluation tasks against which to measure your LLM. The 100 most common evaluation tasks are supported.
    4. Model type: Choose the type of model based on the type of prompt-formatting you use:

      1. Local-completion: You assemble the entire prompt chain yourself. Use this when you want to evaluate models that take a plain text prompt and return a continuation.
      2. Local-chat-completion: The framework injects roles or templates automatically. Use this for models that simulate a conversation by taking a list of chat messages with roles such as user and assistant and replying appropriately.
    5. Security settings:

      1. Available online: Choose enable to allow your model to access the internet to download datasets.
      2. Trust remote code: Choose enable to allow your model to trust code from outside of the project namespace.

        Note

        The Security settings section is grayed out if the security option in global settings is set to active.

  5. Observe that a model argument summary is displayed as soon as you fill in the form details.
  6. Complete the tokenizer settings:

    1. Tokenized requests: If set to true, the evaluation requests are broken down into tokens. If set to false, the evaluation dataset remains as raw text.
    2. Tokenizer: Type the model’s tokenizer URL that is required for the evaluations.
  7. Click Evaluate. The screen returns to the model evaluation page of your project and your job is displayed in the evaluations list.

    Note
    • It can take time for your evaluation to complete, depending on factors including hardware support, model size, and the type of evaluation task(s). The status column reports the current status of the evaluation: completed, running, or failed.
    • If your evaluation fails, the evaluation pod logs in your cluster provide more information.
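
    For example, to locate and read the logs of a failed evaluation pod, you can run commands similar to the following. The exact pod name depends on the evaluation name that you entered in the form:

      $ oc get pods -n <project_namespace>
      $ oc logs <evaluation_pod_name> -n <project_namespace>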

2.6. LM-Eval scenarios

The following procedures outline example scenarios that can be useful for an LM-Eval setup.

2.6.1. Using a Hugging Face access token

If the LMEvalJob needs to access a model on Hugging Face that requires an access token, you can set HF_TOKEN as one of the environment variables for the lm-eval container.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the project where the models are deployed.

Procedure

  1. To start an evaluation job for a Hugging Face model, apply the following YAML file to your project through the CLI:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      model: hf
      modelArgs:
      - name: pretrained
        value: huggingfacespace/model
      taskList:
        taskNames:
        - unfair_tos
      logSamples: true
      pod:
        container:
          env:
          - name: HF_TOKEN
            value: "My HuggingFace token"

    For example:

    $ oc apply -f <yaml_file> -n <project_name>
  2. (Optional) You can also create a secret to store the token, and then reference the key from the secretKeyRef object by using the following syntax:

    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: my-secret
            key: hf-token
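
    For example, you could create the referenced secret with a command similar to the following, using the same secret name (my-secret) and key (hf-token) as in the snippet above:

    $ oc create secret generic my-secret \
      --from-literal=hf-token=<your_huggingface_token> \
      -n <project_name>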

2.6.2. Using a custom Unitxt card

You can run evaluations using custom Unitxt cards. To do this, include the custom Unitxt card in JSON format within the LMEvalJob YAML.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the project where the models are deployed.

Procedure

  1. Pass a custom Unitxt Card in JSON format:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      model: hf
      modelArgs:
      - name: pretrained
        value: google/flan-t5-base
      taskList:
        taskRecipes:
        - template: "templates.classification.multi_class.relation.default"
          card:
            custom: |
              {
                "__type__": "task_card",
                "loader": {
                  "__type__": "load_hf",
                  "path": "glue",
                  "name": "wnli"
                },
                "preprocess_steps": [
                  {
                    "__type__": "split_random_mix",
                    "mix": {
                      "train": "train[95%]",
                      "validation": "train[5%]",
                      "test": "validation"
                    }
                  },
                  {
                    "__type__": "rename",
                    "field": "sentence1",
                    "to_field": "text_a"
                  },
                  {
                    "__type__": "rename",
                    "field": "sentence2",
                    "to_field": "text_b"
                  },
                  {
                    "__type__": "map_instance_values",
                    "mappers": {
                      "label": {
                        "0": "entailment",
                        "1": "not entailment"
                      }
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "classes": [
                        "entailment",
                        "not entailment"
                      ]
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "type_of_relation": "entailment"
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "text_a_type": "premise"
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "text_b_type": "hypothesis"
                    }
                  }
                ],
                "task": "tasks.classification.multi_class.relation",
                "templates": "templates.classification.multi_class.relation.all"
              }
      logSamples: true
  2. Inside the custom card, specify the Hugging Face dataset loader:

    "loader": {
                  "__type__": "load_hf",
                  "path": "glue",
                  "name": "wnli"
                },
  3. (Optional) You can use other Unitxt loaders (listed on the Unitxt website) together with the volumes and volumeMounts pod parameters to mount a dataset from a persistent volume. For example, if you use the LoadCSV Unitxt loader, mount the files into the container to make the dataset accessible to the evaluation process, as shown in the sketch that follows.
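
    The following partial LMEvalJob spec is a minimal sketch of this approach. The PVC name, mount path, and dataset file name are illustrative, and the custom card is assumed to use a file-based loader that reads from the mounted path:

    spec:
      # taskList with a custom card whose loader reads, for example,
      # /opt/app-root/src/data/my-dataset.csv
      pod:
        container:
          volumeMounts:
          - name: dataset-volume
            mountPath: /opt/app-root/src/data
        volumes:
        - name: dataset-volume
          persistentVolumeClaim:
            claimName: my-dataset-pvc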
Note

The provided scenario example does not work on s390x, as it uses a Parquet-type dataset, which is not supported on this architecture. To run the scenario on s390x, use a task with a non-Parquet dataset.

2.6.3. Using PVCs as storage

To use a PVC as storage for the LMEvalJob results, you can use either managed PVCs or existing PVCs. Managed PVCs are managed by the TrustyAI operator. Existing PVCs are created by the end-user before the LMEvalJob is created.

Note

If both managed and existing PVCs are referenced in outputs, the TrustyAI operator defaults to the managed PVC.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the project where the models are deployed.

2.6.3.1. Managed PVCs

To create a managed PVC, specify its size. The managed PVC is named <job-name>-pvc and is available after the job finishes. When the LMEvalJob is deleted, the managed PVC is also deleted.

Procedure

  • Enter the following code:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      # other fields omitted ...
      outputs:
        pvcManaged:
          size: 5Gi

Notes on the code

  • outputs is the section for specifying custom storage locations
  • pvcManaged creates an operator-managed PVC
  • size (compatible with standard PVC syntax) is the only supported field

2.6.3.2. Existing PVCs

To use an existing PVC, pass its name as a reference. The PVC must exist when you create the LMEvalJob. The PVC is not managed by the TrustyAI operator, so it remains available after you delete the LMEvalJob.

Procedure

  1. Create a PVC. An example is the following:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: "my-pvc"
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
  2. Reference the new PVC from the LMEvalJob.

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      # other fields omitted ...
      outputs:
        pvcName: "my-pvc"
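  3. Optional: After the job completes, you can inspect the results stored on the PVC by mounting it in a temporary pod. The following is a minimal sketch; the pod name, image, and mount path are illustrative:

    apiVersion: v1
    kind: Pod
    metadata:
      name: lmeval-results-reader
    spec:
      containers:
      - name: reader
        image: registry.access.redhat.com/ubi9/ubi-minimal
        command: ["sleep", "infinity"]
        volumeMounts:
        - name: results
          mountPath: /results
      volumes:
      - name: results
        persistentVolumeClaim:
          claimName: my-pvc

    You can then list the stored results with oc exec lmeval-results-reader -- ls /results.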

2.6.4. Using a KServe Inference Service

To run an evaluation job on an InferenceService that is already deployed and running in your namespace, define your LMEvalJob CR, and then apply the CR in the same namespace as your model.

Note

The following example only works with Hugging Face or vLLM-based model-serving runtimes.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the project where the models are deployed.
  • You have a namespace that contains an InferenceService with a vLLM model. This example assumes that a vLLM model is already deployed in your cluster.
  • Your cluster has Domain Name System (DNS) configured.

Procedure

  1. Define your LMEvalJob CR:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob
    spec:
      model: local-completions
      taskList:
        taskNames:
          - mmlu
      logSamples: true
      batchSize: 1
      modelArgs:
        - name: model
          value: granite
        - name: base_url
          value: $ROUTE_TO_MODEL/v1/completions
        - name: num_concurrent
          value: "1"
        - name: max_retries
          value: "3"
        - name: tokenized_requests
          value: "False"
        - name: tokenizer
          value: huggingfacespace/model
      pod:
        container:
          env:
            - name: OPENAI_TOKEN
              valueFrom:
                secretKeyRef:
                  name: <secret-name>
                  key: token
  2. Apply the CR in the same namespace as your model.

Verification

A pod named evaljob starts in your model namespace. In the pod terminal, you can view the evaluation progress by running tail -f output/stderr.log.

Notes on the code

  • base_url should be set to the route/service URL of your model. Make sure to include the /v1/completions endpoint in the URL.
  • The env entry under pod.container sets the authentication token: valueFrom.secretKeyRef.name should point to a secret that contains a token that can authenticate to your model. secretKeyRef.name should be the secret’s name in the namespace, while secretKeyRef.key should point at the token’s key within the secret.
  • secretKeyRef.name can equal the output of:

    oc get secrets -o custom-columns=SECRET:.metadata.name --no-headers | grep user-one-token
  • secretKeyRef.key is set to token
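  • For example, you can retrieve the model URL from the InferenceService status and use it to construct base_url; the InferenceService name and namespace are placeholders:

    ROUTE_TO_MODEL=$(oc get inferenceservice <model_name> -n <model_namespace> -o jsonpath='{.status.url}')
    echo $ROUTE_TO_MODEL/v1/completions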

2.6.5. Setting up LM-Eval S3 Support

Learn how to set up S3 support for your LM-Eval service.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the project where the models are deployed.
  • You have a namespace that contains an S3-compatible storage service and bucket.
  • You have created an LMEvalJob that references the S3 bucket containing your model and dataset.
  • You have an S3 bucket that contains the model files and the dataset(s) to be evaluated.

Procedure

  1. Create a Kubernetes Secret containing your S3 connection details:

    apiVersion: v1
    kind: Secret
    metadata:
        name: "s3-secret"
        namespace: test
        labels:
            opendatahub.io/dashboard: "true"
            opendatahub.io/managed: "true"
        annotations:
            opendatahub.io/connection-type: s3
            openshift.io/display-name: "S3 Data Connection - LMEval"
    data:
        AWS_ACCESS_KEY_ID: BASE64_ENCODED_ACCESS_KEY  # Replace with your key
        AWS_SECRET_ACCESS_KEY: BASE64_ENCODED_SECRET_KEY  # Replace with your key
        AWS_S3_BUCKET: BASE64_ENCODED_BUCKET_NAME  # Replace with your bucket name
        AWS_S3_ENDPOINT: BASE64_ENCODED_ENDPOINT  # Replace with your endpoint URL (for example,  https://s3.amazonaws.com)
        AWS_DEFAULT_REGION: BASE64_ENCODED_REGION  # Replace with your region
    type: Opaque
    Note

    All values must be base64 encoded. For example: echo -n "my-bucket" | base64

  2. Deploy the LMEvalJob CR that references the S3 bucket containing your model and dataset:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
        name: evaljob-sample
    spec:
        allowOnline: false
        model: hf  # Model type (HuggingFace in this example)
        modelArgs:
            - name: pretrained
              value: /opt/app-root/src/hf_home/flan  # Path where model is mounted in container
        taskList:
            taskNames:
                - arc_easy  # The evaluation task to run
        logSamples: true
        offline:
            storage:
                s3:
                    accessKeyId:
                        name: s3-secret
                        key: AWS_ACCESS_KEY_ID
                    secretAccessKey:
                        name: s3-secret
                        key: AWS_SECRET_ACCESS_KEY
                    bucket:
                        name: s3-secret
                        key: AWS_S3_BUCKET
                    endpoint:
                        name: s3-secret
                        key: AWS_S3_ENDPOINT
                    region:
                        name: s3-secret
                        key: AWS_DEFAULT_REGION
                    path: ""  # Optional subfolder within bucket
                    verifySSL: false
    Important
    The LMEvalJob copies all the files from the specified bucket path. If your bucket contains many files and you only want to use a subset, set the path field to the specific subfolder that contains the files that you require. For example, use path: "my-models/".
  3. Set up a secure connection using SSL.

    1. Create a ConfigMap object with your CA certificate:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: s3-ca-cert
        namespace: test
        annotations:
          service.beta.openshift.io/inject-cabundle: "true"  # For injection
      data: {}  # OpenShift will inject the service CA bundle
      # Or add your custom CA:
      # data:
      #   ca.crt: |-
      #     -----BEGIN CERTIFICATE-----
      #     ...your CA certificate content...
      #     -----END CERTIFICATE-----
    2. Update the LMEvalJob to use SSL verification:

      apiVersion: trustyai.opendatahub.io/v1alpha1
      kind: LMEvalJob
      metadata:
          name: evaljob-sample
      spec:
          # ... same as above ...
          offline:
              storage:
                  s3:
                      # ... same as above ...
                      verifySSL: true  # Enable SSL verification
                      caBundle:
                          name: s3-ca-cert  # ConfigMap name containing your CA
                          key: service-ca.crt  # Key in ConfigMap containing the certificate

Verification

  1. After deploying the LMEvalJob, check its status by running the following command: kubectl logs -n test job/evaljob-sample
  2. View the logs with the kubectl logs -n test job/<job-name> command to make sure that the job ran correctly.
  3. The results are displayed in the logs after the evaluation is completed.

2.6.6. Using LLM-as-a-Judge metrics with LM-Eval

You can use a large language model (LLM) to assess the quality of outputs from another LLM, known as LLM-as-a-Judge (LLMaaJ).

You can use LLMaaJ to:

  • Assess work with no clearly correct answer, such as creative writing.
  • Judge quality characteristics such as helpfulness, safety, and depth.
  • Augment traditional quantitative measures that are used to evaluate a model’s performance (for example, ROUGE metrics).
  • Test specific quality aspects of your model output.

Follow the custom quality assessment example below to learn more about using your own metrics criteria with LM-Eval to evaluate model responses.

This example uses Unitxt to define custom metrics and to see how the model (flan-t5-small) answers questions from MT-Bench, a standard benchmark. Custom evaluation criteria and instructions are used with the Mistral-7B judge model to rate the answers from 1 to 10, based on helpfulness, accuracy, and detail.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

  • Your cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the project where the models are deployed.
  • You are familiar with how to use Unitxt.
  • You are familiar with the following parameters, which are used in this example:

    Table 2.6. Parameters

    • Custom template: Tells the judge to assign a score between 1 and 10 in a standardized format, based on specific criteria.
    • processors.extract_mt_bench_rating_judgment: Pulls the numerical rating from the judge’s response.
    • formats.models.mistral.instruction: Formats the prompts for the Mistral model.
    • Custom LLM-as-judge metric: Uses Mistral-7B with your custom instructions.

Procedure

  1. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI (oc) as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Apply the following manifest by using the oc apply -f - command. The YAML content defines a custom evaluation job (LMEvalJob), the namespace, and the location of the model you want to evaluate. The YAML contains the following instructions:

    1. Which model to evaluate.
    2. What data to use.
    3. How to format inputs and outputs.
    4. Which judge model to use.
    5. How to extract and log results.

      Note

      You can also put the YAML manifest into a file using a text editor and then apply it by using the oc apply -f file.yaml command.

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: custom-eval
  namespace: test
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-small
  taskList:
    taskRecipes:
      - card:
          custom: |
           {
               "__type__": "task_card",
               "loader": {
                   "__type__": "load_hf",
                   "path": "OfirArviv/mt_bench_single_score_gpt4_judgement",
                   "split": "train"
               },
               "preprocess_steps": [
                   {
                       "__type__": "rename_splits",
                       "mapper": {
                           "train": "test"
                       }
                   },
                   {
                       "__type__": "filter_by_condition",
                       "values": {
                           "turn": 1
                       },
                       "condition": "eq"
                   },
                   {
                       "__type__": "filter_by_condition",
                       "values": {
                           "reference": "[]"
                       },
                       "condition": "eq"
                   },
                   {
                       "__type__": "rename",
                       "field_to_field": {
                           "model_input": "question",
                           "score": "rating",
                           "category": "group",
                           "model_output": "answer"
                       }
                   },
                   {
                       "__type__": "literal_eval",
                       "field": "question"
                   },
                   {
                       "__type__": "copy",
                       "field": "question/0",
                       "to_field": "question"
                   },
                   {
                       "__type__": "literal_eval",
                       "field": "answer"
                   },
                   {
                       "__type__": "copy",
                       "field": "answer/0",
                       "to_field": "answer"
                   }
               ],
               "task": "tasks.response_assessment.rating.single_turn",
               "templates": [
                   "templates.response_assessment.rating.mt_bench_single_turn"
               ]
           }
       template:
         ref: response_assessment.rating.mt_bench_single_turn
       format: formats.models.mistral.instruction
       metrics:
       - ref: llmaaj_metric
    custom:
     templates:
       - name: response_assessment.rating.mt_bench_single_turn
         value: |
           {
               "__type__": "input_output_template",
               "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n",
               "input_format": "[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]",
               "output_format": "[[{rating}]]",
               "postprocessors": [
                   "processors.extract_mt_bench_rating_judgment"
               ]
           }
     tasks:
       - name: response_assessment.rating.single_turn
         value: |
           {
               "__type__": "task",
               "input_fields": {
                   "question": "str",
                   "answer": "str"
               },
               "outputs": {
                   "rating": "float"
               },
               "metrics": [
                   "metrics.spearman"
               ]
           }
     metrics:
       - name: llmaaj_metric
         value: |
           {
               "__type__": "llm_as_judge",
               "inference_model": {
                   "__type__": "hf_pipeline_based_inference_engine",
                   "model_name": "mistralai/Mistral-7B-Instruct-v0.2",
                   "max_new_tokens": 256,
                   "use_fp16": true
               },
               "template": "templates.response_assessment.rating.mt_bench_single_turn",
               "task": "rating.single_turn",
               "format": "formats.models.mistral.instruction",
               "main_score": "mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn"
           }
  logSamples: true
  pod:
    container:
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
      resources:
        limits:
          cpu: '2'
          memory: 16Gi

Verification

A processor extracts the numeric rating from the judge’s natural language response. The final result is available as part of the LMEval Job Custom Resource (CR).
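
For example, when the state of the job is Complete, you can retrieve the aggregated ratings with the same command pattern used for other evaluation jobs:

oc get lmevaljobs.trustyai.opendatahub.io custom-eval -n test \
  -o template --template={{.status.results}} | jq '.results'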

Note

The provided scenario example does not work on s390x because it uses a Parquet-type dataset. To run the scenario on s390x, use a task with a non-Parquet dataset.
