Chapter 6. Evaluating large language models


A large language model (LLM) is a type of artificial intelligence (AI) program that is designed for natural language processing tasks, such as recognizing and generating text.

As a data scientist, you might want to monitor your large language models against a range of metrics to ensure the accuracy and quality of their output. Capabilities such as summarization, language toxicity, and question-answering accuracy can be assessed to inform and improve your model parameters.

Red Hat OpenShift AI now offers Language Model Evaluation as a Service (LM-Eval-aaS) through a feature called LM-Eval. LM-Eval provides a unified framework to test generative language models on a wide range of evaluation tasks.

The following sections show you how to create an LMEvalJob custom resource (CR) to start an evaluation job and generate an analysis of your model's capabilities.

6.1. Setting up LM-Eval

LM-Eval is a service for evaluating large language models that is integrated into the TrustyAI operator.

The service is built on top of two open-source projects:

  • LM Evaluation Harness, developed by EleutherAI, which provides a comprehensive framework for evaluating language models
  • Unitxt, a tool that enhances the evaluation process with additional functionalities

The following information explains how to create an LMEvalJob custom resource (CR) to initiate an evaluation job and get the results.

Global settings for LM-Eval

Configurable global settings for LM-Eval services are stored in the TrustyAI operator global ConfigMap, named trustyai-service-operator-config. The ConfigMap is located in the same namespace as the operator.

You can configure the following properties for LM-Eval:

Table 6.1. LM-Eval properties
Each property is listed with its default value and a description:

  • lmes-detect-device (default: true/false): Detects whether GPUs are available and assigns a value for the --device argument of LM Evaluation Harness. If GPUs are available, the value is cuda. If no GPUs are available, the value is cpu.
  • lmes-pod-image (default: quay.io/trustyai/ta-lmes-job:latest): The image for the LM-Eval job. The image contains the Python packages for LM Evaluation Harness and Unitxt.
  • lmes-driver-image (default: quay.io/trustyai/ta-lmes-driver:latest): The image for the LM-Eval driver. For detailed information about the driver, see the cmd/lmes_driver directory.
  • lmes-image-pull-policy (default: Always): The image-pulling policy when running the evaluation job.
  • lmes-default-batch-size (default: 8): The default batch size when invoking the model inference API. The default batch size is only available for local models.
  • lmes-max-batch-size (default: 24): The maximum batch size that users can specify in an evaluation job.
  • lmes-pod-checking-interval (default: 10s): The interval at which the job pod of an evaluation job is checked.
  • lmes-allow-online (default: true): Whether LMEval jobs can set the online mode to on to access artifacts (models, datasets, tokenizers) from the internet.
  • lmes-code-execution (default: true): Whether LMEval jobs can set the trust remote code mode to on.

After updating the settings in the ConfigMap, restart the operator to apply the new values.
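
Before changing any of these values, you can review the current settings with a standard oc command. This sketch assumes the operator runs in the redhat-ods-applications namespace, which is the default for Red Hat OpenShift AI:

oc get configmap trustyai-service-operator-config \
  -n redhat-ods-applications -o yaml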

Important

The allowOnline setting is disabled by default at the operator level in Red Hat OpenShift AI, because allowOnline gives the job permission to automatically download artifacts from external sources.

Enabling allowOnline mode

To enable allowOnline mode, patch the TrustyAI operator ConfigMap with the following code:

kubectl patch configmap trustyai-service-operator-config -n redhat-ods-applications \
  --type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'

Then restart the TrustyAI operator with:

kubectl rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications

6.2. LM-Eval evaluation job

LM-Eval service defines a new Custom Resource Definition (CRD) called LMEvalJob. An LMEvalJob object represents an evaluation job. LMEvalJob objects are monitored by the TrustyAI Kubernetes operator.

To run an evaluation job, create an LMEvalJob object with the following information: model, model arguments, task, and secret.

After the LMEvalJob is created, the LM-Eval service runs the evaluation job. The status and results of the LMEvalJob object update when the information is available.

Note

Other TrustyAI features (such as bias and drift metrics) do not support non-tabular models (including LLMs). Deploying the TrustyAIService custom resource (CR) in a namespace that contains non-tabular models (such as the namespace where an evaluation job is being executed) can cause errors within the TrustyAI service.

Sample LMEvalJob object

The sample LMEvalJob object contains the following features:

  • The google/flan-t5-base model from Hugging Face.
  • The dataset from the wnli card, a subset of the GLUE (General Language Understanding Evaluation) benchmark evaluation framework from Hugging Face. For more information about the wnli Unitxt card, see the Unitxt website.
  • The following default metrics for the multi_class.relation Unitxt task: f1_micro, f1_macro, and accuracy. This template can be found on the Unitxt website: click Catalog, then click Tasks and select Classification from the menu.

The following is an example of an LMEvalJob object:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true
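If you saved the sample object to a file, you can apply it with a standard oc apply command; the file name and project name are illustrative:

oc apply -f lmevaljob-sample.yaml -n <your-data-science-project>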

After you apply the sample LMEvalJob, check its state by using the following command:

oc get lmevaljob evaljob-sample

Output similar to the following appears:

NAME             STATE
evaljob-sample   Running

Evaluation results are available when the state of the object changes to Complete. Both the model and dataset in this example are small. The evaluation job should finish within 10 minutes on a CPU-only node.
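
If you prefer to block until the job finishes, you can wait on the state field from the command line. This is a sketch; it assumes the STATE column shown above maps to .status.state and that your oc client supports jsonpath-based waits:

oc wait --for=jsonpath='{.status.state}'=Complete \
  lmevaljob/evaljob-sample --timeout=600s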

Use the following command to get the results:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results'

The command returns results similar to the following example:

{
  "tr_0": {
    "alias": "tr_0",
    "f1_micro,none": 0.5633802816901409,
    "f1_micro_stderr,none": "N/A",
    "accuracy,none": 0.5633802816901409,
    "accuracy_stderr,none": "N/A",
    "f1_macro,none": 0.36036036036036034,
    "f1_macro_stderr,none": "N/A"
  }
}

Notes on the results

  • The f1_micro, f1_macro, and accuracy scores are 0.56, 0.36, and 0.56.
  • The full results are stored in the .status.results of the LMEvalJob object as a JSON document.
  • The command above only retrieves the results field of the JSON document.
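
For example, to extract a single metric from the stored results, you can narrow the jq filter; the tr_0 key and the f1_micro,none metric name come from the sample output above:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results.tr_0["f1_micro,none"]'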

LMEvalJob properties

The following table lists each property in the LMEvalJob and its usage:

Table 6.2. LM-Eval properties
Each entry lists the parameter name, followed by its description.

model

Specifies which model type or provider is evaluated. This field directly maps to the --model argument of the lm-evaluation-harness. Supported model types and providers include:

  • hf: HuggingFace models
  • openai-completions: OpenAI Completions API models
  • openai-chat-completions: OpenAI Chat Completions API models
  • local-completions and local-chat-completions: OpenAI API-compatible servers
  • textsynth: TextSynth APIs

modelArgs

A list of paired name and value arguments for the model type. Each model type or provider supports different arguments. You can find further details in the models section of the LM Evaluation Harness library on GitHub.

  • hf (HuggingFace)
  • local-completions (An OpenAI API-compatible server)
  • local-chat-completions (An OpenAI API-compatible server)
  • openai-completions (OpenAI Completions API models)
  • openai-chat-completions (ChatCompletions API models)
  • textsynth (TextSynth APIs)

taskList.taskNames

Specifies a list of tasks supported by lm-evaluation-harness.

taskList.taskRecipes

Specifies the task using the Unitxt recipe format:

  • card: Use the name to specify a Unitxt card or custom for a custom card.

    • name: Specifies a Unitxt card from the Unitxt catalog. Use the card ID as the value. For example, the ID of the Wnli card is cards.wnli.
    • custom: Defines and uses a custom card. The value is a JSON object that contains the custom dataset. For more information about creating a custom card, see the Unitxt documentation on their website. If the dataset used by the custom card requires an API key from an environment variable or a persistent volume, configure the necessary resources in the pod field.
  • template: Specifies a Unitxt template from the Unitxt catalog. Use the template ID as the value.
  • task (optional): Specifies a Unitxt task from the Unitxt catalog. Use the task ID as the value. A Unitxt card has a predefined task. Only specify a value for this if you want to run a different task.
  • metrics (optional): Specifies Unitxt metrics from the Unitxt catalog. Use the metric ID as the value. A Unitxt task has a set of pre-defined metrics. Only specify a set of metrics if you need different metrics.
  • format (optional): Specifies a Unitxt format from the Unitxt catalog. Use the format ID as the value.
  • loaderLimit (optional): Specifies the maximum number of instances per stream to be returned from the loader. You can use this parameter to reduce loading time in large datasets.
  • numDemos (optional): The number of few-shot examples to use.
  • demosPoolSize (optional): Size of the few-shot pool.

numFewShot

Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, do not use this field. Use numDemos under the taskRecipes instead.

limit

Sets a limit on the number of examples per task instead of running the entire dataset. Accepts either an integer (a number of examples) or a float between 0.0 and 1.0 (a fraction of the dataset).

genArgs

Maps to the --gen_kwargs parameter for the lm-evaluation-harness. For more information, see the LM Evaluation Harness documentation on GitHub.

logSamples

If this flag is set, the model outputs and the text fed into the model are saved at per-document granularity.

batchSize

Specifies the batch size for the evaluation as an integer. The auto:N batch size is not supported for API-based models; numeric batch sizes are used instead.

pod

Specifies extra information for the lm-eval job pod:

  • container: Specifies additional container settings for the lm-eval container.

    • env: Specifies environment variables. This parameter uses the EnvVar data structure of Kubernetes.
    • volumeMounts: Mounts the volumes into the lm-eval container.
    • resources: Specifies the resources for the lm-eval container.
  • volumes: Specifies the volume information for the lm-eval and other containers. This parameter uses the Volume data structure of Kubernetes.
  • sideCars: A list of containers that run along with the lm-eval container. It uses the Container data structure of Kubernetes.

outputs

This parameter defines a custom output location to store the evaluation results. Only persistent volume claims (PVCs) are supported.

outputs.pvcManaged

Creates an operator-managed PVC to store the job results. The PVC is named <job-name>-pvc and is owned by the LMEvalJob. After the job finishes, the PVC remains available, but it is deleted when the LMEvalJob is deleted. Supports the following fields:

  • size: The PVC size, compatible with standard PVC syntax (for example, 5Gi)

outputs.pvcName

Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job.

allowOnline

If this parameter is set to true, the LMEval job downloads artifacts as needed (for example, models, datasets or tokenizers). If set to false, artifacts are not downloaded and are pulled from local storage instead. This setting is disabled by default. If you want to enable allowOnline mode, you can patch the TrustyAI operator ConfigMap.

allowCodeExecution

If this parameter is set to true, the LMEval job executes the necessary code for preparing models or datasets. If set to false, it does not execute downloaded code.

offline

Mounts a PVC as local storage for models and datasets.
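
For reference, the following sketch combines several of the optional parameters described above in one LMEvalJob. It extends the sample object from Section 6.2; the few-shot, batch size, storage size, and resource values are illustrative only:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
      numDemos: 3          # number of few-shot examples placed in context
      demosPoolSize: 10    # size of the pool the few-shot examples are drawn from
      loaderLimit: 500     # cap on instances per stream, to shorten loading time
  logSamples: true
  batchSize: 4             # numeric batch size; auto:N is not supported for API models
  outputs:
    pvcManaged:
      size: 5Gi            # operator-managed PVC named evaljob-sample-pvc
  pod:
    container:
      resources:
        limits:
          cpu: "2"
          memory: 8Gi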

6.3. LM-Eval scenarios

The following procedures outline example scenarios that can be useful for an LM-Eval setup.

6.3.1. Configuring the LM-Eval environment

If the LMEvalJob needs to access a model on Hugging Face with an access token, you can set HF_TOKEN as an environment variable for the lm-eval container.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.

Procedure

  1. To start an evaluation job for a Hugging Face model, apply the following YAML file:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      model: hf
      modelArgs:
      - name: pretrained
        value: huggingfacespace/model
      taskList:
        taskNames:
        - unfair_tos
      logSamples: true
      pod:
        container:
          env:
          - name: HF_TOKEN
            value: "My HuggingFace token"
  2. (Optional) You can also create a secret to store the token, and then reference the token key from the secret by using a secretKeyRef object with the following syntax (a command for creating such a secret is sketched after this procedure):

    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: my-secret
            key: hf-token
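A secret such as the one referenced above can be created with a standard oc command. This is a minimal sketch; the secret name my-secret and the key hf-token match the snippet above, and the project name is a placeholder:

oc create secret generic my-secret \
  --from-literal=hf-token=<your-hugging-face-token> \
  -n <your-data-science-project>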

6.3.2. Using a custom Unitxt card

You can run evaluations using custom Unitxt cards. To do this, include the custom Unitxt card in JSON format within the LMEvalJob YAML.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.

Procedure

  1. Pass a custom Unitxt Card in JSON format:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      model: hf
      modelArgs:
      - name: pretrained
        value: google/flan-t5-base
      taskList:
        taskRecipes:
        - template: "templates.classification.multi_class.relation.default"
          card:
            custom: |
              {
                "__type__": "task_card",
                "loader": {
                  "__type__": "load_hf",
                  "path": "glue",
                  "name": "wnli"
                },
                "preprocess_steps": [
                  {
                    "__type__": "split_random_mix",
                    "mix": {
                      "train": "train[95%]",
                      "validation": "train[5%]",
                      "test": "validation"
                    }
                  },
                  {
                    "__type__": "rename",
                    "field": "sentence1",
                    "to_field": "text_a"
                  },
                  {
                    "__type__": "rename",
                    "field": "sentence2",
                    "to_field": "text_b"
                  },
                  {
                    "__type__": "map_instance_values",
                    "mappers": {
                      "label": {
                        "0": "entailment",
                        "1": "not entailment"
                      }
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "classes": [
                        "entailment",
                        "not entailment"
                      ]
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "type_of_relation": "entailment"
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "text_a_type": "premise"
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "text_b_type": "hypothesis"
                    }
                  }
                ],
                "task": "tasks.classification.multi_class.relation",
                "templates": "templates.classification.multi_class.relation.all"
              }
      logSamples: true
  2. Inside the custom card, specify the Hugging Face dataset loader:

    "loader": {
                  "__type__": "load_hf",
                  "path": "glue",
                  "name": "wnli"
                },
  3. (Optional) You can use other Unitxt loaders (found on the Unitxt website) together with the volumes and volumeMounts parameters to mount a dataset from a persistent volume. For example, if you use the Unitxt LoadCSV loader, mount the files into the container so that the dataset is accessible to the evaluation process, as shown in the sketch after this procedure.
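
The following minimal sketch shows that approach, assuming a PVC named my-dataset-pvc that already contains the CSV files; the volume name and mount path are illustrative:

# excerpt of an LMEvalJob spec; model, taskList, and other fields are as in step 1
spec:
  pod:
    container:
      volumeMounts:
      - name: dataset-volume
        mountPath: /data            # the custom card's loader reads files from this path
    volumes:
    - name: dataset-volume
      persistentVolumeClaim:
        claimName: my-dataset-pvc   # PVC that contains the CSV dataset

Inside the custom card, the loader then points at the mounted files. The load_csv type name and files field shown here follow the same naming convention as the load_hf loader above; check the Unitxt documentation for the exact schema of the loader you use:

"loader": {
  "__type__": "load_csv",
  "files": {
    "train": "/data/train.csv"
  }
},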

6.3.3. Using PVCs as storage

To use a PVC as storage for the LMEvalJob results, you can use either managed PVCs or existing PVCs. Managed PVCs are managed by the TrustyAI operator. Existing PVCs are created by the end user before the LMEvalJob is created.

Note

If both managed and existing PVCs are referenced in outputs, the TrustyAI operator defaults to the managed PVC.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.

6.3.3.1. Managed PVCs

To create a managed PVC, specify its size. The managed PVC is named <job-name>-pvc and is available after the job finishes. When the LMEvalJob is deleted, the managed PVC is also deleted.

Procedure

  • Enter the following code:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      # other fields omitted ...
      outputs:
        pvcManaged:
          size: 5Gi

Notes on the code

  • outputs is the section for specifying custom storage locations
  • pvcManaged creates an operator-managed PVC
  • size (compatible with standard PVC syntax) is the only supported field
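
After the job reaches the Complete state, you can confirm that the managed PVC was created; the name follows the <job-name>-pvc pattern, so for the evaljob-sample job:

oc get pvc evaljob-sample-pvc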

6.3.3.2. Existing PVCs

To use an existing PVC, pass its name as a reference. The PVC must exist when you create the LMEvalJob. Because the PVC is not managed by the TrustyAI operator, it remains available after you delete the LMEvalJob.

Procedure

  1. Create a PVC. An example is the following:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: "my-pvc"
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
  2. Reference the new PVC from the LMEvalJob.

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      # other fields omitted ...
      outputs:
        pvcName: "my-pvc"
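After the evaluation job completes, one way to inspect the files it wrote to the PVC is to mount the PVC into a temporary pod. This is a minimal sketch; the pod name, image, and mount path are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: lmeval-results-browser
spec:
  containers:
  - name: browser
    image: registry.access.redhat.com/ubi9/ubi-minimal  # any small image with a shell works
    command: ["sleep", "3600"]
    volumeMounts:
    - name: results
      mountPath: /results
  volumes:
  - name: results
    persistentVolumeClaim:
      claimName: my-pvc

With the pod running, oc exec lmeval-results-browser -- ls /results lists the files that the job produced.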

6.3.4. Using an InferenceService

To run an evaluation job on an InferenceService that is already deployed and running in your namespace, define your LMEvalJob CR, and then apply the CR in the same namespace as your model.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.
  • You have a namespace that contains an InferenceService with a vLLM model. This example assumes that the vLLM model is already deployed in your cluster.

Procedure

  1. Define your LMEvalJob CR:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob
    spec:
      model: local-completions
      taskList:
        taskNames:
          - mmlu
      logSamples: true
      batchSize: 1
      modelArgs:
        - name: model
          value: granite
        - name: base_url
          value: $ROUTE_TO_MODEL/v1/completions
        - name: num_concurrent
          value:  "1"
        - name: max_retries
          value:  "3"
        - name: tokenized_requests
          value: "False"
        - name: tokenizer
          value: ibm-granite/granite-7b-instruct
      pod:
        container:
          env:
          - name: OPENAI_TOKEN
            valueFrom:
              secretKeyRef:
                name: <secret-name>
                key: token
  2. Apply this CR into the same namespace as your model.

Verification

A pod called evaljob spins up in your model namespace. In the pod terminal, you can view the output by running tail -f output/stderr.log.

Notes on the code

  • base_url should be set to the route/service URL of your model. Make sure to include the /v1/completions endpoint in the URL.
  • env.valueFrom.secretKeyRef.name should point to a secret that contains a token that can authenticate to your model. secretKeyRef.name should be the secret’s name in the namespace, while secretKeyRef.key should point at the token’s key within the secret.
  • secretKeyRef.name can equal the output of:

    oc get secrets -o custom-columns=SECRET:.metadata.name --no-headers | grep user-one-token
  • secretKeyRef.key is set to token
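
To fill in the $ROUTE_TO_MODEL placeholder in base_url, you can read the URL from the InferenceService status. This is a sketch, assuming the InferenceService in your namespace is named granite:

oc get inferenceservice granite -o jsonpath='{.status.url}'

Append /v1/completions to the returned URL when you set base_url.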