Chapter 6. Evaluating large language models


A large language model (LLM) is a type of artificial intelligence (AI) program that is designed for natural language processing tasks, such as recognizing and generating text.

As a data scientist, you might want to monitor your large language models against a range of metrics to ensure the accuracy and quality of their output. Capabilities such as summarization, language toxicity, and question-answering accuracy can be assessed to inform and improve your model parameters.

Red Hat OpenShift AI now offers Language Model Evaluation as a Service (LM-Eval-aaS) through a feature called LM-Eval. LM-Eval provides a unified framework to test generative language models on a wide range of evaluation tasks.

The following sections show you how to create an LMEvalJob custom resource (CR) to start an evaluation job and generate an analysis of your model's capabilities.

6.1. Setting up LM-Eval

LM-Eval is a service for evaluating large language models that is integrated into the TrustyAI operator.

The service is built on top of two open-source projects:

  • LM Evaluation Harness, developed by EleutherAI, which provides a comprehensive framework for evaluating language models
  • Unitxt, a tool that enhances the evaluation process with additional functionalities

The following information explains how to create an LMEvalJob custom resource (CR) to initiate an evaluation job and get the results.

Global settings for LM-Eval

Configurable global settings for LM-Eval services are stored in the TrustyAI operator global ConfigMap, named trustyai-service-operator-config. The ConfigMap is located in the same namespace as the operator.

You can configure the following properties for LM-Eval:

Table 6.1. LM-Eval properties
Each property is listed with its default value and a description:

  • lmes-detect-device (default: true/false): Detects whether GPUs are available and assigns a value for the --device argument of LM Evaluation Harness. If GPUs are available, the value is cuda. If no GPUs are available, the value is cpu.
  • lmes-pod-image (default: quay.io/trustyai/ta-lmes-job:latest): The image for the LM-Eval job. The image contains the Python packages for LM Evaluation Harness and Unitxt.
  • lmes-driver-image (default: quay.io/trustyai/ta-lmes-driver:latest): The image for the LM-Eval driver. For detailed information about the driver, see the cmd/lmes_driver directory.
  • lmes-image-pull-policy (default: Always): The image-pulling policy when running the evaluation job.
  • lmes-default-batch-size (default: 8): The default batch size when invoking the model inference API. The default batch size is only available for local models.
  • lmes-max-batch-size (default: 24): The maximum batch size that users can specify in an evaluation job.
  • lmes-pod-checking-interval (default: 10s): The interval at which the job pod of an evaluation job is checked.
  • lmes-allow-online (default: true): Whether LMEval jobs can set the online mode to on to access artifacts (models, datasets, tokenizers) from the internet.
  • lmes-code-execution (default: true): Whether LMEval jobs can set the trust remote code mode to on.

After updating the settings in the ConfigMap, restart the operator to apply the new values.
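
Before changing any of these values, you can review the current settings with a standard oc command. This sketch assumes the operator runs in the redhat-ods-applications namespace, which is the default for Red Hat OpenShift AI:

oc get configmap trustyai-service-operator-config \
  -n redhat-ods-applications -o yaml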

Important

The allowOnline setting is disabled by default at the operator level in Red Hat OpenShift AI, because allowOnline gives the job permission to automatically download artifacts from external sources.

Enabling allowOnline mode

To enable allowOnline mode, patch the TrustyAI operator ConfigMap with the following code:

kubectl patch configmap trustyai-service-operator-config -n redhat-ods-applications \
  --type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'

Then restart the TrustyAI operator with:

kubectl rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications

6.2. LM-Eval evaluation job

LM-Eval service defines a new Custom Resource Definition (CRD) called LMEvalJob. An LMEvalJob object represents an evaluation job. LMEvalJob objects are monitored by the TrustyAI Kubernetes operator.

To run an evaluation job, create an LMEvalJob object with the following information: model, model arguments, task, and secret.

After the LMEvalJob is created, the LM-Eval service runs the evaluation job. The status and results of the LMEvalJob object update when the information is available.

Note

Other TrustyAI features (such as bias and drift metrics) do not support non-tabular models (including LLMs). Deploying the TrustyAIService custom resource (CR) in a namespace that contains non-tabular models (such as the namespace where an evaluation job is being executed) can cause errors within the TrustyAI service.

Sample LMEvalJob object

The sample LMEvalJob object contains the following features:

  • The google/flan-t5-base model from Hugging Face.
  • The dataset from the wnli card, a subset of the GLUE (General Language Understanding Evaluation) benchmark evaluation framework from Hugging Face. For more information about the wnli Unitxt card, see the Unitxt website.
  • The following default metrics for the multi_class.relation Unitxt task: f1_micro, f1_macro, and accuracy. This template can be found on the Unitxt website: click Catalog, then click Tasks and select Classification from the menu.

The following is an example of an LMEvalJob object:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true
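If you saved the sample object to a file, you can apply it with a standard oc apply command; the file name and project name are illustrative:

oc apply -f lmevaljob-sample.yaml -n <your-data-science-project>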

After you apply the sample LMEvalJob, check its state by using the following command:

oc get lmevaljob evaljob-sample

Output similar to the following appears:

NAME             STATE
evaljob-sample   Running

Evaluation results are available when the state of the object changes to Complete. Both the model and dataset in this example are small. The evaluation job should finish within 10 minutes on a CPU-only node.
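
If you prefer to block until the job finishes, you can wait on the state field from the command line. This is a sketch; it assumes the STATE column shown above maps to .status.state and that your oc client supports jsonpath-based waits:

oc wait --for=jsonpath='{.status.state}'=Complete \
  lmevaljob/evaljob-sample --timeout=600s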

Use the following command to get the results:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results'

The command returns results similar to the following example:

{
  "tr_0": {
    "alias": "tr_0",
    "f1_micro,none": 0.5633802816901409,
    "f1_micro_stderr,none": "N/A",
    "accuracy,none": 0.5633802816901409,
    "accuracy_stderr,none": "N/A",
    "f1_macro,none": 0.36036036036036034,
    "f1_macro_stderr,none": "N/A"
  }
}

Notes on the results

  • The f1_micro, f1_macro, and accuracy scores are 0.56, 0.36, and 0.56.
  • The full results are stored in the .status.results of the LMEvalJob object as a JSON document.
  • The command above only retrieves the results field of the JSON document.
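
For example, to extract a single metric from the stored results, you can narrow the jq filter; the tr_0 key and the f1_micro,none metric name come from the sample output above:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results.tr_0["f1_micro,none"]'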

LMEvalJob properties

The following table lists each property in the LMEvalJob and its usage:

Table 6.2. LM-Eval properties
Each entry lists the parameter name, followed by its description.

model

Specifies which model type or provider is evaluated. This field directly maps to the --model argument of the lm-evaluation-harness. Supported model types and providers include:

  • hf: HuggingFace models
  • openai-completions: OpenAI Completions API models
  • openai-chat-completions: OpenAI Chat Completions API models
  • local-completions and local-chat-completions: OpenAI API-compatible servers
  • textsynth: TextSynth APIs

modelArgs

A list of paired name and value arguments for the model type. Each model type or provider supports different arguments. You can find further details in the models section of the LM Evaluation Harness library on GitHub.

  • hf (HuggingFace)
  • local-completions (An OpenAI API-compatible server)
  • local-chat-completions (An OpenAI API-compatible server)
  • openai-completions (OpenAI Completions API models)
  • openai-chat-completions (ChatCompletions API models)
  • textsynth (TextSynth APIs)

taskList.taskNames

Specifies a list of tasks supported by lm-evaluation-harness.

taskList.taskRecipes

Specifies the task using the Unitxt recipe format:

  • card: Use the name to specify a Unitxt card or custom for a custom card.

    • name: Specifies a Unitxt card from the Unitxt catalog. Use the card ID as the value. For example, the ID of the Wnli card is cards.wnli.
    • custom: Defines and uses a custom card. The value is a JSON object that contains the custom dataset. For more information about creating a custom card, see the Unitxt documentation on their website. If the dataset used by the custom card requires an API key from an environment variable or a persistent volume, configure the necessary resources in the pod field.
  • template: Specifies a Unitxt template from the Unitxt catalog. Use the template ID as the value.
  • task (optional): Specifies a Unitxt task from the Unitxt catalog. Use the task ID as the value. A Unitxt card has a predefined task. Only specify a value for this if you want to run a different task.
  • metrics (optional): Specifies Unitxt metrics from the Unitxt catalog. Use the metric ID as the value. A Unitxt task has a set of pre-defined metrics. Only specify a set of metrics if you need different metrics.
  • format (optional): Specifies a Unitxt format from the Unitxt catalog. Use the format ID as the value.
  • loaderLimit (optional): Specifies the maximum number of instances per stream to be returned from the loader. You can use this parameter to reduce loading time in large datasets.
  • numDemos (optional): The number of few-shot examples to use.
  • demosPoolSize (optional): Size of the few-shot pool.

numFewShot

Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, do not use this field. Use numDemos under the taskRecipes instead.

limit

Sets a limit on the number of examples per task instead of running the entire dataset. Accepts either an integer (a number of examples) or a float between 0.0 and 1.0 (a fraction of the dataset).

genArgs

Maps to the --gen_kwargs parameter for the lm-evaluation-harness. For more information, see the LM Evaluation Harness documentation on GitHub.

logSamples

If this flag is set, the model outputs and the text fed into the model are saved at per-document granularity.

batchSize

Specifies the batch size for the evaluation as an integer. The auto:N batch size is not supported for API-based models; numeric batch sizes are used instead.

pod

Specifies extra information for the lm-eval job pod:

  • container: Specifies additional container settings for the lm-eval container.

    • env: Specifies environment variables. This parameter uses the EnvVar data structure of Kubernetes.
    • volumeMounts: Mounts the volumes into the lm-eval container.
    • resources: Specifies the resources for the lm-eval container.
  • volumes: Specifies the volume information for the lm-eval and other containers. This parameter uses the Volume data structure of Kubernetes.
  • sideCars: A list of containers that run along with the lm-eval container. It uses the Container data structure of Kubernetes.

outputs

This parameter defines a custom output location to store the evaluation results. Only persistent volume claims (PVCs) are supported.

outputs.pvcManaged

Creates an operator-managed PVC to store the job results. The PVC is named <job-name>-pvc and is owned by the LMEvalJob. After the job finishes, the PVC remains available, but it is deleted when the LMEvalJob is deleted. Supports the following fields:

  • size: The PVC size, compatible with standard PVC syntax (for example, 5Gi)

outputs.pvcName

Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job.

allowOnline

If this parameter is set to true, the LMEval job downloads artifacts as needed (for example, models, datasets or tokenizers). If set to false, artifacts are not downloaded and are pulled from local storage instead. This setting is disabled by default. If you want to enable allowOnline mode, you can patch the TrustyAI operator ConfigMap.

allowCodeExecution

If this parameter is set to true, the LMEval job executes the necessary code for preparing models or datasets. If set to false, it does not execute downloaded code.

offline

Mounts a PVC as local storage for models and datasets.
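
For reference, the following sketch combines several of the optional parameters described above in one LMEvalJob. It extends the sample object from Section 6.2; the few-shot, batch size, storage size, and resource values are illustrative only:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
      numDemos: 3          # number of few-shot examples placed in context
      demosPoolSize: 10    # size of the pool the few-shot examples are drawn from
      loaderLimit: 500     # cap on instances per stream, to shorten loading time
  logSamples: true
  batchSize: 4             # numeric batch size; auto:N is not supported for API models
  outputs:
    pvcManaged:
      size: 5Gi            # operator-managed PVC named evaljob-sample-pvc
  pod:
    container:
      resources:
        limits:
          cpu: "2"
          memory: 8Gi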

6.3. LM-Eval scenarios

The following procedures outline example scenarios that can be useful for an LM-Eval setup.

6.3.1. Configuring the LM-Eval environment

If the LMEvalJob needs to access a model on Hugging Face with an access token, you can set HF_TOKEN as an environment variable for the lm-eval container.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.

Procedure

  1. To start an evaluation job for a Hugging Face model, apply the following YAML file:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      model: hf
      modelArgs:
      - name: pretrained
        value: huggingfacespace/model
      taskList:
        taskNames:
        - unfair_tos
      logSamples: true
      pod:
        container:
          env:
          - name: HF_TOKEN
            value: "My HuggingFace token"
  2. (Optional) You can also create a secret to store the token, and then reference the token key from the secret by using a secretKeyRef object with the following syntax (a command for creating such a secret is sketched after this procedure):

    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: my-secret
            key: hf-token
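A secret such as the one referenced above can be created with a standard oc command. This is a minimal sketch; the secret name my-secret and the key hf-token match the snippet above, and the project name is a placeholder:

oc create secret generic my-secret \
  --from-literal=hf-token=<your-hugging-face-token> \
  -n <your-data-science-project>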

6.3.2. Using a custom Unitxt card

You can run evaluations using custom Unitxt cards. To do this, include the custom Unitxt card in JSON format within the LMEvalJob YAML.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.

Procedure

  1. Pass a custom Unitxt Card in JSON format:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      model: hf
      modelArgs:
      - name: pretrained
        value: google/flan-t5-base
      taskList:
        taskRecipes:
        - template: "templates.classification.multi_class.relation.default"
          card:
            custom: |
              {
                "__type__": "task_card",
                "loader": {
                  "__type__": "load_hf",
                  "path": "glue",
                  "name": "wnli"
                },
                "preprocess_steps": [
                  {
                    "__type__": "split_random_mix",
                    "mix": {
                      "train": "train[95%]",
                      "validation": "train[5%]",
                      "test": "validation"
                    }
                  },
                  {
                    "__type__": "rename",
                    "field": "sentence1",
                    "to_field": "text_a"
                  },
                  {
                    "__type__": "rename",
                    "field": "sentence2",
                    "to_field": "text_b"
                  },
                  {
                    "__type__": "map_instance_values",
                    "mappers": {
                      "label": {
                        "0": "entailment",
                        "1": "not entailment"
                      }
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "classes": [
                        "entailment",
                        "not entailment"
                      ]
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "type_of_relation": "entailment"
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "text_a_type": "premise"
                    }
                  },
                  {
                    "__type__": "set",
                    "fields": {
                      "text_b_type": "hypothesis"
                    }
                  }
                ],
                "task": "tasks.classification.multi_class.relation",
                "templates": "templates.classification.multi_class.relation.all"
              }
      logSamples: true
  2. Inside the custom card, specify the Hugging Face dataset loader:

    "loader": {
                  "__type__": "load_hf",
                  "path": "glue",
                  "name": "wnli"
                },
  3. (Optional) You can use other Unitxt loaders (found on the Unitxt website) together with the volumes and volumeMounts parameters to mount a dataset from a persistent volume. For example, if you use the Unitxt LoadCSV loader, mount the files into the container so that the dataset is accessible to the evaluation process, as shown in the sketch after this procedure.
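
The following minimal sketch shows that approach, assuming a PVC named my-dataset-pvc that already contains the CSV files; the volume name and mount path are illustrative:

# excerpt of an LMEvalJob spec; model, taskList, and other fields are as in step 1
spec:
  pod:
    container:
      volumeMounts:
      - name: dataset-volume
        mountPath: /data            # the custom card's loader reads files from this path
    volumes:
    - name: dataset-volume
      persistentVolumeClaim:
        claimName: my-dataset-pvc   # PVC that contains the CSV dataset

Inside the custom card, the loader then points at the mounted files. The load_csv type name and files field shown here follow the same naming convention as the load_hf loader above; check the Unitxt documentation for the exact schema of the loader you use:

"loader": {
  "__type__": "load_csv",
  "files": {
    "train": "/data/train.csv"
  }
},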

6.3.3. Using PVCs as storage

To use a PVC as storage for the LMEvalJob results, you can use either managed PVCs or existing PVCs. Managed PVCs are managed by the TrustyAI operator. Existing PVCs are created by the end user before the LMEvalJob is created.

Note

If both managed and existing PVCs are referenced in outputs, the TrustyAI operator defaults to the managed PVC.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.

6.3.3.1. Managed PVCs

To create a managed PVC, specify its size. The managed PVC is named <job-name>-pvc and is available after the job finishes. When the LMEvalJob is deleted, the managed PVC is also deleted.

Procedure

  • Enter the following code:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      # other fields omitted ...
      outputs:
        pvcManaged:
          size: 5Gi

Notes on the code

  • outputs is the section for specifying custom storage locations
  • pvcManaged creates an operator-managed PVC
  • size (compatible with standard PVC syntax) is the only supported field
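
After the job reaches the Complete state, you can confirm that the managed PVC was created; the name follows the <job-name>-pvc pattern, so for the evaljob-sample job:

oc get pvc evaljob-sample-pvc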

6.3.3.2. Existing PVCs

To use an existing PVC, pass its name as a reference. The PVC must exist when you create the LMEvalJob. Because the PVC is not managed by the TrustyAI operator, it remains available after you delete the LMEvalJob.

Procedure

  1. Create a PVC. An example is the following:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: "my-pvc"
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
  2. Reference the new PVC from the LMEvalJob.

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob-sample
    spec:
      # other fields omitted ...
      outputs:
        pvcName: "my-pvc"
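After the evaluation job completes, one way to inspect the files it wrote to the PVC is to mount the PVC into a temporary pod. This is a minimal sketch; the pod name, image, and mount path are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: lmeval-results-browser
spec:
  containers:
  - name: browser
    image: registry.access.redhat.com/ubi9/ubi-minimal  # any small image with a shell works
    command: ["sleep", "3600"]
    volumeMounts:
    - name: results
      mountPath: /results
  volumes:
  - name: results
    persistentVolumeClaim:
      claimName: my-pvc

With the pod running, oc exec lmeval-results-browser -- ls /results lists the files that the job produced.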

6.3.4. Using an InferenceService

To run an evaluation job on an InferenceService that is already deployed and running in your namespace, define your LMEvalJob CR, and then apply the CR in the same namespace as your model.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.
  • You have a namespace that contains an InferenceService with a vLLM model. This example assumes that the vLLM model is already deployed in your cluster.

Procedure

  1. Define your LMEvalJob CR:

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: LMEvalJob
    metadata:
      name: evaljob
    spec:
      model: local-completions
      taskList:
        taskNames:
          - mmlu
      logSamples: true
      batchSize: 1
      modelArgs:
        - name: model
          value: granite
        - name: base_url
          value: $ROUTE_TO_MODEL/v1/completions
        - name: num_concurrent
          value:  "1"
        - name: max_retries
          value:  "3"
        - name: tokenized_requests
          value: "False"
        - name: tokenizer
          value: ibm-granite/granite-7b-instruct
      pod:
        container:
          env:
          - name: OPENAI_TOKEN
            valueFrom:
              secretKeyRef:
                name: <secret-name>
                key: token
  2. Apply this CR into the same namespace as your model.

Verification

A pod called evaljob spins up in your model namespace. In the pod terminal, you can view the output by running tail -f output/stderr.log.

Notes on the code

  • base_url should be set to the route/service URL of your model. Make sure to include the /v1/completions endpoint in the URL.
  • env.valueFrom.secretKeyRef.name should point to a secret that contains a token that can authenticate to your model. secretKeyRef.name should be the secret’s name in the namespace, while secretKeyRef.key should point at the token’s key within the secret.
  • secretKeyRef.name can equal the output of:

    oc get secrets -o custom-columns=SECRET:.metadata.name --no-headers | grep user-one-token
  • secretKeyRef.key is set to token
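
To fill in the $ROUTE_TO_MODEL placeholder in base_url, you can read the URL from the InferenceService status. This is a sketch, assuming the InferenceService in your namespace is named granite:

oc get inferenceservice granite -o jsonpath='{.status.url}'

Append /v1/completions to the returned URL when you set base_url.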