Chapter 6. Evaluating large language models
A large language model (LLM) is a type of artificial intelligence (AI) program that is designed for natural language processing tasks, such as recognizing and generating text.
As a data scientist, you might want to monitor your large language models against a range of metrics to ensure the accuracy and quality of their output. Capabilities such as summarization, language toxicity, and question-answering accuracy can be assessed to inform and improve your model parameters.
Red Hat OpenShift AI now offers Language Model Evaluation as a Service (LM-Eval-aaS), in a feature called LM-Eval. LM-Eval provides a unified framework to test generative language models on a wide range of evaluation tasks.
The following sections show you how to create an LMEvalJob custom resource (CR) that starts an evaluation job and generates an analysis of your model's performance.
6.1. Setting up LM-Eval
LM-Eval, integrated into the TrustyAI Operator, is a service designed for evaluating large language models.
The service is built on top of two open-source projects:
- LM Evaluation Harness, developed by EleutherAI, which provides a comprehensive framework for evaluating language models
- Unitxt, a tool that enhances the evaluation process with additional functionalities
The following information explains how to create an LMEvalJob custom resource (CR) to initiate an evaluation job and get the results.
Global settings for LM-Eval
Configurable global settings for LM-Eval services are stored in the TrustyAI operator global ConfigMap, named trustyai-service-operator-config, which is located in the same namespace as the operator.
You can configure the following properties for LM-Eval:
| Property | Default | Description |
|---|---|---|
| lmes-detect-device |  | Detect if there are GPUs available and assign a value for the device argument of the evaluation job accordingly. |
| lmes-pod-image |  | The image for the LM-Eval job. The image contains the Python packages for LM Evaluation Harness and Unitxt. |
| lmes-driver-image |  | The image for the LM-Eval driver. For detailed information about the driver, see the TrustyAI service operator repository on GitHub. |
| lmes-image-pull-policy |  | The image-pulling policy when running the evaluation job. |
| lmes-default-batch-size | 8 | The default batch size when invoking the model inference API. The default batch size is only available for local models. |
| lmes-max-batch-size | 24 | The maximum batch size that users can specify in an evaluation job. |
| lmes-pod-checking-interval | 10s | The interval to check the job pod for an evaluation job. |
| lmes-allow-online | true | Whether LMEval jobs can set the online mode to true. |
| lmes-allow-code-execution | true | Determines whether LMEval jobs can set allowCodeExecution to true. |
After updating the settings in the ConfigMap, restart the operator to apply the new values.
The allowOnline setting is disabled by default at the operator level in Red Hat OpenShift AI, as using allowOnline gives the job permissions to automatically download artifacts from external sources.
Enabling allowOnline mode
To enable allowOnline mode, patch the TrustyAI operator ConfigMap with the following code:
kubectl patch configmap trustyai-service-operator-config -n redhat-ods-applications \
--type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'
Then restart the TrustyAI operator with:
kubectl rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications
6.2. LM-Eval evaluation job
The LM-Eval service defines a new Custom Resource Definition (CRD) called LMEvalJob. An LMEvalJob object represents an evaluation job. LMEvalJob objects are monitored by the TrustyAI Kubernetes operator.
To run an evaluation job, create an LMEvalJob object with the following information: model, model arguments, task, and secret.
After the LMEvalJob is created, the LM-Eval service runs the evaluation job. The status and results of the LMEvalJob object update when the information is available.
Other TrustyAI features (such as bias and drift metrics) do not support non-tabular models (including LLMs). Deploying the TrustyAIService custom resource (CR) in a namespace that contains non-tabular models (such as the namespace where an evaluation job is being executed) can cause errors within the TrustyAI service.
Sample LMEvalJob object
The sample LMEvalJob object contains the following features:
- The google/flan-t5-base model from Hugging Face.
- The dataset from the wnli card, a subset of the GLUE (General Language Understanding Evaluation) benchmark evaluation framework from Hugging Face. For more information about the wnli Unitxt card, see the Unitxt website.
- The following default parameters for the multi_class.relation Unitxt task: f1_micro, f1_macro, and accuracy. This template can be found on the Unitxt website: click Catalog, then click Tasks and select Classification from the menu.
The following is an example of an LMEvalJob object:
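A minimal sketch of this object, assuming the trustyai.opendatahub.io/v1alpha1 API version; the card and template names follow the Unitxt catalog entries described above:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  # Hugging Face model provider from the LM Evaluation Harness
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      # wnli subset of GLUE, evaluated with the multi_class.relation task template
      - card:
          name: "cards.wnli"
        template: "templates.classification.multi_class.relation.default"
  logSamples: true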
After you apply the sample LMEvalJob, check its state by using the following command:
oc get lmevaljob evaljob-sample
Output similar to the following appears:
NAME             STATE
evaljob-sample   Running
Evaluation results are available when the state of the object changes to Complete. Both the model and dataset in this example are small. The evaluation job should finish within 10 minutes on a CPU-only node.
Use the following command to get the results:
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
-o template --template={{.status.results}} | jq '.results'
The command returns results similar to the following example:
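The exact fields depend on the task; a truncated, illustrative sketch using the scores discussed below might look like the following (the task alias and metric key suffixes are assumptions):
{
  "tr_0": {
    "alias": "tr_0",
    "f1_micro,none": 0.56,
    "f1_macro,none": 0.36,
    "accuracy,none": 0.56
  }
}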
Notes on the results
- The f1_micro, f1_macro, and accuracy scores are 0.56, 0.36, and 0.56.
- The full results are stored in the .status.results of the LMEvalJob object as a JSON document.
- The command above only retrieves the results field of the JSON document.
LMEvalJob properties
The following table lists each property in the LMEvalJob and its usage:
| Parameter | Description |
|---|---|
| model | Specifies which model type or provider is evaluated. This field directly maps to the --model argument of the LM Evaluation Harness. |
| modelArgs | A list of paired name and value arguments for the model type. Each model type or provider supports different arguments. You can find further details in the models section of the LM Evaluation Harness library on GitHub. |
| taskList.taskNames | Specifies a list of tasks supported by the LM Evaluation Harness. |
| taskList.taskRecipes | Specifies the task using the Unitxt recipe format. |
| numFewShot | Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, do not use this field. Use the numDemos setting in the task recipe instead. |
| limit | Set a limit to run the tasks instead of running the entire dataset. Accepts either an integer or a float between 0.0 and 1.0. |
| genArgs | Maps to the --gen_kwargs parameter of the LM Evaluation Harness. |
| logSamples | If this flag is passed, then the model outputs and the text fed into the model are saved at per-document granularity. |
| batchSize | Specifies the batch size for the evaluation in integer format. The auto batch size is not supported. |
| pod | Specifies extra information for the evaluation job's pod, such as environment variables, volumes, and resources for the lm-eval container. |
| outputs | Defines a custom output location to store the evaluation results. Only Persistent Volume Claims (PVCs) are supported. |
| outputs.pvcManaged | Creates an operator-managed PVC to store the job results. The PVC is named <job-name>-pvc and is deleted when the LMEvalJob is deleted. |
| outputs.pvcName | Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job. |
| allowOnline | If this parameter is set to true, the evaluation job can download artifacts such as models, datasets, and tokenizers from external sources. |
| allowCodeExecution | If this parameter is set to true, the evaluation job can execute code provided by remote datasets or models. |
| offline | Mount a PVC as the local storage for models and datasets. |
6.3. LM-Eval scenarios
The following procedures outline example scenarios that can be useful for an LM-Eval setup.
6.3.1. Configuring the LM-Eval environment
If the LMEvalJob needs to access a model on Hugging Face with an access token, you can set HF_TOKEN as one of the environment variables for the lm-eval container.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.
Procedure
- To start an evaluation job for a huggingface model, apply a YAML file that sets the HF_TOKEN environment variable for the lm-eval container, as shown in the first sketch after this procedure.
- (Optional) You can also create a secret to store the token, then refer to the key from the secretKeyRef object, as shown in the second sketch after this procedure.
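The following minimal sketches reuse the sample job from Section 6.2 and assume the trustyai.opendatahub.io/v1alpha1 API version; the token value, secret name (my-secret), and key (hf-token) are placeholders:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - card:
          name: "cards.wnli"
        template: "templates.classification.multi_class.relation.default"
  logSamples: true
  pod:
    container:
      env:
        # Hugging Face access token passed directly to the lm-eval container
        - name: HF_TOKEN
          value: "My HuggingFace token"
To avoid embedding the token in the CR, store it in a secret and reference it instead:
  pod:
    container:
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: my-secret   # name of the secret that stores the token (placeholder)
              key: hf-token     # key within the secret that holds the token (placeholder)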
6.3.2. Using a custom Unitxt card
You can run evaluations using custom Unitxt cards. To do this, include the custom Unitxt card in JSON format within the LMEvalJob YAML.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.
Procedure
- Pass a custom Unitxt card in JSON format, as shown in the sketch after this procedure.
- Inside the custom card, specify the Hugging Face dataset loader:

  "loader": {
      "__type__": "load_hf",
      "path": "glue",
      "name": "wnli"
  },
- (Optional) You can use other Unitxt loaders (found on the Unitxt website) together with the volumes and volumeMounts parameters to mount the dataset from persistent volumes. For example, if you use the LoadCSV Unitxt command, mount the files to the container to make the dataset accessible for the evaluation process.
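A minimal sketch of an LMEvalJob with an inline custom card; it assumes the trustyai.opendatahub.io/v1alpha1 API version and the card.custom field for passing the JSON, and the card body is abbreviated to the loader plus illustrative task and template references:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - template: "templates.classification.multi_class.relation.default"
        card:
          # Custom Unitxt card passed inline as a JSON document
          custom: |
            {
              "__type__": "task_card",
              "loader": {
                "__type__": "load_hf",
                "path": "glue",
                "name": "wnli"
              },
              "task": "tasks.classification.multi_class.relation",
              "templates": "templates.classification.multi_class.relation.all"
            }
  logSamples: true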
6.3.3. Using PVCs as storage
To use a PVC as storage for the LMEvalJob results, you can use either managed PVCs or existing PVCs. Managed PVCs are managed by the TrustyAI operator. Existing PVCs are created by the end user before the LMEvalJob is created.
If both managed and existing PVCs are referenced in outputs, the TrustyAI operator defaults to the managed PVC.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.
6.3.3.1. Managed PVCs
To create a managed PVC, specify its size. The managed PVC is named <job-name>-pvc and is available after the job finishes. When the LMEvalJob is deleted, the managed PVC is also deleted.
Procedure
- In your LMEvalJob, add an outputs section with a pvcManaged entry, as shown in the sketch below.
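A minimal sketch, reusing the sample job from Section 6.2; the 5Gi size is illustrative:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - card:
          name: "cards.wnli"
        template: "templates.classification.multi_class.relation.default"
  logSamples: true
  outputs:
    # The operator creates and manages a PVC named evaljob-sample-pvc
    pvcManaged:
      size: 5Gi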
Notes on the code
- outputs is the section for specifying custom storage locations.
- pvcManaged creates an operator-managed PVC.
- size (compatible with standard PVC syntax) is the only supported value.
6.3.3.2. Existing PVCs
To use an existing PVC, pass its name as a reference. The PVC must exist when you create the LMEvalJob. The PVC is not managed by the TrustyAI operator, so it remains available after the LMEvalJob is deleted.
Procedure
- Create a PVC, as shown in the first sketch after this procedure.
- Reference the new PVC from the LMEvalJob by setting the outputs.pvcName field, as shown in the second sketch after this procedure.
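Minimal sketches for both steps; the PVC name (my-pvc), size, and access mode are placeholders, and the job reuses the sample from Section 6.2:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
Then reference the PVC from the LMEvalJob by name:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - card:
          name: "cards.wnli"
        template: "templates.classification.multi_class.relation.default"
  logSamples: true
  outputs:
    # Existing PVC; it is not deleted when the LMEvalJob is deleted
    pvcName: my-pvc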
6.3.4. Using an InferenceService
To run an evaluation job on an InferenceService which is already deployed and running in your namespace, define your LMEvalJob CR, then apply this CR into the same namespace as your model.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- Your OpenShift cluster administrator has installed OpenShift AI and enabled the TrustyAI service for the data science project where the models are deployed.
- You have a namespace that contains an InferenceService with a vLLM model. This example assumes that the vLLM model is already deployed in your cluster.
Procedure
- Define your LMEvalJob CR, as shown in the sketch after this procedure.
- Apply this CR into the same namespace as your model.
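A minimal sketch, assuming the trustyai.opendatahub.io/v1alpha1 API version, the local-completions provider from the LM Evaluation Harness, and an mmlu task; the OPENAI_API_KEY environment variable name, model name, route URL, tokenizer, and secret name are assumptions or placeholders:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob
spec:
  # Query an OpenAI-compatible completions endpoint served by the InferenceService
  model: local-completions
  taskList:
    taskNames:
      - mmlu
  logSamples: true
  modelArgs:
    - name: model
      value: <model-name>
    - name: base_url
      # Route or service URL of the model, including the /v1/completions endpoint
      value: <route-to-model>/v1/completions
    - name: num_concurrent
      value: "1"
    - name: max_retries
      value: "3"
    - name: tokenized_requests
      value: "False"
    - name: tokenizer
      value: <tokenizer-name>
  pod:
    container:
      env:
        # Token used to authenticate to the model endpoint
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: <secret-name>
              key: token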
Verification
A pod called evaljob spins up in your model namespace. In the pod terminal, you can see the output by running tail -f output/stderr.log.
Notes on the code
- base_url should be set to the route or service URL of your model. Make sure to include the /v1/completions endpoint in the URL.
- env.valueFrom.secretKeyRef.name should point to a secret that contains a token that can authenticate to your model. secretKeyRef.name should be the secret's name in the namespace, while secretKeyRef.key should point at the token's key within the secret.
- secretKeyRef.name can equal the output of:

  oc get secrets -o custom-columns=SECRET:.metadata.name --no-headers | grep user-one-token

- secretKeyRef.key is set to token.