Chapter 5. Evaluate LLMs with EvalHub


Use EvalHub to evaluate your large language models against standardized benchmarks, track results with MLflow, and manage evaluation workflows across multiple tenants.

5.1. Understanding EvalHub

EvalHub is an evaluation orchestration service for large language models (LLMs) on Red Hat OpenShift AI. It provides a versioned REST API for submitting evaluation jobs, managing benchmark providers, and tracking results through MLflow experiment tracking. Each evaluation runs as an isolated Job, enabling parallel execution and horizontal scalability across namespaces and tenants.

EvalHub consists of three components:

  • EvalHub Server — A REST API service that handles evaluation workflows, job orchestration, and provider management, with PostgreSQL storage.
  • EvalHub SDK and CLI — A Python client library and command-line tool for submitting evaluations and building framework adapters. The CLI provides the evalhub command for interacting with EvalHub from the terminal.
  • Providers — Evaluation framework adapters packaged as container images. Each provider translates EvalHub job requests into evaluation framework-specific commands and reports results back to the server.

5.1.1. Core concepts

The following concepts are central to EvalHub.

Providers
A provider represents an evaluation framework, such as lm_evaluation_harness, garak, guidellm, or lighteval. Each provider includes a set of benchmarks. EvalHub includes built-in providers that are read-only.
Benchmarks
A benchmark is a specific evaluation task within a provider. For example, the lm_evaluation_harness provider includes benchmarks such as mmlu, hellaswag, arc_challenge, and gsm8k. Each benchmark has a category such as math, reasoning, safety, or code, along with metrics and optional pass criteria.
Collections
A collection groups benchmarks from one or more providers into a reusable evaluation suite. For example, a safety-and-fairness-v1 collection might combine safety benchmarks from lm_evaluation_harness with vulnerability scans from garak.
Pass criteria and thresholds

Pass criteria define the minimum score that a benchmark or job must achieve to pass. Thresholds can be set at three levels, from most to least specific:

  1. Benchmark level — You set a benchmark-level threshold per benchmark in a job submission or collection definition. This overrides all other thresholds.
  2. Collection level — A collection-level threshold applies to all benchmarks in the collection that do not have their own threshold.
  3. Provider level — A provider-level threshold is the default threshold defined in the provider’s benchmark configuration.

Each benchmark declares a primary score metric, such as acc_norm or toxicity_score, and optionally a lower_is_better flag. When lower_is_better is false (the default), the benchmark passes if the score is greater than or equal to the threshold. When lower_is_better is true, it passes if the score is less than or equal to the threshold.

Each benchmark in a collection or job can be assigned a weight that controls its relative importance in the overall score. At the job level, EvalHub computes a weighted average of all benchmark primary scores and compares it against the job-level threshold to determine an overall pass or fail result.
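
The following sketch illustrates this scoring logic in plain Python. It is not part of the EvalHub API; the function names, the example values, and the default weight of 1 are invented for illustration.

def benchmark_passes(score: float, threshold: float, lower_is_better: bool = False) -> bool:
    # A benchmark passes when its primary score meets the threshold in the
    # direction given by lower_is_better.
    return score <= threshold if lower_is_better else score >= threshold

def job_overall_score(benchmarks: list[dict]) -> float:
    # Weighted average of benchmark primary scores; a missing weight counts as 1 here.
    total_weight = sum(b.get("weight", 1) for b in benchmarks)
    return sum(b["score"] * b.get("weight", 1) for b in benchmarks) / total_weight

# Example: two acc_norm scores with unequal weights and a job-level threshold of 0.7.
results = [
    {"id": "mmlu", "score": 0.68, "weight": 2},
    {"id": "hellaswag", "score": 0.75, "weight": 1},
]
overall = job_overall_score(results)  # (0.68 * 2 + 0.75 * 1) / 3 ≈ 0.70
print(overall, benchmark_passes(overall, threshold=0.7))  # passes: 0.70 >= 0.7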

Evaluation jobs
An evaluation job represents a single evaluation run against a model. A job references either a list of benchmarks or a collection, a model endpoint, and optional MLflow experiment configuration. Jobs progress through states: pending, running, completed, failed, cancelled, or partially_failed.
Adapters
An adapter wraps an evaluation framework, such as lm_evaluation_harness, and implements the FrameworkAdapter interface so that EvalHub can orchestrate the evaluation. Adapters are packaged as Red Hat Universal Base Image 9 (UBI9) container images.

5.2. EvalHub architecture overview

When you submit an evaluation job, EvalHub follows this workflow:

  1. The client submits a job through the REST API or SDK.
  2. The server validates the request, resolves benchmarks, and persists the job with a status of pending.
  3. The runtime creates a Kubernetes Job for each benchmark. Each Job pod contains two containers:

    • The adapter container runs the evaluation framework. Adapters are provider-specific container images that implement a standard interface, translating the job specification into the evaluation framework-specific invocations and returning structured results.
    • The sidecar proxy container authenticates to the EvalHub server using a ServiceAccount token and forwards status events and results from the adapter. The sidecar also proxies authenticated requests to MLflow and OCI registries when configured. This design keeps credentials out of the adapter container, which can run custom user-provided code.
  4. The adapter runs the evaluation and reports status events back to EvalHub through the sidecar.
  5. The server aggregates and stores the results. If MLflow integration is enabled, the server also logs the results to MLflow.

5.3. Deploy EvalHub with the TrustyAI Operator

Deploy EvalHub through the TrustyAI Operator as part of your OpenShift AI deployment.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You have installed the OpenShift CLI (oc) version 4.12 or later.
  • You have the TrustyAI component in your OpenShift AI DataScienceCluster set to Managed.
  • You have configured KServe to use RawDeployment mode.

Procedure

  1. Create a Secret containing the PostgreSQL connection string. The Secret must contain a db-url key with a valid PostgreSQL connection URI:

    apiVersion: v1
    kind: Secret
    metadata:
      name: evalhub-db-credentials
    type: Opaque
    stringData:
      db-url: "postgres://evalhub:changeme@postgresql.evalhub.svc.cluster.local:5432/evalhub"
    Note

    Replace the hostname, the credentials (including the changeme placeholder), and the database name to match your PostgreSQL deployment.

    $ oc apply -f evalhub-db-credentials.yaml -n <namespace>
  2. Create an EvalHub custom resource to deploy the service:

    Example evalhub_cr.yaml

    apiVersion: trustyai.opendatahub.io/v1alpha1
    kind: EvalHub
    metadata:
      name: evalhub
    spec:
      replicas: 1
      database:
        type: postgresql
        secret: evalhub-db-credentials
      providers:
        - lm_evaluation_harness
        - garak
        - guidellm
      collections:
        - safety-and-fairness-v1
      env:
        - name: MLFLOW_TRACKING_URI
          value: "http://mlflow.mlflow.svc.cluster.local:5000"

    Table 5.1. EvalHub custom resource parameters

    replicas
    The number of EvalHub pods to create.
    database.type
    Storage backend. Set to postgresql for PostgreSQL.
    database.secret
    Name of a Secret containing the PostgreSQL connection string.
    providers
    List of evaluation provider configurations to load at startup.
    collections
    List of benchmark collections to load at startup.
    otel
    Optional: OpenTelemetry exporter configuration for traces and metrics.
    env
    Environment variables to set in the EvalHub deployment containers.

  3. Apply the custom resource to the cluster:

    $ oc apply -f evalhub_cr.yaml -n <namespace>
    Note

    Use a dedicated namespace for EvalHub rather than redhat-ods-applications. The redhat-ods-applications namespace has NetworkPolicies that restrict cross-namespace traffic, which requires additional labeling on tenant namespaces. For more information, see Section 5.23, “Set up a tenant namespace”.

The TrustyAI Operator automatically reconciles the EvalHub custom resource in your namespace.

Verification

  1. Confirm that the EvalHub pod is running:

    $ oc get pods -l app=eval-hub -n <namespace>

    Example output

    NAME                       READY   STATUS    RESTARTS   AGE
    evalhub-7b9f4c6d88-x2k4p  1/1     Running   0          2m

  2. Query the health endpoint:

    $ export EVALHUB_URL=https://$(oc get routes evalhub -o jsonpath='{.spec.host}' -n <namespace>)
    $ curl $EVALHUB_URL/api/v1/health | jq .

    Example response

    {
      "status": "healthy",
      "timestamp": "2026-04-13T10:00:00Z",
      "version": "0.3.0",
      "uptime": 3600000000000,
      "active_evaluations": 0
    }

  3. Install the EvalHub Python SDK to interact with the server. To install the SDK client library, run the following command:

    $ pip install "eval-hub-sdk[client]"

    To also include the CLI, run the following command:

    $ pip install "eval-hub-sdk[cli]"

5.4. EvalHub multi-tenancy

EvalHub is a multi-tenant service. All API requests, except requests to /api/v1/health, must include the X-Tenant header, which identifies the target namespace. Resources such as jobs, providers, and collections are scoped to the tenant specified in this header. For information about setting up tenant namespaces and granting access, see Section 5.22, “EvalHub multi-tenancy and RBAC”.

When using curl, include the -H "X-Tenant: <namespace>" header in each request.

When using the Python SDK, set the tenant at client initialization:

from evalhub import SyncEvalHubClient

client = SyncEvalHubClient(
    base_url="https://evalhub.example.com",
    tenant="my-namespace"
)

When using the CLI, configure the tenant in your connection profile. The CLI stores connection settings in named profiles at ~/.config/evalhub/config.yaml. Settings are persistent across commands. Use --profile <name> to override the active profile at runtime.

$ evalhub config set tenant my-namespace

All API requests must also include an Authorization: Bearer $TOKEN header. The curl examples in this guide assume you have stored the EvalHub route URL in the EVALHUB_URL environment variable and a valid bearer token in the TOKEN environment variable. For information about obtaining the route URL, see Section 5.3, “Deploy EvalHub with the TrustyAI Operator”. For information about obtaining a bearer token, see Section 5.24, “Grant access to EvalHub”.
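
For reference, the following sketch shows what an authenticated request looks like from Python by using the requests library. It mirrors the curl examples in this guide and reads the same EVALHUB_URL and TOKEN environment variables; it is an illustration, not part of the EvalHub SDK.

import os

import requests

base_url = os.environ["EVALHUB_URL"]
headers = {
    "Authorization": f"Bearer {os.environ['TOKEN']}",
    "X-Tenant": "my-namespace",  # target tenant namespace
}

# The health endpoint does not require the X-Tenant header.
print(requests.get(f"{base_url}/api/v1/health", timeout=30).json())

# Tenant-scoped endpoints require both headers.
response = requests.get(f"{base_url}/api/v1/evaluations/jobs", headers=headers, timeout=30)
response.raise_for_status()
print(response.json())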

5.5. List EvalHub providers and benchmarks

List the evaluation providers and benchmarks registered in EvalHub to see which evaluation frameworks and tasks are available for your jobs. You can list providers using the REST API, Python SDK, or CLI.

Prerequisites

  • You have a running EvalHub instance.

Procedure

  1. List all registered providers:

    $ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" $EVALHUB_URL/api/v1/evaluations/providers | jq .

    Example output

    {
      "items": [
        {
          "resource": { "id": "lm_evaluation_harness", "owner": "system" },
          "name": "lm_evaluation_harness",
          "title": "LM Evaluation Harness",
          "benchmarks": [ ... ]
        },
        {
          "resource": { "id": "garak", "owner": "system" },
          "name": "garak",
          "title": "Garak",
          "benchmarks": [ ... ]
        }
      ]
    }

  2. Get a specific provider with its benchmarks:

    $ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" $EVALHUB_URL/api/v1/evaluations/providers/lm_evaluation_harness | jq .

    Example output

    {
      "resource": { "id": "lm_evaluation_harness", "owner": "system" },
      "name": "lm_evaluation_harness",
      "title": "LM Evaluation Harness",
      "benchmarks": [
        { "id": "mmlu", "name": "MMLU", "category": "reasoning" },
        { "id": "hellaswag", "name": "HellaSwag", "category": "reasoning" },
        { "id": "arc_challenge", "name": "ARC Challenge", "category": "reasoning" },
        ...
      ]
    }

Alternatively, use the Python SDK:

from evalhub.client import SyncEvalHubClient

client = SyncEvalHubClient(
    base_url="https://evalhub.example.com",
    tenant="my-namespace"
)

for provider in client.providers.list():
    print(f"{provider.resource.id}: {provider.name}")

benchmarks = client.benchmarks.list(provider_id="lm_evaluation_harness")
for b in benchmarks:
    print(f"  {b.id}: {b.name}")

Example output

lm_evaluation_harness: LM Evaluation Harness
garak: Garak
guidellm: GuideLLM
  mmlu: Massive Multitask Language Understanding
  hellaswag: HellaSwag
  gsm8k: Grade School Math 8K
  ...

Alternatively, use the CLI:

$ evalhub providers list

Example output

 ID                     NAME                   DESCRIPTION                              BENCHMARKS
 lm_evaluation_harness  LM Evaluation Harness  EleutherAI language model evaluation     167
 garak                  Garak                  LLM vulnerability and safety scanner     12
 guidellm               GuideLLM               Performance benchmarking                 4

To get details for a specific provider:

$ evalhub providers describe lm_evaluation_harness

Example output

Provider: LM Evaluation Harness
ID:       lm_evaluation_harness
Description: EleutherAI language model evaluation framework

Benchmarks (167):
 ID             NAME                             CATEGORY             METRICS
 mmlu           Massive Multitask Language Und…   knowledge            acc, acc_norm
 hellaswag      HellaSwag                         reasoning            acc, acc_norm
 gsm8k          Grade School Math 8K              math                 exact_match
 arc_easy       ARC Easy                          reasoning            acc, acc_norm
 ...

Verification

  • Confirm that the provider list is not empty and includes the built-in providers enabled in your EvalHub deployment.

5.6. Submit an evaluation job

Submit an evaluation job in EvalHub by specifying a model endpoint and one or more benchmarks. EvalHub runs the benchmarks against the model and returns a job ID that you can use to track results.

Prerequisites

  • You have a running EvalHub instance.
  • You have a deployed model that exposes an inference endpoint reachable from the cluster. Most providers expect an OpenAI-compatible endpoint.

Procedure

  1. Submit a job by specifying the model endpoint and one or more benchmarks:

    $ curl -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -H "X-Tenant: <namespace>" \
      -d '{
        "model": {
          "url": "http://my-model.my-namespace.svc.cluster.local:8080/v1",
          "name": "my-model"
        },
        "benchmarks": [
          {
            "provider_id": "lm_evaluation_harness",
            "benchmark_id": "mmlu"
          },
          {
            "provider_id": "lm_evaluation_harness",
            "benchmark_id": "hellaswag"
          }
        ]
      }'
    Note

    Most providers expect the model URL to point to an OpenAI-compatible inference endpoint. The required URL format may vary depending on the provider. Check the provider documentation for specific requirements.

    The server returns a 202 Accepted response with the job resource, including a job ID for tracking.
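
If the submission fails, or the job later fails with connection errors, you can check that the model endpoint responds to OpenAI-compatible requests. The following sketch queries the standard /models route with the requests library; whether this route exists depends on your model server, so treat it as a quick check rather than a requirement.

import requests

model_url = "http://my-model.my-namespace.svc.cluster.local:8080/v1"

# OpenAI-compatible servers typically expose GET <base>/models.
# A 200 response listing models suggests the URL is suitable for most providers.
response = requests.get(f"{model_url}/models", timeout=30)
response.raise_for_status()
print([m["id"] for m in response.json().get("data", [])])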

Alternatively, use the Python SDK:

from evalhub.client import SyncEvalHubClient
from evalhub.models import JobSubmissionRequest, ModelConfig, BenchmarkConfig

client = SyncEvalHubClient(
    base_url="https://evalhub.example.com",
    tenant="my-namespace"
)

job = client.jobs.create(JobSubmissionRequest(
    model=ModelConfig(
        url="http://my-model.my-namespace.svc.cluster.local:8080/v1",
        name="my-model"
    ),
    benchmarks=[
        BenchmarkConfig(provider_id="lm_evaluation_harness", benchmark_id="mmlu"),
        BenchmarkConfig(provider_id="lm_evaluation_harness", benchmark_id="hellaswag"),
    ]
))

print(f"Job ID: {job.resource.id}")

Alternatively, use the CLI:

$ evalhub eval run \
    --name my-eval \
    --model-url http://my-model.my-namespace.svc.cluster.local:8080/v1 \
    --model-name my-model \
    --provider lm_evaluation_harness \
    -b mmlu -b hellaswag

You can also submit from a YAML config file:

$ evalhub eval run --config evaljob.yaml

Verification

  • Confirm the job is registered and check its status:

    $ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/jobs/<job_id> | jq .status.state

    The job status transitions from pending to running to completed.

    Alternatively, use the CLI:

    $ evalhub eval status <job_id>

    Alternatively, use the Python SDK:

    job = client.jobs.get(job_id)
    print(job.state)

5.7. Track evaluation jobs and results

Track the status of running evaluation jobs and retrieve results after completion. You can check individual jobs, list all jobs, and filter by status.

Prerequisites

  • You have submitted an evaluation job to EvalHub.
  • You have the job ID returned from the submission.

Procedure

  1. Check the status of a specific job:

    $ curl -s \
        -H "Authorization: Bearer $TOKEN" \
        -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/jobs/<job_id> | jq .

    Example response for a completed job

    {
      "resource": {
        "id": "<job_id>",
        "tenant": "<namespace>",
        "created_at": "2026-04-22T10:00:00Z"
      },
      "status": {
        "state": "completed",
        "benchmarks": [
          { "id": "mmlu", "provider_id": "lm_evaluation_harness", "status": "completed" },
          { "id": "hellaswag", "provider_id": "lm_evaluation_harness", "status": "completed" }
        ]
      },
      "results": {
        "benchmarks": [
          {
            "id": "mmlu",
            "provider_id": "lm_evaluation_harness",
            "metrics": { "acc": 0.65, "acc_norm": 0.68 }
          },
          {
            "id": "hellaswag",
            "provider_id": "lm_evaluation_harness",
            "metrics": { "acc": 0.72, "acc_norm": 0.75 }
          }
        ]
      },
      "name": "my-eval",
      "model": {
        "url": "http://my-model:8080/v1",
        "name": "my-model"
      },
      ...
    }

  2. After the job completes, retrieve the benchmark results:

    $ curl -s \
        -H "Authorization: Bearer $TOKEN" \
        -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/jobs/<job_id> | jq .results

    The results object contains benchmark scores, metrics, and pass/fail outcomes. If pass criteria are configured, the results include a test field with the overall score, threshold, and pass/fail status.

  3. List all jobs, optionally filtered by status:

    $ curl -s \
        -H "Authorization: Bearer $TOKEN" \
        -H "X-Tenant: <namespace>" \
        "$EVALHUB_URL/api/v1/evaluations/jobs?status=completed&limit=10" | jq .

    Table 5.2. Job query parameters

    limit (default: 50)
    Maximum number of results to return. The maximum allowed value is 100.
    offset (default: 0)
    Number of results to skip for pagination.
    status
    Filter by job state: pending, running, completed, failed, cancelled, partially_failed.
    name
    Filter by job name. Uses exact, case-sensitive matching.
    tags
    Filter by a single tag. Returns jobs that contain the specified tag in their tags list.
    owner
    Filter by the authenticated username of the job owner, for example system:serviceaccount:<namespace>:<name> for a ServiceAccount or the OpenShift username.
    experiment_id
    Filter by MLflow experiment ID.
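
The following sketch pages through completed jobs by combining the limit and offset parameters from Table 5.2. It uses the requests library and assumes that the job list response wraps results in an items array, as the provider list response does.

import os

import requests

base_url = os.environ["EVALHUB_URL"]
headers = {
    "Authorization": f"Bearer {os.environ['TOKEN']}",
    "X-Tenant": "my-namespace",
}

offset, limit = 0, 50
while True:
    response = requests.get(
        f"{base_url}/api/v1/evaluations/jobs",
        params={"status": "completed", "limit": limit, "offset": offset},
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    for job in items:
        print(job["resource"]["id"], job["status"]["state"])
    if len(items) < limit:
        break
    offset += limit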

Alternatively, use the CLI.

To watch a job’s status in real time, use the --watch flag. The CLI polls the job at regular intervals and displays benchmark progress until the job reaches a terminal state:

$ evalhub eval status --watch <job_id>

To retrieve formatted results after a job completes:

$ evalhub eval results <job_id> --format table

Example output

 BENCHMARK   PROVIDER                METRIC     VALUE
 mmlu        lm_evaluation_harness   acc        0.65
 mmlu        lm_evaluation_harness   acc_norm   0.68
 hellaswag   lm_evaluation_harness   acc        0.72
 hellaswag   lm_evaluation_harness   acc_norm   0.75

The --format flag supports table, json, yaml, and csv.

Alternatively, use the Python SDK.

To check the status of a specific job:

job = client.jobs.get(job_id)
print(f"State: {job.state}")

To wait for a job to complete:

result = client.jobs.wait_for_completion(job_id, timeout=3600, poll_interval=5.0)
for b in result.results.benchmarks:
    print(f"{b.id}: {b.metrics}")

To list jobs filtered by status:

from evalhub.models import JobStatus

completed_jobs = client.jobs.list(status=JobStatus.COMPLETED, limit=10)
for job in completed_jobs:
    print(f"{job.id}: {job.state}")

5.8. Cancel and delete jobs

Cancel a running evaluation job or permanently delete a job record from the database.

Prerequisites

  • You have submitted an evaluation job to EvalHub.
  • You have the job ID of the job to cancel or delete.
  • You have delete permissions on the evaluations virtual resource in the tenant namespace. For more information, see Section 5.24, “Grant access to EvalHub”.

Procedure

Run one of the following commands depending on whether you want to cancel or permanently delete the job:

  1. To cancel a running job with a soft delete, where the job is marked as cancelled but the record is preserved for auditing, run the following command:

    $ curl -X DELETE -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" $EVALHUB_URL/api/v1/evaluations/jobs/<job_id>
  2. To permanently delete a job record from the database, run the following command with the hard_delete query parameter:

    Warning

    The hard_delete operation permanently removes the job record from the database. This action cannot be undone, and the job results will no longer be available for auditing.

    $ curl -X DELETE -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" "$EVALHUB_URL/api/v1/evaluations/jobs/<job_id>?hard_delete=true"

For both soft and hard deletes, EvalHub cleans up associated Job and ConfigMap Kubernetes resources in the tenant namespace before updating or removing the record. The server returns 204 No Content on success.

Alternatively, use the CLI.

To cancel a running job with a soft delete:

$ evalhub eval cancel <job_id>

To permanently delete a job with a hard delete:

$ evalhub eval cancel <job_id> --hard

Alternatively, use the Python SDK.

To cancel a running job with a soft delete:

client.jobs.cancel(job_id)

To permanently delete a job with a hard delete:

client.jobs.cancel(job_id, hard_delete=True)

Verification

  1. For a soft delete, verify the job status is cancelled:

    $ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/jobs/<job_id> | jq .status.state

    Alternatively, use the CLI:

    $ evalhub eval status <job_id>

    Alternatively, use the Python SDK:

    job = client.jobs.get(job_id)
    print(job.state)
  2. For a hard delete, verify the job returns 404 Not Found:

    $ curl -s -o /dev/null -w "%{http_code}" \
        -H "Authorization: Bearer $TOKEN" \
        -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/jobs/<job_id>

    The CLI and Python SDK raise an error when retrieving a hard-deleted job, confirming that the record has been removed.
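
    The following SDK sketch shows that check, reusing the client and job_id from the previous examples. The specific exception class that the client raises is not documented here, so the example catches a generic Exception.

    try:
        client.jobs.get(job_id)
        print("Job record still exists")
    except Exception as err:  # the client's specific error type is not shown in this guide
        print(f"Job not found after hard delete: {err}")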

5.9. EvalHub built-in collections

EvalHub includes several built-in collections that group benchmarks from one or more providers into reusable evaluation suites. Each benchmark in a collection can have its own weight, primary score metric, and pass criteria threshold. For more information, see Section 5.1, “Understanding EvalHub”.

Table 5.3. Built-in collections

leaderboard-v2 (category: general)
Open LLM Leaderboard v2. Comprehensive evaluation suite for general-purpose language models.
Benchmarks: leaderboard_ifeval, leaderboard_bbh, leaderboard_gpqa, leaderboard_mmlu_pro, leaderboard_musr, leaderboard_math_hard

safety-and-fairness-v1 (category: safety)
Evaluates model safety, bias, and fairness across diverse scenarios.
Benchmarks: truthfulqa_mc1, toxigen, winogender, crows_pairs_english, bbq, ethics_cm

toxicity-and-ethical-principles (category: safety)
End-to-end safety assessment covering toxic content generation, tendency to produce false or misleading information, and alignment with ethical principles.
Benchmarks: toxigen, truthfulqa_mc1, hhh_alignment

Each built-in collection defines per-benchmark weights and thresholds. For example, the safety-and-fairness-v1 collection assigns higher weights to toxigen and ethics_cm (weight 3) than to winogender and crows_pairs_english (weight 1), which gives these benchmarks greater influence on the overall safety score.

5.10. Create a custom collection in EvalHub

Create a custom collection that groups benchmarks from one or more providers into a reusable evaluation job.

Prerequisites

  • You have a running EvalHub instance.

Procedure

  1. Create a collection:

    $ curl -X POST $EVALHUB_URL/api/v1/evaluations/collections \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -H "X-Tenant: <namespace>" \
      -d '{
        "name": "my-safety-suite",
        "category": "safety",
        "benchmarks": [
          {"provider_id": "lm_evaluation_harness", "benchmark_id": "truthfulqa_mc2"},
          {"provider_id": "garak", "benchmark_id": "owasp_llm_top_10"}
        ]
      }'

    Example response

    {
      "resource": {
        "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "tenant": "<namespace>",
        "created_at": "2026-04-22T10:00:00Z",
        "owner": "<username>"
      },
      "name": "my-safety-suite",
      "category": "safety",
      "benchmarks": [
        {"provider_id": "lm_evaluation_harness", "id": "truthfulqa_mc2"},
        {"provider_id": "garak", "id": "owasp_llm_top_10"}
      ]
    }

Alternatively, use the CLI with a YAML spec file:

my-safety-suite.yaml

name: my-safety-suite
category: safety
benchmarks:
  - provider_id: lm_evaluation_harness
    benchmark_id: truthfulqa_mc2
  - provider_id: garak
    benchmark_id: owasp_llm_top_10

$ evalhub collections create --file my-safety-suite.yaml

Alternatively, use the Python SDK:

collection = client.collections.create({
    "name": "my-safety-suite",
    "category": "safety",
    "benchmarks": [
        {"provider_id": "lm_evaluation_harness", "benchmark_id": "truthfulqa_mc2"},
        {"provider_id": "garak", "benchmark_id": "owasp_llm_top_10"}
    ]
})

Verification

  • Confirm the collection was created:

    $ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/collections/<collection_id> | jq .

    Alternatively, use the CLI:

    $ evalhub collections describe <collection_id>

    Alternatively, use the Python SDK:

    collection = client.collections.get(collection_id)

Using a collection in a job

After creating a collection, you can submit evaluation jobs that reference it. The following example shows a job submission using the created collection:

$ curl -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Tenant: <namespace>" \
  -d '{
    "model": {
      "url": "http://my-model.my-namespace.svc.cluster.local:8080/v1",
      "name": "my-model"
    },
    "collection": {
      "id": "<collection_id>"
    }
  }'

5.11. Configure API key authentication for model endpoints

Configure EvalHub to authenticate to a model endpoint using an API key stored as a Kubernetes Secret.

Prerequisites

  • You have the model endpoint URL.
  • You have the API key for your model endpoint.

Procedure

  1. Create a Secret containing your API key:

    model-auth.yaml

    apiVersion: v1
    kind: Secret
    metadata:
      name: model-auth
    type: Opaque
    stringData:
      api-key: "<api-key>"

  2. Apply the Secret to the tenant namespace:

    $ oc apply -f model-auth.yaml -n <namespace>

Verification

  • Confirm that the Secret was created and contains the expected api-key key:

    $ oc get secret model-auth -n <namespace> -o jsonpath='{.data}' | jq 'keys'

    The output should include api-key.

Next steps

When you submit an evaluation job, include an auth field in the model object to reference the Secret:

Example model configuration with API key authentication

"model": {
  "url": "http://my-model.my-namespace.svc.cluster.local:8080/v1",
  "name": "my-model",
  "auth": {
    "secret_ref": "model-auth"
  }
}

where:

secret_ref

Specifies the name of the Secret that contains the API key.
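
Putting this together, the following sketch submits a job whose model configuration references the model-auth Secret. It uses the requests library against the documented job submission endpoint; the model URL and benchmark are the same placeholders used elsewhere in this chapter.

import os

import requests

base_url = os.environ["EVALHUB_URL"]
headers = {
    "Authorization": f"Bearer {os.environ['TOKEN']}",
    "X-Tenant": "my-namespace",
}

body = {
    "model": {
        "url": "http://my-model.my-namespace.svc.cluster.local:8080/v1",
        "name": "my-model",
        "auth": {"secret_ref": "model-auth"},  # Secret created in this procedure
    },
    "benchmarks": [
        {"provider_id": "lm_evaluation_harness", "benchmark_id": "mmlu"},
    ],
}

response = requests.post(f"{base_url}/api/v1/evaluations/jobs", json=body, headers=headers, timeout=30)
response.raise_for_status()  # the server returns 202 Accepted
print(response.json()["resource"]["id"])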

5.12. Authenticate models with a ServiceAccount token

For models served with KServe and protected by kube-rbac-proxy, EvalHub can use automatic ServiceAccount token injection.

Procedure

  1. Create a RoleBinding granting the job ServiceAccount access to the model’s InferenceService.

For more information about creating a ServiceAccount and RoleBinding for model authentication, see Making authenticated inference requests in Deploying models with distributed inference.

5.13. Use custom data from S3 for EvalHub evaluations

You can load external test datasets from S3-compatible storage, such as MinIO or Amazon S3, before an evaluation runs. When configured, EvalHub schedules an init container that downloads the data to /test_data inside the Job pod. The adapter can then read the files from that path.

Note

This feature only applies when EvalHub runs benchmarks as Jobs. It does not apply to local-only evaluation runs.

Prerequisites

  • You have an S3-compatible storage endpoint with your test dataset already uploaded to a bucket.
  • You have the S3 credentials for your storage endpoint.

Procedure

  1. Create a Secret containing your S3 credentials:

    my-s3-credentials.yaml

    apiVersion: v1
    kind: Secret
    metadata:
      name: my-s3-credentials
      namespace: <namespace>
    type: Opaque
    stringData:
      AWS_ACCESS_KEY_ID: "<your-access-key>"
      AWS_SECRET_ACCESS_KEY: "<your-secret-key>"
      AWS_DEFAULT_REGION: "<your-region>"
      AWS_S3_ENDPOINT: "<your-s3-endpoint>"

    where:

    AWS_DEFAULT_REGION
    Specifies the region for your S3-compatible storage, for example us-east-1.
    AWS_S3_ENDPOINT

    Specifies the endpoint URL for your S3-compatible storage, for example https://minio.example.com:9000 for MinIO. For Amazon S3, you can omit this field or use the default AWS endpoint.

    $ oc apply -f my-s3-credentials.yaml
  2. When you submit an evaluation job, add a test_data_ref block to each benchmark that requires external data:

    Example S3 test data configuration in a job submission

    "benchmarks": [
      {
        "provider_id": "lm_evaluation_harness",
        "benchmark_id": "mmlu",
        "test_data_ref": {
          "s3": {
            "bucket": "my-eval-data",
            "key": "datasets/mmlu",
            "secret_ref": "my-s3-credentials"
          }
        }
      }
    ]

    where:

    s3.bucket
    Specifies the S3 bucket name.
    s3.key
    Specifies the S3 key prefix for the dataset files.
    s3.secret_ref

    Specifies the name of the Secret containing the S3 credentials.

    For the full job submission request, see Section 5.6, “Submit an evaluation job”.

    The init container downloads all objects under the specified S3 prefix to /test_data, preserving the relative directory structure. The secret_ref must reference a Secret in the tenant namespace.

Note

The expected file format and directory structure of the test data depend on the adapter and benchmark. See the adapter documentation for the required data layout.
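
As an illustration, an adapter that expects JSON Lines files could read the downloaded data as follows. The .jsonl layout is an assumption for this example, not a requirement of EvalHub.

import json
from pathlib import Path

# The init container downloads the S3 prefix to /test_data before the adapter starts.
TEST_DATA_DIR = Path("/test_data")

examples = []
for path in sorted(TEST_DATA_DIR.rglob("*.jsonl")):
    with path.open() as f:
        examples.extend(json.loads(line) for line in f if line.strip())

print(f"Loaded {len(examples)} examples from {TEST_DATA_DIR}")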

Alternatively, use the CLI:

$ evalhub eval run \
    --name s3-data-eval \
    --model-url http://my-model.my-namespace.svc.cluster.local:8080/v1 \
    --model-name my-model \
    --provider lm_evaluation_harness \
    --benchmark mmlu \
    --test-data-s3-bucket my-eval-data \
    --test-data-s3-key datasets/mmlu \
    --test-data-s3-secret my-s3-credentials

Alternatively, use the Python SDK:

from evalhub.models import (
    JobSubmissionRequest, ModelConfig, BenchmarkConfig,
    TestDataRef, S3TestDataRef
)

job = client.jobs.submit(JobSubmissionRequest(
    name="s3-data-eval",
    model=ModelConfig(
        url="http://my-model.my-namespace.svc.cluster.local:8080/v1",
        name="my-model"
    ),
    benchmarks=[
        BenchmarkConfig(
            id="mmlu",
            provider_id="lm_evaluation_harness",
            test_data_ref=TestDataRef(
                s3=S3TestDataRef(
                    bucket="my-eval-data",
                    key="datasets/mmlu",
                    secret_ref="my-s3-credentials",
                )
            ),
        )
    ],
))

Collections also support test_data_ref on individual benchmarks, allowing you to define custom data sources as part of a reusable evaluation suite.

Verification

  • Confirm that the job completes successfully. If the init container fails to download data from S3, the job transitions to the failed state.

    $ curl -s \
        -H "Authorization: Bearer $TOKEN" \
        -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/jobs/<job_id> | jq .status.state

    If the job fails, check the init container logs for download errors:

    $ oc logs <pod_name> -c init -n <namespace>

5.14. Export evaluation results to an OCI registry

EvalHub can export evaluation artifacts, such as logs, metrics, and outputs, by pushing artifacts to an Open Container Initiative (OCI) compatible registry for long-term storage and traceability.

Prerequisites

  • You have access to an OCI-compatible container registry such as Quay.io.
  • You have registry credentials for the OCI registry.

Procedure

  1. Create a kubernetes.io/dockerconfigjson Secret with your registry credentials:

    $ oc create secret docker-registry oci-registry-credentials \
        --docker-server=quay.io \
        --docker-username=<username> \
        --docker-password=<password> \
        -n <namespace>
  2. When you submit an evaluation job, include an exports block in the job submission body:

    Example OCI export configuration in a job submission

    "benchmarks": [
      {
        "provider_id": "lm_evaluation_harness",
        "benchmark_id": "mmlu"
      }
    ],
    "exports": {
      "oci": {
        "coordinates": {
          "oci_host": "quay.io",
          "oci_repository": "my-org/eval-results"
        },
        "k8s": {
          "connection": "oci-registry-credentials"
        }
      }
    }

    where:

    oci.coordinates.oci_host
    Specifies the OCI registry hostname.
    oci.coordinates.oci_repository
    Specifies the repository path within the registry.
    oci.k8s.connection

    Specifies the name of the Secret containing the registry credentials.

    For the full job submission request, see Section 5.6, “Submit an evaluation job”.

Results artifacts from the evaluation frameworks are stored as OCI artifacts with separate layers, allowing selective access to specific outputs.

Verification

  1. After the job completes, retrieve the OCI artifact reference from the job results:

    $ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/jobs/<job_id> | jq '.results.benchmarks[0].artifacts'
  2. Verify the artifact exists in the registry by using skopeo:

    $ skopeo inspect --creds <username>:<password> docker://quay.io/my-org/eval-results:<tag>

    The tag is in the format evalhub-<hash>, where the hash is derived from the job ID, provider, and benchmark. You can find the full OCI reference, including the tag, in the job results.

5.15. Configure MLflow experiment tracking for evaluation jobs

When MLflow is configured for EvalHub, you can associate evaluation jobs with designated MLflow experiments. EvalHub automatically logs benchmark metrics as MLflow runs within the experiment.

Prerequisites

  • You have a running EvalHub instance with MLflow integration enabled. For more information, see Section 5.21, “EvalHub configuration reference”.

Procedure

  1. When you submit an evaluation job, include an experiment block in the job submission body:

    Example experiment configuration in a job submission

    "benchmarks": [
      {
        "provider_id": "lm_evaluation_harness",
        "benchmark_id": "mmlu"
      }
    ],
    "experiment": {
      "name": "my-model-v2-eval"
    }

    For the full job submission request, see Section 5.6, “Submit an evaluation job”.

When using the CLI, include the experiment field in your YAML config file:

Example experiment fragment in a YAML config file

experiment:
  name: my-model-v2-eval

$ evalhub eval run --config eval-with-mlflow.yaml

For the full YAML config file structure, see Section 5.6, “Submit an evaluation job”.

When using the Python SDK, pass an ExperimentConfig to the JobSubmissionRequest:

from evalhub.models import ExperimentConfig

experiment=ExperimentConfig(name="my-model-v2-eval")

For the full JobSubmissionRequest, see Section 5.6, “Submit an evaluation job”.

Verification

When the job completes, the results section includes an mlflow_experiment_url linking to the experiment in the MLflow UI:

$ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" \
    $EVALHUB_URL/api/v1/evaluations/jobs/<job_id> | jq .results.mlflow_experiment_url

Example output

"https://mlflow.example.com/#/experiments/42"

Alternatively, use the CLI. The evalhub eval results command automatically displays the MLflow experiment URL when available:

$ evalhub eval results <job_id>

Alternatively, use the Python SDK:

job = client.jobs.get(job_id)
print(job.results.mlflow_experiment_url)

5.16. Add a custom provider by using the API

Register a custom provider by using the REST API. A provider definition includes a name, a container image for the adapter runtime, and a list of benchmarks. For more information about adapters, see Section 5.1, “Understanding EvalHub”.

Prerequisites

  • You have a running EvalHub instance.
  • You have a container image for your custom adapter packaged as a UBI9 image.

Procedure

  1. Register the custom provider:

    $ curl -X POST $EVALHUB_URL/api/v1/evaluations/providers \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -H "X-Tenant: <namespace>" \
      -d '{
        "name": "my-custom-provider",
        "title": "My Custom Provider",
        "description": "Custom evaluation framework for domain-specific benchmarks.",
        "benchmarks": [
          {
            "id": "domain_accuracy",
            "name": "Domain Accuracy",
            "category": "general",
            "metrics": ["accuracy", "f1"],
            "primary_score": {
              "metric": "accuracy",
              "lower_is_better": false
            },
            "pass_criteria": {
              "threshold": 0.8
            }
          }
        ],
        "runtime": {
          "k8s": {
            "image": "quay.io/my-org/my-adapter:latest",
            "cpu_request": "500m",
            "memory_request": "512Mi",
            "cpu_limit": "2000m",
            "memory_limit": "4Gi"
          }
        }
      }'

    Example response

    {
      "resource": {
        "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "tenant": "<namespace>",
        "created_at": "2026-04-22T10:00:00Z",
        "owner": "<username>"
      },
      "name": "my-custom-provider",
      "title": "My Custom Provider",
      "description": "Custom evaluation framework for domain-specific benchmarks.",
      "benchmarks": [
        {
          "id": "domain_accuracy",
          "name": "Domain Accuracy",
          "category": "general",
          "metrics": ["accuracy", "f1"],
          "primary_score": { "metric": "accuracy", "lower_is_better": false },
          "pass_criteria": { "threshold": 0.8 }
        }
      ],
      "runtime": {
        "k8s": {
          "image": "quay.io/my-org/my-adapter:latest",
          "cpu_request": "500m",
          "memory_request": "512Mi",
          "cpu_limit": "2000m",
          "memory_limit": "4Gi"
        }
      }
    }

The runtime.k8s section specifies the container image and resource requests for the adapter pod. Each benchmark must declare an id, name, and category. The optional primary_score and pass_criteria fields set default thresholds for the benchmark.

User-created providers can be updated and deleted through the API. Built-in providers with owner: system are read-only.

Note

The Python SDK and CLI do not support creating providers. Use the REST API to register custom providers.
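
For example, a user-created provider can be modified with a JSON Patch request against the PATCH endpoint listed in Section 5.20.2, “Provider endpoints”. The following sketch uses the requests library and assumes the server accepts the standard application/json-patch+json media type; the patched path is illustrative.

import os

import requests

base_url = os.environ["EVALHUB_URL"]
provider_id = "<provider_id>"  # ID returned when the provider was registered

# Raise the default pass threshold of the first benchmark in the provider.
patch = [
    {"op": "replace", "path": "/benchmarks/0/pass_criteria/threshold", "value": 0.85},
]

response = requests.patch(
    f"{base_url}/api/v1/evaluations/providers/{provider_id}",
    json=patch,
    headers={
        "Authorization": f"Bearer {os.environ['TOKEN']}",
        "X-Tenant": "my-namespace",
        "Content-Type": "application/json-patch+json",
    },
    timeout=30,
)
response.raise_for_status()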

Verification

  • Confirm the provider was registered by retrieving it with the ID from the response:

    $ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/providers/<provider_id> | jq .name

    The output should return "my-custom-provider".

    Alternatively, use the CLI:

    $ evalhub providers describe <provider_id>

    Alternatively, use the Python SDK:

    provider = client.providers.get(provider_id)
    print(provider.name)

5.17. Add a custom provider by using a ConfigMap

Add providers at the operator level by creating a ConfigMap in the operator namespace with the appropriate labels. The TrustyAI Operator discovers these ConfigMaps by label and mounts them into the EvalHub deployment automatically. Providers registered this way are system-owned, read-only, and available to all tenants. To register a tenant-scoped provider that can be updated or deleted, use the REST API instead. See Section 5.16, “Add a custom provider by using the API”.

Prerequisites

  • You have a running EvalHub deployment.
  • You have a container image for your custom adapter. See Section 5.19, “Write a custom evaluation adapter”.
  • You have cluster administrator privileges or permissions to create ConfigMap resources in the operator namespace.
  • You have permissions to edit the EvalHub custom resource.

Procedure

  1. Create a ConfigMap in the EvalHub custom resource namespace with the provider definition:

    evalhub-provider-my-custom-provider.yaml

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: evalhub-provider-my-custom-provider
      namespace: <evalhub-namespace>
      labels:
        trustyai.opendatahub.io/evalhub-provider-type: system
        trustyai.opendatahub.io/evalhub-provider-name: my-custom-provider
    data:
      my-custom-provider.yaml: |
        id: my-custom-provider
        name: My Custom Provider
        description: Custom evaluation framework for domain-specific benchmarks.
        runtime:
          k8s:
            image: quay.io/my-org/my-adapter:latest
            cpu_request: "500m"
            memory_request: "512Mi"
            cpu_limit: "2000m"
            memory_limit: "4Gi"
        benchmarks:
          - id: domain_accuracy
            name: Domain Accuracy
            category: general
            metrics:
              - accuracy
              - f1
            primary_score:
              metric: accuracy
              lower_is_better: false
            pass_criteria:
              threshold: 0.8

    $ oc apply -f evalhub-provider-my-custom-provider.yaml
  2. Reference the provider name in your EvalHub custom resource by adding it to the spec.providers list:

    Example spec.providers fragment

    spec:
      providers:
        - lm_evaluation_harness
        - garak
        - my-custom-provider

    For the full EvalHub custom resource structure, see Section 5.3, “Deploy EvalHub with the TrustyAI Operator”.

The operator copies the ConfigMap to the instance namespace and mounts it as a projected volume at /etc/evalhub/config/providers. The EvalHub server loads all provider YAML files from this directory at startup.

Verification

  1. Confirm that the ConfigMap was created:

    $ oc get configmap evalhub-provider-my-custom-provider -n <evalhub-namespace>
  2. Check that the EvalHub deployment has restarted and is ready:

    $ oc get pods -l app=eval-hub -n <evalhub-namespace>
  3. Confirm the custom provider is loaded:

    $ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/providers/my-custom-provider | jq .name

    The output should return "My Custom Provider".

5.18. Add a collection by using a ConfigMap

Add collections at the operator level by creating a ConfigMap in the operator namespace with the appropriate labels. The TrustyAI Operator discovers these ConfigMaps by label and mounts them into the EvalHub deployment automatically. Collections registered this way are system-owned, read-only, and available to all tenants. To create a tenant-scoped collection that can be updated or deleted, use the REST API instead. See Section 5.10, “Create a custom collection in EvalHub”.

Prerequisites

  • You have a running EvalHub deployment.
  • You have cluster administrator privileges or permissions to create ConfigMap resources in the operator namespace.
  • You have permissions to edit the EvalHub custom resource.
  • You know which provider-benchmark pairs you want to include in the collection. See Section 5.5, “List EvalHub providers and benchmarks”.

Procedure

  1. Create a ConfigMap in the EvalHub custom resource namespace with the collection definition:

    evalhub-collection-my-eval-suite.yaml

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: evalhub-collection-my-eval-suite
      namespace: <evalhub-namespace>
      labels:
        trustyai.opendatahub.io/evalhub-collection-type: system
        trustyai.opendatahub.io/evalhub-collection-name: my-eval-suite
    data:
      my-eval-suite.yaml: |
        id: my-eval-suite
        name: My Evaluation Suite
        category: general
        description: Custom evaluation suite for internal model validation.
        pass_criteria:
          threshold: 0.7
        benchmarks:
          - id: mmlu
            provider_id: lm_evaluation_harness
            weight: 2
            primary_score:
              metric: acc_norm
              lower_is_better: false
            pass_criteria:
              threshold: 0.6
          - id: hellaswag
            provider_id: lm_evaluation_harness
            weight: 1
            primary_score:
              metric: acc_norm
              lower_is_better: false
            pass_criteria:
              threshold: 0.7

    $ oc apply -f evalhub-collection-my-eval-suite.yaml
  2. Reference the collection in your EvalHub custom resource by adding the collection name to the spec.collections list:

    Example spec.collections fragment

    spec:
      collections:
        - leaderboard-v2
        - safety-and-fairness-v1
        - my-eval-suite

    For the full EvalHub custom resource structure, see Section 5.3, “Deploy EvalHub with the TrustyAI Operator”.

The operator mounts collection ConfigMaps at /etc/evalhub/config/collections.

Verification

  1. Confirm that the ConfigMap was created:

    $ oc get configmap evalhub-collection-my-eval-suite -n <evalhub-namespace>
  2. Check that the EvalHub deployment has restarted and is ready:

    $ oc get pods -l app=eval-hub -n <evalhub-namespace>
  3. List collections and confirm the custom collection appears:

    $ curl -s -H "Authorization: Bearer $TOKEN" -H "X-Tenant: <namespace>" \
        $EVALHUB_URL/api/v1/evaluations/collections/my-eval-suite | jq .name

    The output should return "My Evaluation Suite".

5.19. Write a custom evaluation adapter

An adapter translates EvalHub job requests into evaluation framework-specific commands. To write a custom adapter, install the EvalHub SDK with adapter dependencies and implement a single method.

Prerequisites

  • You have Python 3.11 or later installed.
  • You have an evaluation framework that you want to integrate with EvalHub.
  • You have podman or another container build tool installed to package the adapter as a container image.

Procedure

  1. Install the EvalHub SDK with the adapter extra:

    $ pip install "eval-hub-sdk[adapter]"
  2. Create a class that extends FrameworkAdapter and implements run_benchmark_job:

    from evalhub.adapter import FrameworkAdapter
    from evalhub.models import JobSpec, JobCallbacks, JobResults, JobStatusUpdate, JobPhase
    
    class MyAdapter(FrameworkAdapter):
        def run_benchmark_job(self, config: JobSpec, callbacks: JobCallbacks) -> JobResults:
            callbacks.report_status(JobStatusUpdate(
                phase=JobPhase.RUNNING_EVALUATION,
                message="Running evaluation"
            ))
    
            # Replace with your framework's evaluation function
            scores = run_my_framework(
                model_url=config.model.url,
                benchmark=config.benchmark_id,
                parameters=config.parameters
            )
    
            return JobResults(
                id=config.id,
                benchmark_id=config.benchmark_id,
                benchmark_index=config.benchmark_index,
                model_name=config.model.name,
                results=scores,
                num_examples_evaluated=len(scores),
                duration_seconds=self._get_duration()  # Implement to return elapsed seconds
            )

    The framework handles loading the job specification from the mounted ConfigMap, authenticating with the sidecar proxy container that communicates with the EvalHub server, and reporting results. Your adapter only needs to run the evaluation and return the results. For more information about the adapter and sidecar architecture, see Section 5.2, “EvalHub architecture overview”.

  3. Package your adapter as a Red Hat Universal Base Image 9 (UBI9) container image. Create a Containerfile in your adapter directory:

    Containerfile

    FROM registry.access.redhat.com/ubi9/python-312
    
    WORKDIR /app
    
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    COPY main.py /app/main.py
    
    ENTRYPOINT ["python", "main.py"]

    Build the image:

    $ podman build -t quay.io/my-org/my-adapter:latest .

    Push the image to a container registry:

    $ podman push quay.io/my-org/my-adapter:latest
  4. Reference the image in the provider’s runtime.k8s.image field when registering the provider. See Section 5.16, “Add a custom provider by using the API”.

The following tables describe the JobSpec and JobCallbacks interfaces available to your adapter.

Table 5.4. JobSpec fields

id
Unique job identifier.
provider_id
Identifier of the provider that the benchmark belongs to.
benchmark_id
Identifier of the benchmark to evaluate.
benchmark_index
Index of this benchmark within the job.
model
Model configuration, including url and name.
parameters
Benchmark-specific parameters, for example num_fewshot or limit.
num_examples
The number of examples to evaluate. When set to None, the adapter evaluates all examples.
exports
Optional OCI artifact export specification.

Table 5.5. JobCallbacks methods

report_status(update)
Sends progress updates including the phase, message, and completed/total steps.
create_oci_artifact(spec)
Pushes evaluation artifacts to an OCI registry.
report_results(results)
Reports the final results to the EvalHub server. This method is called automatically if you return JobResults.

5.20. EvalHub API endpoints

All endpoints use the path prefix /api/v1. The OpenAPI 3.1.0 specification is available at /openapi.yaml and interactive documentation is available at /docs.
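
For example, you can download the specification and list the documented paths. The following sketch uses the requests and PyYAML libraries and assumes only the standard OpenAPI document layout.

import os

import requests
import yaml  # PyYAML

base_url = os.environ["EVALHUB_URL"]

response = requests.get(f"{base_url}/openapi.yaml", timeout=30)
response.raise_for_status()

spec = yaml.safe_load(response.text)
for path, operations in spec.get("paths", {}).items():
    print(path, sorted(operations))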

5.20.1. Evaluation job endpoints

Table 5.6. Evaluation job endpoints

POST /api/v1/evaluations/jobs
Create and submit an evaluation job. Returns 202 Accepted.
GET /api/v1/evaluations/jobs
List evaluation jobs with pagination and filtering.
GET /api/v1/evaluations/jobs/{id}
Get a specific evaluation job with current status and results.
DELETE /api/v1/evaluations/jobs/{id}
Cancel or hard-delete a job. Use ?hard_delete=true for permanent removal.
POST /api/v1/evaluations/jobs/{id}/events
Submit job status events from the adapter runtime.

Table 5.7. Evaluation job states

pending
The job is created and awaiting execution.
running
The evaluation is actively running.
completed
All benchmarks completed successfully.
failed
The evaluation encountered a fatal error.
cancelled
The user canceled the job.
partially_failed
Some benchmarks succeeded and others failed.

5.20.2. Provider endpoints

Table 5.8. Provider endpoints

POST /api/v1/evaluations/providers
Create a custom provider.
GET /api/v1/evaluations/providers
List providers. Use ?benchmarks=true to include benchmarks.
GET /api/v1/evaluations/providers/{id}
Get a provider with all its benchmarks.
PUT /api/v1/evaluations/providers/{id}
Replace a provider.
PATCH /api/v1/evaluations/providers/{id}
Patch a provider with JSON Patch operations.
DELETE /api/v1/evaluations/providers/{id}
Delete a provider.

Table 5.9. Built-in providers

lm_evaluation_harness (167 benchmarks)
General-purpose LLM evaluation: MMLU, HellaSwag, ARC, TruthfulQA, GSM8K, and more across 12 categories.
garak (8 benchmarks)
Security vulnerability scanning: OWASP LLM Top 10, AVID taxonomy, CWE.
guidellm (7 benchmarks)
Performance benchmarking for LLM inference endpoints.
lighteval (24 benchmarks)
Lightweight evaluation framework.

5.20.3. Collection endpoints

Table 5.10. Collection endpoints

POST /api/v1/evaluations/collections
Create a benchmark collection.
GET /api/v1/evaluations/collections
List collections with filtering.
GET /api/v1/evaluations/collections/{id}
Get a collection with all benchmark references.
PUT /api/v1/evaluations/collections/{id}
Replace a collection.
PATCH /api/v1/evaluations/collections/{id}
Patch a collection with JSON Patch operations.
DELETE /api/v1/evaluations/collections/{id}
Delete a collection.

5.20.4. Health and observability endpoints

Table 5.11. Health and observability endpoints

GET /api/v1/health
Health check with status, timestamp, and build information.
GET /metrics
Prometheus metrics endpoint when enabled.
GET /openapi.yaml
OpenAPI 3.1.0 specification in YAML or JSON based on Accept header.
GET /docs
Interactive Swagger UI documentation.

5.21. EvalHub configuration reference

Configuration applies to the EvalHub server component. EvalHub is configured by using config/config.yaml and environment variables. Environment variables take precedence over the configuration file.

When deploying EvalHub with the TrustyAI Operator, the operator generates the config.yaml automatically from the EvalHub custom resource and environment variables defined in the spec.env field. You do not need to create or edit config.yaml directly. For information about configuring the EvalHub custom resource, see Section 5.3, “Deploy EvalHub with the TrustyAI Operator”.

5.21.1. Service configuration

Table 5.12. Service parameters

service.port (environment variable: PORT; default: 8080)
The port that the API server listens on.
service.host (environment variable: API_HOST; default: 127.0.0.1)
The address that the API server binds to.
service.tls_cert_file (environment variable: TLS_CERT_FILE)
Path to the TLS certificate file.
service.tls_key_file (environment variable: TLS_KEY_FILE)
Path to the TLS private key file.
service.disable_auth (default: false)
Disables authentication and authorization. Setting this to true allows unauthenticated access to all endpoints. Do not enable this in production environments.

5.21.2. Database configuration

Note

When deploying EvalHub with the TrustyAI Operator, you must set spec.database.type in the EvalHub custom resource to either postgresql or sqlite. The operator generates the corresponding configuration automatically. The postgresql option sets the driver to pgx and injects the connection URL from a Kubernetes Secret. The sqlite option sets the driver to sqlite with an in-memory database. Data is not persisted across restarts with sqlite. Use postgresql for production deployments.

The following table describes the parameters available in the EvalHub config/config.yaml configuration file.

Table 5.13. Database parameters

database.driver (default: sqlite)
The storage driver. Supported values: sqlite, pgx. The default sqlite option uses an in-memory database and data is not persisted across restarts. Use pgx with PostgreSQL for production deployments.
database.url (environment variable: DB_URL; default: file::eval_hub:?mode=memory&cache=shared)
The database connection string. The default value is a SQLite in-memory URI, which stores all data in memory and does not persist across restarts. For PostgreSQL, use the format postgres://user:password@host:5432/eval_hub. Store the connection string in a Kubernetes Secret rather than inline to avoid exposing credentials. For instructions, see Section 5.3, “Deploy EvalHub with the TrustyAI Operator”.

5.21.3. MLflow configuration

Table 5.14. MLflow parameters

  mlflow.tracking_uri
    Environment variable: MLFLOW_TRACKING_URI. No default.
    The URL of the MLflow tracking server. Setting this parameter enables MLflow integration. When set, evaluation results are logged to MLflow. Without this parameter, MLflow tracking is disabled.

  mlflow.ca_cert_path
    Environment variable: MLFLOW_CA_CERT_PATH. No default.
    The path to a TLS CA certificate file for verifying the MLflow server’s certificate.

  mlflow.insecure_skip_verify
    Environment variable: MLFLOW_INSECURE_SKIP_VERIFY. Default: false.
    If true, skips TLS certificate verification when connecting to MLflow. Use this option only for testing with self-signed certificates. Do not enable this in production environments.

  mlflow.token_path
    Environment variable: MLFLOW_TOKEN_PATH. No default.
    The path to a file containing an authentication token for the MLflow server. The token is sent as a Bearer token in the Authorization header. The default path is /var/run/secrets/mlflow/token, which is a projected ServiceAccount token.

  mlflow.workspace
    Environment variable: MLFLOW_WORKSPACE. No default.
    The MLflow workspace or experiment namespace.
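
For example, the following environment variables enable MLflow tracking against a TLS-protected server. The tracking URL and CA certificate path are placeholders; with the TrustyAI Operator, set these values through the spec.env field of the EvalHub custom resource instead.

    $ export MLFLOW_TRACKING_URI=https://mlflow.example.com
    $ export MLFLOW_CA_CERT_PATH=/etc/pki/mlflow/ca.crt
    $ export MLFLOW_TOKEN_PATH=/var/run/secrets/mlflow/token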

5.21.4. OpenTelemetry configuration

When deploying with the TrustyAI Operator, include the otel field in the EvalHub custom resource; its presence enables OpenTelemetry automatically.

Table 5.15. OpenTelemetry parameters available in the EvalHub custom resource

  otel.exporterType
    Default: otlp-grpc.
    The exporter type. Supported values: otlp-grpc, otlp-http, stdout.

  otel.exporterEndpoint
    No default.
    The endpoint for the OTLP exporter, for example localhost:4317 for gRPC.

  otel.exporterInsecure
    Default: false.
    If true, disables TLS for the OTLP exporter connection. Do not enable this in production environments.

  otel.samplingRatio
    Default: 1.0.
    Trace sampling ratio as a value between 0 and 1. For example, 0.5 samples 50% of traces.
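
The following fragment shows how these fields might appear in an EvalHub custom resource. It is a sketch that assumes the otel block is nested under spec and that an OTLP collector is reachable at the endpoint shown; adjust both for your environment.

    spec:
      otel:
        exporterType: otlp-grpc
        exporterEndpoint: otel-collector.observability.svc.cluster.local:4317
        samplingRatio: 0.5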

5.22. EvalHub multi-tenancy and RBAC

EvalHub supports namespace-based multi-tenancy, where each Kubernetes namespace represents a tenant. EvalHub enforces isolation at the following layers:

  • Authentication — EvalHub uses the Kubernetes TokenReview API to validate bearer tokens in incoming requests.
  • Authorization — SubjectAccessReview (SAR) checks verify that the caller has permission to perform the requested operation on EvalHub virtual resources in the target namespace. Virtual resources are logical resource names that EvalHub defines for RBAC purposes under the trustyai.opendatahub.io API group. They do not correspond to Kubernetes custom resource definitions. The virtual resources are evaluations, collections, providers, and status-events. For the full list of verbs, see Section 5.25, “EvalHub roles reference”.
  • Data isolation — EvalHub scopes all database queries by tenant_id to prevent cross-tenant data access.
  • Job execution — EvalHub creates Job resources in the tenant’s namespace.

The X-Tenant request header determines the target tenant namespace. The X-User header identifies the authenticated user.
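
For example, a client lists the collections in a tenant namespace by passing both a bearer token and the tenant header. The route hostname is a placeholder for your environment.

    $ curl -s https://<evalhub-route>/api/v1/evaluations/collections \
        -H "Authorization: Bearer $TOKEN" \
        -H "X-Tenant: <namespace>"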

5.23. Set up a tenant namespace

Register a namespace as an EvalHub tenant so that users, programmatic clients, and agents can submit evaluation jobs in that namespace.

Prerequisites

  • You have cluster administrator privileges.
  • You have a running EvalHub instance.
  • You have a namespace to use as a tenant.

Procedure

  1. Add the tenant label to the namespace:

    $ oc label namespace <namespace> evalhub.trustyai.opendatahub.io/tenant=

    The label value is intentionally empty. The TrustyAI Operator checks for the presence of the label, not its value.

    Note

    Use a dedicated namespace for EvalHub rather than redhat-ods-applications, as described in Section 5.3, “Deploy EvalHub with the TrustyAI Operator”. The redhat-ods-applications namespace has NetworkPolicy resources that restrict cross-namespace traffic, which requires additional labeling on tenant namespaces. If EvalHub is deployed in redhat-ods-applications, label each tenant namespace to allow the evaluation Job sidecar to communicate with the EvalHub server:

    $ oc label namespace <namespace> opendatahub.io/generated-namespace=true

    Review the NetworkPolicy resources with oc get networkpolicy -n <evalhub-server-namespace> to determine any additional requirements.

The TrustyAI Operator watches for this label and automatically provisions the following resources in the labeled namespace:

  • A job ServiceAccount used by evaluation Job pods as their identity.
  • A Role and RoleBinding granting the job ServiceAccount permission to create status-events for reporting job progress.
  • A RoleBinding granting the EvalHub API ServiceAccount permission to create and delete Job resources in the tenant namespace.
  • A RoleBinding granting the EvalHub API ServiceAccount permission to manage ConfigMap resources used to mount job specifications into Job pods.
  • A RoleBinding granting the job ServiceAccount access to MLflow resources when MLflow is configured.
  • A service CA ConfigMap with the cluster CA bundle injected by OpenShift, so that Job pods can make HTTPS requests to the EvalHub API.

When the tenant label is removed from a namespace, the controller cleans up all provisioned resources automatically.

Verification

  1. Confirm that the tenant label is set on the namespace:

    $ oc get namespace <namespace> --show-labels | grep evalhub
  2. Confirm that the operator provisioned the expected resources in the tenant namespace:

    $ oc get serviceaccount,rolebinding,configmap -n <namespace> | grep evalhub

    The output should include a ServiceAccount, RoleBinding resources, and a service CA ConfigMap created by the operator.

5.24. Grant access to EvalHub

Grant tenant users access to EvalHub by creating a Role and RoleBinding in the tenant namespace. EvalHub supports three types of principals.

Prerequisites

  • You have permissions to create Role and RoleBinding resources in the tenant namespace.
  • You have impersonation privileges to verify access with oc auth can-i --as.
  • You have set up the target namespace as an EvalHub tenant.
  • You have identified which virtual resources and verbs to grant. See Section 5.25, “EvalHub roles reference” for available resources.

Procedure

Select the type of principal that matches your use case.

Table 5.16. Principal types

  ServiceAccount
    Token source: mounted pod token or long-lived token.
    Use case: automation, CI/CD pipelines, and agents using Model Context Protocol (MCP).

  OpenShift User
    Token source: oc whoami -t.
    Use case: interactive use.

  OpenShift Group
    Token source: user token with group membership.
    Use case: team-based access.

  1. Create a Role in the tenant namespace that grants access to the required EvalHub virtual resources:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: evalhub-evaluator
      namespace: <namespace>
    rules:
      - apiGroups: ["trustyai.opendatahub.io"]
        resources: ["evaluations", "collections", "providers"]
        verbs: ["get", "list", "create", "update", "delete"]
      - apiGroups: ["mlflow.kubeflow.org"]
        resources: ["experiments"]
        verbs: ["create", "get"]
    Save the Role to a file, for example evalhub-evaluator-role.yaml, and apply it:

    $ oc apply -f evalhub-evaluator-role.yaml
  2. Create a RoleBinding to bind the principal to the Role.

    • To grant access to a ServiceAccount:

      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: my-sa-evalhub-access
        namespace: <namespace>
      subjects:
        - kind: ServiceAccount
          name: my-sa
          namespace: <namespace>
      roleRef:
        kind: Role
        name: evalhub-evaluator
        apiGroup: rbac.authorization.k8s.io
      Save the RoleBinding to a file, for example my-sa-evalhub-access.yaml, and apply it:

      $ oc apply -f my-sa-evalhub-access.yaml

      To obtain a bearer token for a ServiceAccount, run the following command:

      $ export TOKEN=$(oc create token my-sa -n <namespace> --duration=1h)
    • To grant access to a User:

      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: user-evalhub-access
        namespace: <namespace>
      subjects:
        - kind: User
          name: <username>
      roleRef:
        kind: Role
        name: evalhub-evaluator
        apiGroup: rbac.authorization.k8s.io
      Save the RoleBinding to a file, for example user-evalhub-access.yaml, and apply it:

      $ oc apply -f user-evalhub-access.yaml

      To obtain a bearer token for an OpenShift User, log in as the user and run the following command:

      $ export TOKEN=$(oc whoami -t)
    • To grant access to a Group:

      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: team-evalhub-access
        namespace: <namespace>
      subjects:
        - kind: Group
          name: evalhub-users
      roleRef:
        kind: Role
        name: evalhub-evaluator
        apiGroup: rbac.authorization.k8s.io
      Save the RoleBinding to a file, for example team-evalhub-access.yaml, and apply it:

      $ oc apply -f team-evalhub-access.yaml

      To obtain a bearer token for a Group member, log in as a user who belongs to the group and run the following command:

      $ export TOKEN=$(oc whoami -t)

Verification

Verify that the principal has the expected permissions on the EvalHub virtual resources by using oc auth can-i.

  • For a ServiceAccount:

    $ oc auth can-i create evaluations.trustyai.opendatahub.io \
        -n <namespace> \
        --as=system:serviceaccount:<namespace>:my-sa
  • For an OpenShift User:

    $ oc auth can-i create evaluations.trustyai.opendatahub.io \
        -n <namespace> \
        --as=<username>
  • For an OpenShift Group:

    $ oc auth can-i create evaluations.trustyai.opendatahub.io \
        -n <namespace> \
        --as=<username> --as-group=evalhub-users

Each command should return yes.

5.25. EvalHub roles reference

EvalHub uses virtual Kubernetes resources for tenant authorization. These resources do not correspond to actual Kubernetes API resources. EvalHub performs SubjectAccessReview (SAR) checks against these resources in the tenant namespace specified by the X-Tenant header.

To authorize tenant users, create a Role in the tenant namespace granting the required verbs on these virtual resources. For instructions, see Section 5.24, “Grant access to EvalHub”.

Table 5.17. Virtual resources for tenant authorization

  evaluations (API group: trustyai.opendatahub.io)
    Verbs: get, list, create, update, delete.
    Submit, view, update, and delete evaluation jobs.

  collections (API group: trustyai.opendatahub.io)
    Verbs: get, list, create, update, delete.
    Create, view, update, and delete benchmark collections.

  providers (API group: trustyai.opendatahub.io)
    Verbs: get, list, create, update, delete.
    Create, view, update, and delete evaluation providers.

  status-events (API group: trustyai.opendatahub.io)
    Verbs: create.
    Report job progress. Used by operator-provisioned job ServiceAccounts, not by tenant users.

  experiments (API group: mlflow.kubeflow.org)
    Verbs: create, get.
    Create and access MLflow experiments for result tracking.
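
For example, the following Role grants read-only access to evaluation jobs, collections, and providers in a tenant namespace. The role name evalhub-viewer is illustrative.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: evalhub-viewer
      namespace: <namespace>
    rules:
      - apiGroups: ["trustyai.opendatahub.io"]
        resources: ["evaluations", "collections", "providers"]
        verbs: ["get", "list"]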

5.26. Additional resources

The following resources provide additional information about EvalHub.
