Agent Evaluation

Agent evaluation tests how well an agent's orchestration works and whether it meets the capabilities and performance you expect. You import a test dataset, run the agent against it in batch, and analyze the outputs to get objective quality metrics that guide further tuning of the workflow.

Features

  • Batch data testing: Import scenario-based datasets to simulate user dialogues, run them in batch, and collect outputs for a full assessment of response quality.
  • AI model-assisted analysis: Let an AI model evaluate the results automatically, returning quality judgments and performance scores to speed up analysis.
  • Multi-dimensional comparison: Compare and annotate results online, benchmark across versions, and trace knowledge-base retrieval for detailed diagnostics.

Batch testing

Go to the My Agent list, select an agent, then click Batch Testing in the Operation column.

My Agent list with the Batch Testing action

Alternatively, click Manage in the Operation column to open the agent details page, then click Batch Testing in the top right corner.

Agent details page with Batch Testing button

Note: Only agents with an officially released version can be batch tested.

AI evaluation configuration

The platform can use an AI evaluation model to analyze test results automatically. When you create an evaluation task with the AI Evaluation debug type, the system invokes the model to analyze the agent's outputs and generate a report.

  1. Click AI Evaluation Configuration in the top right corner. This configuration applies only to the current agent. Every AI evaluation task you create for this agent reuses it.

    AI Evaluation Configuration entry

  2. Configure both the Evaluation Model and the Evaluation Prompt.

    Evaluation model and prompt fields

    Enter the prompt manually, or click Prompt Template in the lower-left corner to preview a template and click Use to apply it. Templates are available in Chinese and English. Click Switch to English or Switch to Chinese to change the template language.

    Prompt template selection

    When done, click OK.

  3. Each save creates a historical version of the configuration. In the History section on the right, click Details to view a version, or click Restore this version to roll back to it.

    Configuration history list

    Historical version details
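
To make the Evaluation Prompt field more concrete, here is a minimal sketch of how a grading prompt might be assembled for one test case. The wording, the placeholder names (`question`, `expected`, `actual`), and the PASS/FAIL reply format are all assumptions for illustration. The platform's built-in Prompt Template and its variable syntax may look quite different.

```python
from string import Template

# Hypothetical prompt wording and placeholder names; the platform's own
# Prompt Template and variable syntax may differ.
EVALUATION_PROMPT = Template(
    "You are grading an agent's answer.\n"
    "Question: $question\n"
    "Expected answer: $expected\n"
    "Agent answer: $actual\n"
    "Reply with PASS or FAIL on the first line, then one sentence of reasoning."
)


def render_prompt(question, expected, actual):
    """Fill the template for a single test case."""
    return EVALUATION_PROMPT.substitute(
        question=question, expected=expected, actual=actual
    )


prompt = render_prompt(
    "What are your support hours?",
    "9:00 to 17:00 on weekdays",
    "We are open 9 to 5, Monday through Friday.",
)
print(prompt)
```

Asking for a fixed verdict on the first line is one way to keep the model's judgment easy to scan. Whether that suits your agent depends on how strictly the expected output should be matched.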

Create a task

On the batch testing task list, click Create Task in the top right corner. You can evaluate any released version of the agent.

Create Task entry

Create Task form

| Field | Description |
| --- | --- |
| Data Region | The data center where the agent is deployed. Task data is also saved here. |
| Agent | The name of the agent. |
| Select Version | A historical version of the current agent. |
| Test Task Name | The name of the evaluation task. |
| Debug Type | Agent Execution runs the test data and outputs results. AI Evaluation runs the agent, then has the AI evaluation model analyze the outputs and return results. |
| Import Data | Import test data from a spreadsheet, one file at a time. Click Download Test Set Template and format your data to the template to avoid parsing failures. |

When done, click Save and Execute Immediately to run the task.
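
If your test cases live in code or another system, you could assemble them programmatically before moving them into the template. The sketch below is only an illustration: the column names `Input` and `Expected Output` are assumed, and it writes CSV text, whereas the real layout and file format are defined by the file you get from Download Test Set Template.

```python
import csv
import io

# Hypothetical column names; the real ones come from the downloaded
# test set template and may differ.
COLUMNS = ["Input", "Expected Output"]


def build_test_set(cases):
    """Serialize (input, expected_output) pairs to CSV text, skipping blank inputs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for user_input, expected in cases:
        user_input, expected = user_input.strip(), expected.strip()
        if not user_input:
            continue  # a row without an input cannot be run
        writer.writerow([user_input, expected])
    return buffer.getvalue()


cases = [
    ("How do I reset my password?", "Explain the password reset steps."),
    ("   ", "ignored because the input is blank"),
    ("What are your support hours?", "State the support hours."),
]
test_set_csv = build_test_set(cases)
print(test_set_csv)
```

Dropping blank rows before import is a cheap guard against the parsing failures mentioned above. The resulting rows would still need to be saved in the template's own spreadsheet format.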

Evaluation result

After the task finishes, click Details to view results online, or click Download to download the result file.

Evaluation results view

If the agent is linked to a knowledge base, click View Retrieval in the results to see the retrieval details.

Knowledge base retrieval details

| Field | Description |
| --- | --- |
| Input | The test case data you uploaded. |
| Expected Output | The response you expect for the input. |
| Actual Output | The result the agent produced. |
| Evaluation Result | A manual annotation. Mark each result pass or fail and add comments. |
| Evaluation Description | For AI evaluation, the model's opinion. For agent execution, empty until you annotate it. |
| Additional Information | Add remarks as needed. |
| Knowledge Retrieval | If the agent is linked to a knowledge base, the retrieval details for this input. Retrieval details cannot be exported. |
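
Once a result file is downloaded, a short script can turn the annotations into a pass rate. This is a sketch under assumptions: it treats the export as CSV with headers matching the field names above and with `pass` / `fail` as the annotation values. The actual download may use a different format or different labels, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

# Hypothetical export: a CSV whose headers match the field names above.
# The real download may use another format or different labels.
SAMPLE_EXPORT = """Input,Expected Output,Actual Output,Evaluation Result
How do I reset my password?,Reset steps,Go to Settings and choose Reset,pass
What are your support hours?,9 to 5 on weekdays,I am not sure,fail
Where is my invoice?,Billing page,Open the Billing page,pass
Can I change my plan?,Yes via Plans,,
"""


def summarize(export_text):
    """Count pass, fail, and unannotated rows, and compute the pass rate."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(export_text)):
        verdict = (row.get("Evaluation Result") or "").strip().lower()
        counts[verdict if verdict in ("pass", "fail") else "unannotated"] += 1
    judged = counts["pass"] + counts["fail"]
    # Unannotated rows are excluded from the rate rather than counted as failures.
    pass_rate = counts["pass"] / judged if judged else None
    return dict(counts), pass_rate


counts, pass_rate = summarize(SAMPLE_EXPORT)
print(counts, pass_rate)
```

Excluding unannotated rows from the denominator is a design choice. Counting them as failures would give a stricter number for Agent Execution tasks that have not been fully reviewed.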

Result comparison

You can compare any two tasks of an agent online. On the Batch Testing task list, click Result Comparison, select two historical tasks, then click Result Comparison again to view the details.

Result comparison selection

Result comparison details

You can also download the result files and compare them in detail locally.
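
For a local comparison, one approach is to join two downloaded result files on the input and list the cases whose verdict changed. The sketch below assumes CSV exports with `Input` and `Evaluation Result` columns and `pass` / `fail` values. Those names, the file format, and the sample data are illustrative assumptions, not the platform's documented export schema.

```python
import csv
import io

# Hypothetical exports from two tasks run on different agent versions.
BASELINE = """Input,Actual Output,Evaluation Result
Q1,A1,pass
Q2,A2,pass
Q3,A3,fail
"""
CANDIDATE = """Input,Actual Output,Evaluation Result
Q1,A1,pass
Q2,A2 changed,fail
Q3,A3 fixed,pass
"""


def load_verdicts(export_text):
    """Map each input to its evaluation result."""
    return {
        row["Input"]: row["Evaluation Result"].strip().lower()
        for row in csv.DictReader(io.StringIO(export_text))
    }


def compare(baseline_text, candidate_text):
    """List inputs that regressed (pass to fail) or improved (fail to pass)."""
    old, new = load_verdicts(baseline_text), load_verdicts(candidate_text)
    regressed, improved = [], []
    for key in old.keys() & new.keys():  # only inputs present in both tasks
        if old[key] == "pass" and new[key] == "fail":
            regressed.append(key)
        elif old[key] == "fail" and new[key] == "pass":
            improved.append(key)
    return sorted(regressed), sorted(improved)


regressed, improved = compare(BASELINE, CANDIDATE)
print("regressed:", regressed)
print("improved:", improved)
```

Keying on the input text only works when both tasks were run on the same test set. Inputs that appear in just one file are ignored here.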

Billing

The agent evaluation feature itself is free, but the tokens consumed by running an evaluation task are billed at standard rates.

Go to the Resource Consume page for detailed token consumption. You can also select an agent, open the Batch Testing list, and check the Token Consumption column for each task.

Token consumption view
