How to Evaluate an Agent

This guide walks you through the complete process of evaluating an agent in Youtu-Agent, from implementation to analysis.

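Prerequisites

Before you begin, make sure that:

Environment is set up: you have completed the Quick Start setup
Database is configured: the UTU_DB_URL environment variable is set in your .env file (defaults to sqlite:///test.db)
API keys are configured: UTU_LLM_API_KEY and any other required keys are set in .env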
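
A quick way to confirm these prerequisites is to check the variables from Python. The sketch below is a hypothetical helper, not part of Youtu-Agent; it only uses the two variable names listed above and reads the process environment, so it assumes your .env values have already been loaded or exported into the shell.

```python
import os

# Variable names come from the prerequisites above; the helper itself is hypothetical.
REQUIRED_VARS = ("UTU_DB_URL", "UTU_LLM_API_KEY")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Since UTU_DB_URL has a default, a missing value there may be acceptable; treat the result as a reminder rather than a hard failure.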

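Overview

The evaluation process consists of four main steps:

Implement and debug your agent - test your agent interactively
Prepare the evaluation dataset - create and upload test cases
Run the evaluation - execute the automated evaluation
Analyze the results - review performance in the analysis dashboard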

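Step 1: Implement and Debug Your Agent

Before running a formal evaluation, implement and test your agent interactively to make sure it behaves as expected.

1.1 Create the Agent Configuration

Create an agent configuration file under configs/agents/. For example, configs/agents/examples/svg_generator.yaml: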

# @package _global_
defaults:
  - /model/base@orchestrator_model
  - /agents/router/force_plan@orchestrator_router
  - /agents/simple/search_agent@orchestrator_workers.SearchAgent
  - /agents/simple/svg_generator@orchestrator_workers.SVGGenerator
  - _self_

type: orchestrator

orchestrator_config:
  name: SVGGeneratorAgent
  add_chitchat_subagent: false
  additional_instructions: |-
    Your main objective is generating SVG code according to user's request.

orchestrator_workers_info:
  - name: SearchAgent
    description: Performs focused information retrieval
  - name: SVGGenerator
    description: Creates SVG cards

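1.2 Test Interactively

Run the CLI chat interface to verify that your agent works:

python scripts/cli_chat.py --config examples/svg_generator

Try a variety of inputs and check that the agent's responses match your expectations. Debug and refine the agent configuration as needed.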

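Step 2: Prepare the Evaluation Dataset

Once your agent works, create a dataset of test cases to evaluate it on.

2.1 Create the Dataset File

Create a JSONL file (for example, data/svg_gen.jsonl) in which each line is a JSON object containing one test case. Two data formats are supported:

Format 1: LLaMA Factory format (recommended for SFT data):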

{"instruction": "Research Youtu-Agent and create an SVG summary"}
{"instruction": "Introduce the data formats in LLaMA Factory"}
{"instruction": "Summarize how to use Claude Code"}
{"input": "https://docs.claude.com/en/docs/claude-code/skills"}
{"input": "Please introduce Devin coding agent"}
{"instruction": "Summarize the following content", "input": "# Model Context Protocol servers\n\nThis repository is a collection of reference implementations..."}
{"input": "https://arxiv.org/abs/2502.05957"}
{"input": "https://github.com/microsoft/agent-lightning"}
{"input": "Summarize resources about MCP"}
{"input": "https://huggingface.co/papers/2510.19779"}

In the LLaMA Factory format:
- The instruction and/or input fields are combined to form the question
- The output field, if present, becomes the ground-truth answer

Format 2: Default format:
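
The mapping just described can be sketched as a small function. This is an illustration only: the function name is hypothetical, and joining the two fields with a newline is an assumption, since the exact rule used by the upload script is not stated here.

```python
def to_question_answer(record):
    """Map a LLaMA Factory style record to a (question, answer) pair."""
    parts = [
        record.get("instruction", "").strip(),
        record.get("input", "").strip(),
    ]
    # Assumption: non-empty fields are joined with a newline.
    question = "\n".join(p for p in parts if p)
    # A missing or empty output means there is no ground truth.
    answer = record.get("output") or None
    return question, answer


if __name__ == "__main__":
    print(to_question_answer({"input": "Please introduce Devin coding agent"}))
```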

{"question": "What is Youtu-Agent?", "answer": "A flexible agent framework"}
{"question": "How to install?", "answer": "Run uv sync"}
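
Before uploading, it can help to check that every line parses and carries a question. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the rule "a record needs instruction or input (LLaMA Factory) or question (default)" is inferred from the two formats above rather than taken from the upload script.

```python
import json


def find_invalid_lines(lines, data_format):
    """Return (line_number, reason) pairs for lines that would not yield a question."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        if data_format == "llamafactory":
            has_question = bool(record.get("instruction") or record.get("input"))
        else:  # treated as the "default" format
            has_question = bool(record.get("question"))
        if not has_question:
            problems.append((number, "no question field"))
    return problems
```

To check a file, pass an open file handle, for example `find_invalid_lines(open("data/svg_gen.jsonl", encoding="utf-8"), "llamafactory")`.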

2.2 Upload the Dataset to the Database

Use the upload_dataset.py script to upload your dataset:

python scripts/data/upload_dataset.py \
  --file_path data/svg_gen.jsonl \
  --dataset_name example_svg_gen \
  --data_format llamafactory

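Parameters:
- --file_path: path to your JSONL file
- --dataset_name: name to assign to this dataset in the database
- --data_format: llamafactory or default

The script parses your JSONL file and stores the test cases in the database. You should see output similar to the following: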

Uploaded 10 datapoints from data/svg_gen.jsonl to dataset 'example_svg_gen'.
Upload complete.

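Step 3: Run the Evaluation

Now run your agent on the evaluation dataset.

3.1 Create the Evaluation Configuration

Create an evaluation configuration file under configs/eval/examples/. For example, configs/eval/examples/eval_svg_generator.yaml: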

# @package _global_
defaults:
  - /agents/examples/svg_generator@agent
  - _self_

exp_id: "example_svg_gen"

data:
  dataset: example_svg_gen

concurrency: 16

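Key fields:
- defaults: references your agent configuration
- exp_id: unique identifier for this evaluation run
- data.dataset: name of the dataset you uploaded
- concurrency: number of evaluation tasks to run in parallel

3.2 Run the Evaluation Script

Run the evaluation: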
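
To make the effect of the concurrency field concrete, here is a generic sketch of bounded parallelism using asyncio. It is not Youtu-Agent's implementation; `run_all` and `fake_agent` are hypothetical names, and the sleep stands in for a real agent call.

```python
import asyncio


async def run_all(cases, worker, concurrency):
    """Run worker(case) for every case with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(case):
        # Each task waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await worker(case)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(c) for c in cases))


async def fake_agent(case):
    # Stand-in for a real agent call.
    await asyncio.sleep(0.01)
    return f"done: {case}"


if __name__ == "__main__":
    print(asyncio.run(run_all(["a", "b", "c"], fake_agent, concurrency=2)))
```

A higher value finishes sooner but sends more simultaneous requests to the model provider, so rate limits are usually the practical ceiling.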

python scripts/run_eval.py \
  --config_name examples/eval_svg_generator \
  --exp_id example_svg_gen \
  --dataset example_svg_gen \
  --step rollout

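Parameters:
- --config_name: name of your evaluation config (without the .yaml extension)
- --exp_id: unique identifier for this evaluation run
- --dataset: name of the dataset to evaluate
- --step: which evaluation step to run
  - rollout: run the agent on all test cases (collects trajectories)
  - judge: score the agent's outputs (requires ground-truth answers)
  - all: run both the rollout and judge steps

Note: Use --step rollout when your dataset has no ground-truth (GT) answers. If you do have GT answers, you can run --step all, or run the judge step separately:

# Run the judge step on its own (rollout must be completed first)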
python scripts/run_eval_judge.py \
  --config_name examples/eval_svg_generator \
  --exp_id example_svg_gen
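
The rule in the note above can be expressed as a small check over the dataset file. This is a hypothetical helper, not part of the framework, and it assumes that the judge step is only worthwhile when every record has a ground-truth field (answer in the default format, output in the LLaMA Factory format).

```python
import json


def suggest_step(jsonl_lines):
    """Return "all" if every record carries a ground-truth answer, else "rollout"."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not records:
        return "rollout"
    # Assumption: a non-empty "answer" or "output" field counts as ground truth.
    has_gt = all(r.get("answer") or r.get("output") for r in records)
    return "all" if has_gt else "rollout"


if __name__ == "__main__":
    sample = ['{"question": "How to install?", "answer": "Run uv sync"}']
    print(suggest_step(sample))
```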

3.3 Monitor Progress

During the evaluation, you will see progress indicators:

> Loading dataset 'example_svg_gen'...
> Found 10 test cases
> Starting rollout with concurrency 16...
[1/10] Processing: Research Youtu-Agent...
[2/10] Processing: Introduce LLaMA Factory...
...
> Evaluation complete. Results saved to database.

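Step 4: Analyze the Results

Once the evaluation finishes, analyze the results in the web-based analysis dashboard.

4.1 Set Up the Analysis Dashboard

The evaluation analysis dashboard is a Next.js application located in frontend/exp_analysis/.

First-time setup: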

cd frontend/exp_analysis

# Install dependencies
npm install --legacy-peer-deps

# Build the application
npm run build

4.2 Start the Dashboard

# Make sure you are in frontend/exp_analysis/
npm run start

By default, the dashboard is available at http://localhost:3000.

Change the port (optional): edit package.json and modify the start script:

"scripts": {
  "start": "next start -p 8080"
}

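4.3 View the Evaluation Results

Open your browser and navigate to http://localhost:3000. The dashboard provides:

Overview: summary statistics for your evaluation runs
Experiment comparison: side-by-side comparison of multiple evaluation runs
Detailed trajectories: inspection of individual agent execution traces
Success/failure analysis: patterns that distinguish successful cases from failed ones
Performance metrics: accuracy, latency, token usage, and more

Key features:
- Filter by experiment ID, dataset, or date range
- View full agent trajectories, including tool calls and reasoning steps
- Export results for further analysis
- Compare different agent configurations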

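4.4 Verify the Database Connection

If you run into problems, test the database connection:

cd frontend/exp_analysis
npm run test:db

This verifies that the dashboard can connect to your database and retrieve evaluation data.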
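
If you use the default SQLite database, you can also look at it directly from Python. The sketch below is an independent check, not the dashboard's own test: the function names are hypothetical, it only handles sqlite:/// URLs, and it makes no assumption about which tables Youtu-Agent creates; it simply lists whatever is there.

```python
import os
import sqlite3
from contextlib import closing


def sqlite_path_from_url(db_url):
    """Extract the file path from a sqlite:/// URL, or return None for other schemes."""
    prefix = "sqlite:///"
    return db_url[len(prefix):] if db_url.startswith(prefix) else None


def list_tables(db_path):
    """Return the table names found in a SQLite database."""
    # Note: sqlite3.connect creates the file if it does not exist yet.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    url = os.environ.get("UTU_DB_URL", "sqlite:///test.db")
    path = sqlite_path_from_url(url)
    if path is None:
        print("Not a SQLite URL; this sketch only handles sqlite:///")
    elif not os.path.exists(path):
        print(f"Database file not found: {path}")
    else:
        print("Tables:", list_tables(path))
```

An empty table list after a finished evaluation run suggests the scripts and the dashboard are pointing at different database files.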

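Advanced Topics

Evaluating on Standard Benchmarks

To evaluate on standard benchmarks such as WebWalkerQA or GAIA:

# Prepare the benchmark dataset (one-time setup)
python scripts/data/process_web_walker_qa.py

# Run the evaluation on WebWalkerQA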
python scripts/run_eval.py \
  --config_name ww \
  --exp_id my_ww_run \
  --dataset WebWalkerQA_15 \
  --concurrency 5

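For more details on benchmarks, see the evaluation documentation.

Configuring the Judge Model

For evaluations with ground truth, configure the judge model in your evaluation config: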

judge_model:
  model_provider:
    type: ${oc.env:JUDGE_LLM_TYPE}
    model: ${oc.env:JUDGE_LLM_MODEL}
    base_url: ${oc.env:JUDGE_LLM_BASE_URL}
    api_key: ${oc.env:JUDGE_LLM_API_KEY}
  model_params:
    temperature: 0.5
judge_concurrency: 16

Set the corresponding environment variables in your .env file:

JUDGE_LLM_TYPE=chat.completions
JUDGE_LLM_MODEL=deepseek-chat
JUDGE_LLM_BASE_URL=https://api.deepseek.com/v1
JUDGE_LLM_API_KEY=your-judge-api-key
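
The `${oc.env:NAME}` placeholders in the judge configuration are resolved from environment variables when the config is loaded. The following stdlib-only sketch imitates that substitution so you can see which values a config string would pick up; it is a simplified illustration with a hypothetical function name, not the resolver the framework actually uses, and it does not support default values.

```python
import os
import re

# Matches placeholders of the form ${oc.env:NAME}.
_ENV_REF = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(text, env=None):
    """Replace each ${oc.env:NAME} in text with the value of NAME.

    Raises KeyError if a referenced variable is not set.
    """
    env = os.environ if env is None else env
    return _ENV_REF.sub(lambda match: env[match.group(1)], text)


if __name__ == "__main__":
    sample_env = {"JUDGE_LLM_MODEL": "deepseek-chat"}
    print(resolve_env_refs("model: ${oc.env:JUDGE_LLM_MODEL}", sample_env))
```

A KeyError here points at a JUDGE_LLM_* variable that is missing from the environment, which is a common cause of judge-step failures.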