Agent Evals - Docs by LangChain

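Evaluations ("evals") measure how well an agent performs by assessing its execution trajectory: the sequence of messages and tool calls it produces. Unlike integration tests, which verify basic correctness, evals score agent behavior against a reference answer or a rubric, which makes them useful for catching regressions when you change prompts, tools, or models. An evaluator is a function that takes the agent's outputs (and, optionally, reference outputs) and returns a score: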

function evaluator({ outputs, referenceOutputs }: {
  outputs: Record<string, any>;
  referenceOutputs: Record<string, any>;
}) {
  const outputMessages = outputs.messages;
  const referenceMessages = referenceOutputs.messages;
  // compareMessages is a placeholder for your own comparison logic.
  const score = compareMessages(outputMessages, referenceMessages);
  return { key: "evaluator_score", score: score };
}
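
The `compareMessages` helper in the snippet above is not defined there. As a minimal sketch, assuming messages are plain objects with a `role` and an optional `tool_calls` field (a hypothetical simplified shape, not the LangChain message classes), it could compare the ordered tool-call names of the two trajectories:

```typescript
// Hypothetical simplified message shape, for illustration only.
type ToolCall = { name: string; args: Record<string, unknown> };
type SimpleMessage = { role: string; content: string; tool_calls?: ToolCall[] };

// Collect tool names in the order they were called across the trajectory.
function toolCallNames(messages: SimpleMessage[]): string[] {
  return messages.flatMap((m) => (m.tool_calls ?? []).map((c) => c.name));
}

// Returns true when both trajectories call the same tools in the same order.
// Message content is deliberately ignored.
function compareMessages(outputs: SimpleMessage[], reference: SimpleMessage[]): boolean {
  const actual = toolCallNames(outputs);
  const expected = toolCallNames(reference);
  return actual.length === expected.length && actual.every((name, i) => name === expected[i]);
}

const reference: SimpleMessage[] = [
  { role: "user", content: "What's the weather in SF?" },
  { role: "assistant", content: "", tool_calls: [{ name: "get_weather", args: { city: "SF" } }] },
  { role: "tool", content: "75 and sunny" },
  { role: "assistant", content: "It is 75 and sunny in SF." },
];

// Same tool call, differently worded final answer.
const matching: SimpleMessage[] = [
  ...reference.slice(0, 3),
  { role: "assistant", content: "Sunny, 75 degrees." },
];

// No tool call at all.
const missing: SimpleMessage[] = [
  reference[0],
  { role: "assistant", content: "Probably sunny." },
];

const matchScore = compareMessages(matching, reference);
const missScore = compareMessages(missing, reference);
```

A boolean score like this is only one option; a comparison function could equally return a number between 0 and 1 if partial credit is useful.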

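The agentevals package provides prebuilt evaluators for agent trajectories. You can evaluate by trajectory match (a deterministic comparison) or with an LLM judge (a qualitative assessment):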

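Approach	When to use
Trajectory match	You know the expected tool calls and want fast, deterministic, zero-cost checks
LLM-as-judge	You want to assess overall quality and reasoning without strict expectations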

Install AgentEvals

npm install agentevals @langchain/core

Alternatively, clone the AgentEvals repository directly.

Trajectory match evaluator

AgentEvals provides the createTrajectoryMatchEvaluator function to match an agent's trajectory against a reference trajectory. There are four modes:

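Mode	Description	Use case
`strict`	Message structure and tool calls must match exactly and in the same order (message content may differ)	Testing a specific execution order (for example, a policy lookup before authorization)
`unordered`	Same message structure and tool calls as the reference, but tool calls may occur in any order	Verifying that information was retrieved when the call order does not matter
`subset`	The agent calls only tools that appear in the reference (no extra tools)	Ensuring the agent does not exceed the expected scope
`superset`	The agent calls at least the tools in the reference (extra tools allowed)	Verifying that the minimum required actions were taken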
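
To make the difference between the four modes concrete, here is a simplified sketch that compares lists of tool names only. It does not call agentevals: `toolNamesMatch` is a hypothetical helper, and the real evaluators also take message structure and tool arguments into account.

```typescript
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Simplified illustration: compares tool names only, ignoring arguments and messages.
function toolNamesMatch(actual: string[], reference: string[], mode: MatchMode): boolean {
  const sameSequence = (a: string[], b: string[]) =>
    a.length === b.length && a.every((name, i) => name === b[i]);
  const includesAll = (haystack: string[], needles: string[]) =>
    needles.every((name) => haystack.includes(name));

  switch (mode) {
    case "strict":
      // Same tools, same order.
      return sameSequence(actual, reference);
    case "unordered":
      // Same tools, any order.
      return sameSequence([...actual].sort(), [...reference].sort());
    case "subset":
      // Every tool the agent called appears in the reference.
      return includesAll(reference, actual);
    case "superset":
      // Every tool in the reference was called by the agent.
      return includesAll(actual, reference);
  }
}

const reference = ["get_weather", "get_events"];
const swapped = ["get_events", "get_weather"];
const fewer = ["get_weather"];
const extra = ["get_weather", "get_events", "get_detailed_forecast"];

const results = {
  strictSwapped: toolNamesMatch(swapped, reference, "strict"), // false
  unorderedSwapped: toolNamesMatch(swapped, reference, "unordered"), // true
  subsetFewer: toolNamesMatch(fewer, reference, "subset"), // true
  subsetExtra: toolNamesMatch(extra, reference, "subset"), // false
  supersetExtra: toolNamesMatch(extra, reference, "superset"), // true
  supersetFewer: toolNamesMatch(fewer, reference, "superset"), // false
};
```

Note that `subset` and `superset` are mirror images: a trajectory with fewer calls passes `subset` but fails `superset`, and one with extra calls does the opposite.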

The examples below share a common setup: an agent with a get_weather tool.

import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "claude-sonnet-4-6",
  tools: [getWeather],
});

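Strict match

The strict mode requires the trajectories to contain the same messages, with the same tool calls in the same order, although message content may differ. This is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before an authorization action.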

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "strict",
});

async function testWeatherToolCalledStrict() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in San Francisco?")]
  });

  const referenceTrajectory = [
    new HumanMessage("What's the weather in San Francisco?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "San Francisco" } }
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in San Francisco.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in San Francisco is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory
  });
  expect(evaluation.score).toBe(true);
}

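Unordered match

The unordered mode allows the same tool calls to appear in any order. This is helpful when you only care that specific information was retrieved, not the order in which it was retrieved. For example, an agent might look up both the weather and the events for a city through separate tool calls.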

const getEvents = tool(
  async ({ city }: { city: string }) => {
    return `Concert at the park in ${city} tonight.`;
  },
  {
    name: "get_events",
    description: "Get information about events in a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "claude-sonnet-4-6",
  tools: [getWeather, getEvents],
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "unordered",
});

async function testMultipleToolsAnyOrder() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's happening in San Francisco today?")]
  });

  const referenceTrajectory = [
    new HumanMessage("What's happening in San Francisco today?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_events", args: { city: "SF" } },
        { id: "call_2", name: "get_weather", args: { city: "SF" } },
      ]
    }),
    new ToolMessage({
      content: "Concert at the park in SF tonight.",
      tool_call_id: "call_1"
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in SF.",
      tool_call_id: "call_2"
    }),
    new AIMessage("Today in SF: 75 degrees and sunny with a concert at the park tonight."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  expect(evaluation.score).toBe(true);
}

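Subset and superset match

The superset and subset modes match partial trajectories. The superset mode verifies that the agent called at least the tools in the reference trajectory and allows additional tool calls. The subset mode ensures that the agent did not call any tools beyond those in the reference trajectory.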

const getDetailedForecast = tool(
  async ({ city }: { city: string }) => {
    return `Detailed forecast for ${city}: sunny all week.`;
  },
  {
    name: "get_detailed_forecast",
    description: "Get a detailed weather forecast for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "claude-sonnet-4-6",
  tools: [getWeather, getDetailedForecast],
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "superset",
});

async function testAgentCallsRequiredToolsPlusExtra() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Boston?")]
  });

  const referenceTrajectory = [
    new HumanMessage("What's the weather in Boston?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "Boston" } },
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in Boston.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in Boston is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  expect(evaluation.score).toBe(true);
}

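You can also set the toolArgsMatchMode property, the toolArgsMatchOverrides property, or both, to customize how the evaluator decides whether tool calls in the actual and reference trajectories are equal. By default, two tool calls are considered equal only if they call the same tool with exactly the same arguments. See the repository documentation for details.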
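
The idea behind per-tool argument matching can be sketched without the library. In the illustration below, `exactMatch`, `overrides`, and `toolCallsEqual` are hypothetical names, not agentevals APIs; the accepted values for toolArgsMatchMode and toolArgsMatchOverrides are described in the repository documentation. The sketch applies exact argument equality by default and a case-insensitive comparison for `get_weather`:

```typescript
type Args = Record<string, unknown>;
type Call = { name: string; args: Args };
type ArgsMatcher = (actual: Args, reference: Args) => boolean;

// Default behavior in this sketch: same keys with strictly equal values (shallow).
const exactMatch: ArgsMatcher = (actual, reference) => {
  const keys = Object.keys(reference);
  return (
    keys.length === Object.keys(actual).length &&
    keys.every((key) => actual[key] === reference[key])
  );
};

// Hypothetical per-tool override: compare the city argument case-insensitively.
const overrides: Record<string, ArgsMatcher> = {
  get_weather: (actual, reference) => {
    const a = actual.city;
    const b = reference.city;
    return typeof a === "string" && typeof b === "string" && a.toLowerCase() === b.toLowerCase();
  },
};

// Two calls are equal when the tool names match and the matcher accepts the arguments.
function toolCallsEqual(actual: Call, reference: Call): boolean {
  if (actual.name !== reference.name) return false;
  const matcher = overrides[actual.name] ?? exactMatch;
  return matcher(actual.args, reference.args);
}

const weatherLoose = toolCallsEqual(
  { name: "get_weather", args: { city: "sf" } },
  { name: "get_weather", args: { city: "SF" } },
);
const eventsStrict = toolCallsEqual(
  { name: "get_events", args: { city: "sf" } },
  { name: "get_events", args: { city: "SF" } },
);
const eventsSame = toolCallsEqual(
  { name: "get_events", args: { city: "SF" } },
  { name: "get_events", args: { city: "SF" } },
);
```

Here `weatherLoose` is true because the override ignores case, while `eventsStrict` is false because `get_events` falls back to exact equality.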

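LLM-as-judge evaluator

You can use the createTrajectoryLLMAsJudge function to evaluate an agent's execution path with an LLM. Unlike the trajectory match evaluators, it does not require a reference trajectory, but you can supply one if available.

Without a reference trajectory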

import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

async function testTrajectoryQuality() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Seattle?")]
  });

  const evaluation = await evaluator({
    outputs: result.messages,
  });
  expect(evaluation.score).toBe(true);
}

With a reference trajectory

If you have a reference trajectory, use the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt:

import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE } from "agentevals";

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
});

const evaluation = await evaluator({
  outputs: result.messages,
  referenceOutputs: referenceTrajectory,
});

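For more ways to customize how the LLM evaluates a trajectory, see the repository.

Run evals in LangSmith

To track experiments over time, log evaluator results to LangSmith. First, set the required environment variables: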

export LANGSMITH_API_KEY="your_langsmith_api_key"
export LANGSMITH_TRACING="true"

LangSmith offers two main ways to run evals: the Vitest/Jest integration and the evaluate function.

Using the Vitest/Jest integration

import * as ls from "langsmith/vitest";
// import * as ls from "langsmith/jest";

import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";

const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

ls.describe("trajectory accuracy", () => {
  ls.test("accurate trajectory", {
    inputs: {
      messages: [
        { role: "user", content: "What's the weather in San Francisco?" }
      ]
    },
    referenceOutputs: {
      messages: [
        new HumanMessage("What's the weather in San Francisco?"),
        new AIMessage({
          content: "",
          tool_calls: [
            { id: "call_1", name: "get_weather", args: { city: "SF" } }
          ]
        }),
        new ToolMessage({
          content: "It's 75 degrees and sunny in San Francisco.",
          tool_call_id: "call_1"
        }),
        new AIMessage("It's 75 degrees and sunny in San Francisco."),
      ],
    },
  }, async ({ inputs, referenceOutputs }) => {
    const result = await agent.invoke({
      messages: [new HumanMessage("What's the weather in San Francisco?")]
    });

    ls.logOutputs({ messages: result.messages });

    await trajectoryEvaluator({
      inputs,
      outputs: result.messages,
      referenceOutputs,
    });
  });
});

Run the evals with your test runner:

vitest run test_trajectory.eval.ts
# or
jest test_trajectory.eval.ts

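Using the evaluate function

Create a LangSmith dataset and use the evaluate function. The dataset must have the following schema:

input: {"messages": [...]}, the input messages used to call the agent.
output: {"messages": [...]}, the expected message history in the agent's output. For trajectory evaluation, you can choose to keep only the assistant messages.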
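
As a rough illustration of that schema, the sketch below types a single dataset example and shows the optional step of keeping only assistant messages in the expected output. The `DatasetMessage` and `DatasetExample` types and the exact message field layout are assumptions made for this example; only the top-level `input` and `output` objects with a `messages` array come from the schema above.

```typescript
// Assumed message layout, for illustration only.
type DatasetMessage = {
  role: "user" | "assistant" | "tool";
  content: string;
  tool_calls?: { id: string; name: string; args: Record<string, unknown> }[];
  tool_call_id?: string;
};

// One dataset example following the input/output schema described above.
type DatasetExample = {
  input: { messages: DatasetMessage[] };
  output: { messages: DatasetMessage[] };
};

const example: DatasetExample = {
  input: {
    messages: [{ role: "user", content: "What's the weather in San Francisco?" }],
  },
  output: {
    messages: [
      { role: "user", content: "What's the weather in San Francisco?" },
      {
        role: "assistant",
        content: "",
        tool_calls: [{ id: "call_1", name: "get_weather", args: { city: "San Francisco" } }],
      },
      { role: "tool", content: "It's 75 degrees and sunny in San Francisco.", tool_call_id: "call_1" },
      { role: "assistant", content: "It's 75 degrees and sunny in San Francisco." },
    ],
  },
};

// Optional: keep only the assistant messages in the expected trajectory.
const assistantOnly = example.output.messages.filter((m) => m.role === "assistant");
```

Keeping only assistant messages shortens the reference while still preserving the tool calls the agent was expected to make.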

import { evaluate } from "langsmith/evaluation";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";

const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

async function runAgent(inputs: any) {
  const result = await agent.invoke(inputs);
  return result.messages;
}

await evaluate(
  runAgent,
  {
    data: "your_dataset_name",
    evaluators: [trajectoryEvaluator],
  }
);

To learn more about evaluating agents, see the LangSmith documentation.
