Grading rubrics - Docs by LangChain

RubricMiddleware requires deepagents>=0.6.5. It is in beta, and its API may change.

Some agent tasks have a clear definition of “done” that a working model alone may not reliably reach on the first attempt: a haiku that fits the syllable pattern, a refactor where all tests pass, a report that covers every required section. RubricMiddleware lets you declare what done looks like as a rubric, then has the agent self-evaluate and iterate until the rubric is satisfied (or the configured maximum iteration cap is reached).

LLM-as-a-judge is a pattern in which one language model evaluates another model's output against defined criteria. In LangSmith evaluations, LLM-as-a-judge evaluators score application outputs offline and in batches. RubricMiddleware applies the same pattern at runtime: after the deep agent produces output, a dedicated grader model reviews the transcript against your rubric and drives revision until every criterion passes (or the configured iteration cap is reached).

When the deep agent finishes reasoning, the LLM-as-a-judge grader sub-agent reviews the output and returns a verdict. If the verdict is needs_revision, per-criterion feedback is injected back into the conversation and the agent runs again. The loop terminates on satisfied, max_iterations_reached, failed, or grader_error.

Configure the middleware

When you call create_deep_agent, add RubricMiddleware to the middleware list:

from deepagents import RubricMiddleware, create_deep_agent
from langgraph.checkpoint.memory import InMemorySaver

agent = create_deep_agent(
    model="google_genai:gemini-3.5-flash",
    middleware=[
        RubricMiddleware(
            model="anthropic:claude-haiku-4-5",
            max_iterations=3,
        ),
    ],
    checkpointer=InMemorySaver(),
)

Argument	Required	Default	Description
`model`	Yes	`None`	Chat model used by the LLM-as-a-judge grader sub-agent. Accepts a `"provider:model-id"` string or a `BaseChatModel` instance. Typically a smaller or cheaper model than the deep agent's working model.
`system_prompt`	No	Built-in grader prompt	Custom grading instructions. If omitted, the middleware falls back to a default system prompt that teaches the grader the verdict format and the tools available to it.
`tools`	No	`None`	Tools the grader can call to gather evidence before producing a verdict (run tests, count tokens, read files). If omitted, the grader reasons from the transcript alone.
`max_iterations`	No	`3`	Hard cap on grader iterations per rubric attempt. The maximum accepted value is 20. If the cap is reached without a `satisfied` verdict, the agent terminates with the `max_iterations_reached` status.
`on_evaluation`	No	`None`	Optional callback invoked with each `RubricEvaluation` after every grading iteration, whether you use `invoke()`, `stream()`, or `stream_events()`. Useful for logging, custom metrics, eval datasets, or UI updates.

Pass rubric on invocation

Pass a rubric string on the invocation state to start the self-evaluation loop. Use invoke() for a single blocking call, or pair stream_events(..., version="v3") with CustomTransformer to receive grading events through stream.custom as they happen:

invoke()

from langchain.messages import HumanMessage

config = {"configurable": {"thread_id": "my-rubric-thread"}}
result = agent.invoke(
    {
        "messages": [HumanMessage("Write a haiku about spring.")],
        "rubric": (
            "- The poem has three lines\n"
            "- Lines follow a 5-7-5 syllable pattern\n"
            "- The theme is spring"
        ),
    },
    config=config,
)

stream_events()

from langchain.messages import HumanMessage
from langgraph.stream import CustomTransformer

config = {"configurable": {"thread_id": "my-rubric-thread"}}
stream = agent.stream_events(
    {
        "messages": [HumanMessage("Write a haiku about spring.")],
        "rubric": (
            "- The poem has three lines\n"
            "- Lines follow a 5-7-5 syllable pattern\n"
            "- The theme is spring"
        ),
    },
    config=config,
    version="v3",
    transformers=[CustomTransformer],
)

for event in stream.custom:
    event_type = event.get("type")
    if event_type == "rubric_evaluation_start":
        print(
            f"Grading iteration {event['iteration']} "
            f"(run {event['grading_run_id']})"
        )
    elif event_type == "rubric_evaluation_end":
        print(f"Verdict: {event['result']} — {event.get('explanation', '')}")

Rubric grading emits the following custom events on stream.custom:

Event	When fired	Payload fields
`rubric_evaluation_start`	Before the grader runs.	`type`: event name; `grading_run_id`: shared by all events within one rubric attempt; `iteration`: zero-based index of the current grader pass
`rubric_evaluation_end`	After the grader returns, or after a grader exception.	`type`: event name; `grading_run_id`: shared by all events within one rubric attempt; `iteration`: zero-based index of the current grader pass; `result`: the verdict for this pass; `explanation`: summary returned by the grader; `criteria`: per-criterion verdicts
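
The payload fields above are enough to build a per-run summary from collected events. The following is a minimal sketch that works on plain dictionaries; `summarize_rubric_events` and the sample events are hypothetical and hand-written for illustration, not output captured from a real run.

```python
from collections import defaultdict


def summarize_rubric_events(events: list[dict]) -> dict[str, list[str]]:
    """Group the verdict of each finished grader pass by grading_run_id."""
    verdicts: dict[str, list[str]] = defaultdict(list)
    for event in events:
        # Only the "end" event carries a result; "start" events are skipped.
        if event.get("type") == "rubric_evaluation_end":
            verdicts[event["grading_run_id"]].append(event["result"])
    return dict(verdicts)


# Hand-written sample events using the documented payload fields.
sample_events = [
    {"type": "rubric_evaluation_start", "grading_run_id": "run-1", "iteration": 0},
    {
        "type": "rubric_evaluation_end",
        "grading_run_id": "run-1",
        "iteration": 0,
        "result": "needs_revision",
        "explanation": "Line 2 has 6 syllables.",
        "criteria": [
            {"name": "5-7-5", "passed": False, "gap": "Line 2 needs 7 syllables."}
        ],
    },
    {"type": "rubric_evaluation_start", "grading_run_id": "run-1", "iteration": 1},
    {
        "type": "rubric_evaluation_end",
        "grading_run_id": "run-1",
        "iteration": 1,
        "result": "satisfied",
        "explanation": "All criteria pass.",
        "criteria": [{"name": "5-7-5", "passed": True}],
    },
]

summary = summarize_rubric_events(sample_events)
print(summary)
```

In a streaming setup, the same function could be fed the events read from stream.custom once the stream is exhausted.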

Rubric verdicts

After the deep agent finishes reasoning and produces output, the LLM-as-a-judge grader sub-agent reviews the output against the rubric and produces one of the following verdicts:

Status	Meaning	Loops back?
`satisfied`	Every criterion in the rubric passes.	No
`needs_revision`	At least one criterion fails; the grader's feedback is injected and the agent runs again.	Yes
`max_iterations_reached`	The grader still wants a revision, but `max_iterations` has been reached.	No
`failed`	The grader judges the rubric to be malformed, or cannot evaluate it from the transcript.	No
`grader_error`	The LLM-as-a-judge grader sub-agent itself raised an exception (provider timeout, missing credentials, malformed structured response, and so on).	No
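
The control flow implied by this table can be sketched in plain Python. This is a conceptual illustration only, not the middleware's implementation: `rubric_loop`, `stub_agent`, and `stub_grader` are hypothetical stand-ins for the control loop, the deep agent, and the grader sub-agent, and no model is called.

```python
from typing import Callable

# Grader verdicts that end the loop immediately, per the table above.
TERMINAL_GRADER_VERDICTS = {"satisfied", "failed"}


def rubric_loop(
    run_agent: Callable[[list[str]], str],
    grade: Callable[[str], dict],
    max_iterations: int = 3,
) -> tuple[str, str, int]:
    """Run agent -> grade -> revise until a terminal status is reached.

    Returns (status, final_output, grader_passes).
    """
    feedback: list[str] = []
    output = ""
    for iteration in range(max_iterations):
        output = run_agent(feedback)
        try:
            verdict = grade(output)
        except Exception:
            # A crash inside the grader ends the loop in this sketch.
            return "grader_error", output, iteration + 1
        if verdict["result"] in TERMINAL_GRADER_VERDICTS:
            return verdict["result"], output, iteration + 1
        # needs_revision: feed the per-criterion gaps back to the agent.
        feedback = [c["gap"] for c in verdict["criteria"] if not c["passed"]]
    return "max_iterations_reached", output, max_iterations


def stub_agent(feedback: list[str]) -> str:
    # Hypothetical agent: writes two lines first, three once it gets feedback.
    return "one\ntwo\nthree" if feedback else "one\ntwo"


def stub_grader(output: str) -> dict:
    # Hypothetical grader with a single criterion: "has three lines".
    passed = len(output.splitlines()) == 3
    criterion = {"name": "three lines", "passed": passed}
    if not passed:
        criterion["gap"] = "Add lines until there are exactly three."
    result = "satisfied" if passed else "needs_revision"
    return {"result": result, "criteria": [criterion]}


status, final_output, passes = rubric_loop(stub_agent, stub_grader)
print(status, passes)
```

With these stubs the first pass returns needs_revision and the second returns satisfied, so the loop stops after two grader passes.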

Observe iteration progress

on_evaluation is a callback that fires with the grader verdict after every grading iteration, whether you call invoke() or stream_events(). If you are not reading rubric events from stream.custom (via CustomTransformer) and are not tracing the run with LangSmith, it is the primary way to inspect what happened during grading.

from deepagents import RubricMiddleware, create_deep_agent
from deepagents.middleware.rubric import RubricEvaluation
from langchain.messages import HumanMessage
from langgraph.checkpoint.memory import InMemorySaver


def log_evaluation(ev: RubricEvaluation) -> None:
    print(f"iteration {ev['iteration']}: {ev['result']} — {ev['explanation']}")


agent = create_deep_agent(
    model="google_genai:gemini-3.5-flash",
    middleware=[
        RubricMiddleware(
            model="anthropic:claude-haiku-4-5",
            on_evaluation=log_evaluation,
        ),
    ],
    checkpointer=InMemorySaver(),
)

config = {"configurable": {"thread_id": "rubric-eval-session"}}
agent.invoke(
    {
        "messages": [HumanMessage("Write a one-sentence summary of photosynthesis.")],
        "rubric": (
            "- The answer is one sentence\n"
            "- The answer mentions light and chlorophyll"
        ),
    },
    config=config,
)

After each grader pass, the middleware calls your function with a RubricEvaluation dictionary, which contains:

Field	Type	Description
`grading_run_id`	`str`	Identifier shared by every evaluation in one rubric attempt. A new run starts when the caller provides a different `rubric`, or when the same `rubric` is invoked again after a terminal verdict.
`iteration`	`int`	Zero-based index of the current grader pass within the run.
`result`	`str`	The grader's verdict for this pass: `satisfied`, `needs_revision`, `failed`, or `grader_error`.
`explanation`	`str`	Free-form summary returned by the grader. For infrastructure failures, this includes the exception type and message.
`criteria`	`list`	Per-criterion verdicts. Each entry is either `{name, passed: true}` or `{name, passed: false, gap}`, where `gap` is actionable feedback for the failing criterion.
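
A callback often only needs the failing criteria. The sketch below pulls the `gap` feedback out of an evaluation, assuming the dictionary shape described in the table above; `failing_gaps` and `sample_evaluation` are hypothetical names, and the sample values are made up.

```python
def failing_gaps(evaluation: dict) -> list[str]:
    """Return "name: gap" strings for each criterion that did not pass."""
    return [
        f"{criterion['name']}: {criterion['gap']}"
        for criterion in evaluation.get("criteria", [])
        if not criterion["passed"]
    ]


# Hand-written evaluation in the documented shape.
sample_evaluation = {
    "grading_run_id": "run-1",
    "iteration": 0,
    "result": "needs_revision",
    "explanation": "One criterion fails.",
    "criteria": [
        {"name": "one sentence", "passed": True},
        {
            "name": "mentions chlorophyll",
            "passed": False,
            "gap": "Name chlorophyll explicitly.",
        },
    ],
}

gaps = failing_gaps(sample_evaluation)
print(gaps)
```

Because a `grader_error` evaluation carries an empty `criteria` list, the helper returns an empty list in that case rather than raising.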

Grader pass events

Event	Description
Successful grading	Fires once per pass, including intermediate `needs_revision` verdicts and the final `satisfied` or `failed` verdict. When the grader returns `needs_revision` but `max_iterations` has been reached, the callback still receives `result: "needs_revision"` (the grader's verdict). The run's terminal status, `max_iterations_reached`, is recorded on the private state key `_rubric_status`, not on the evaluation record. To branch on cap exhaustion, check `_rubric_status` after `invoke` completes, or read the last entry of `_rubric_evaluations` together with `_rubric_iterations`.
Grader exceptions	Fires with `result: "grader_error"`, an explanation derived from the exception, and an empty `criteria` list.
Errors in your callback	Exceptions are logged and suppressed, and the grading loop continues. Do not use `on_evaluation` to force control flow (for example, raising to stop the agent).
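
Since the callback cannot steer the loop, a common approach is to record evaluations and inspect them afterwards. The sketch below is one way to do that, under stated assumptions: `EvaluationCollector` and `likely_hit_cap` are hypothetical helpers, the evaluations are simulated by hand, and the cap check is only a heuristic based on the callback records. The `_rubric_status` state key described above remains the authoritative signal.

```python
class EvaluationCollector:
    """Callable that stores every evaluation for inspection after the run."""

    def __init__(self) -> None:
        self.evaluations: list[dict] = []

    def __call__(self, evaluation: dict) -> None:
        # Only record a copy; control flow must not depend on this callback.
        self.evaluations.append(dict(evaluation))


def likely_hit_cap(evaluations: list[dict], max_iterations: int) -> bool:
    """Heuristic: the last pass wanted a revision and no passes were left."""
    if not evaluations:
        return False
    last = evaluations[-1]
    return (
        last["result"] == "needs_revision"
        and last["iteration"] + 1 >= max_iterations
    )


# Simulated callback invocations. In a real run the middleware would call
# the collector when it is passed as on_evaluation=collector.
collector = EvaluationCollector()
for index in range(3):
    collector(
        {
            "grading_run_id": "run-1",
            "iteration": index,
            "result": "needs_revision",
            "explanation": "Still failing.",
            "criteria": [],
        }
    )

hit_cap = likely_hit_cap(collector.evaluations, max_iterations=3)
print(hit_cap)
```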

Persist rubrics across invocations

A single agent.invoke() or agent.stream_events() call runs the rubric loop to completion and ends with a terminal verdict: satisfied, failed, or max_iterations_reached. To carry a rubric over to follow-up invocations, attach a checkpointer and pass the same thread_id on each invocation. The rubric then persists across future invoke() or stream_events() calls until you pass a new one.

Interrupts (KeyboardInterrupt, asyncio.CancelledError) propagate uncaught out of agent.invoke() and agent.stream_events(). On a checkpointed thread, the next call with the same rubric resumes the in-flight grading run.
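
The resume-versus-restart behavior described on this page can be written down as a small decision function. This is a sketch of the documented rule as read from this page, not the library's code: `starts_new_grading_run` and `TERMINAL_STATUSES` are hypothetical names, and treating `grader_error` as terminal here is an assumption based on the loop description above.

```python
from typing import Optional

# Statuses this page describes as ending the loop (assumption: all four
# count as "terminal" for the purpose of starting a new run).
TERMINAL_STATUSES = {
    "satisfied",
    "failed",
    "max_iterations_reached",
    "grader_error",
}


def starts_new_grading_run(
    stored_rubric: Optional[str],
    stored_status: Optional[str],
    incoming_rubric: str,
) -> bool:
    """Return True if this invocation would begin a fresh grading run."""
    if stored_rubric is None:
        return True  # nothing checkpointed on this thread yet
    if incoming_rubric != stored_rubric:
        return True  # the caller supplied a different rubric
    # Same rubric: a new run starts only if the previous one already ended.
    return stored_status in TERMINAL_STATUSES


rubric = "- The poem has three lines"
print(starts_new_grading_run(rubric, None, rubric))
print(starts_new_grading_run(rubric, "satisfied", rubric))
```

The first call models an interrupted run with no terminal status, which resumes; the second models a finished run, which starts over.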

Example: generate vetted Python code

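The example below builds a deep agent that writes a find_duplicates function. It defines RubricMiddleware once, attaches it to the agent, and passes a rubric string at invoke time. Rather than asking the grader to reason about correctness in the abstract, the example gives it a run_test_suite tool that verifies behavior directly. The grader calls this tool to gather additional information before producing a verdict; when no tools are provided, it falls back to reasoning from the transcript.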

Define RubricMiddleware

This middleware adds an LLM-as-a-judge grader loop on top of the base agent. Configure the grader model, an optional custom prompt, tools for evidence gathering, and the maximum iteration cap.

from deepagents import RubricMiddleware
from langchain.tools import tool


@tool
def run_test_suite(code: str) -> dict:
    """Run the find_duplicates test suite against Python source code."""
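    # Execute the candidate source in a fresh namespace. exec runs the code
    # in-process with full builtins, so only use this with code you trust
    # or inside a sandbox.
    namespace: dict = {"__builtins__": __builtins__}
    try:
        exec(code, namespace)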
    except Exception as exc:
        return {"ok": False, "failures": [f"Failed to execute code: {exc}"]}

    find_duplicates = namespace.get("find_duplicates")
    if find_duplicates is None:
        return {"ok": False, "failures": ["Function find_duplicates is not defined"]}

    tests = [
        ("test_basic", [1, 2, 2, 3, 1], [2, 1]),
        ("test_empty", [], []),
        ("test_no_duplicates", [1, 2, 3], []),
        ("test_unhashable", [[1], [1], 2], [[1]]),
    ]
    failures: list[str] = []
    for name, args, expected in tests:
        try:
            actual = find_duplicates(args)
            if actual != expected:
                failures.append(f"{name}: expected {expected}, got {actual}")
        except Exception as exc:
            failures.append(f"{name}: {exc}")

    return {"ok": not failures, "failures": failures}


rubric_middleware = RubricMiddleware(
    model="google_genai:gemini-3.5-flash",
    system_prompt="You are a code reviewer grading generated code against a rubric.",
    tools=[run_test_suite],
    max_iterations=5,
)

Pass it to a deep agent

The agent's system_prompt tells it how to do the work, while the rubric tells the grader how to judge the work.

from deepagents import create_deep_agent
from langgraph.checkpoint.memory import InMemorySaver

agent = create_deep_agent(
    model="google_genai:gemini-3.5-flash",
    system_prompt=(
        "You are a careful Python engineer. Write correct, readable code. "
        "Follow the user's instructions exactly."
    ),
    middleware=[rubric_middleware],
    checkpointer=InMemorySaver(),
)

Invoke with a human message and rubric

At invocation time, provide the user request in messages and a newline-delimited checklist in rubric that the grader must mark as satisfied. When the input state does not provide a rubric, the middleware does not run.

from langchain.messages import HumanMessage

if __name__ == "__main__":
    result = agent.invoke(
        {
            "messages": [
                HumanMessage(
                    content=(
                        "Write a Python function `find_duplicates(lst)` that returns a list of "
                        "all elements that appear more than once in the input list, in the order "
                        "they first appear."
                    )
                )
            ],
            "rubric": (
                "- All tests pass in run_test_suite\n"
                "- The function is named `find_duplicates` and accepts a single list argument\n"
            ),
        },
        config={"configurable": {"thread_id": "code-generation-session"}},
    )
    print(result["messages"][-1].text)
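
The rubric strings on this page all follow the same shape: one criterion per line, each prefixed with "- ". A small helper can build that string from a list; `build_rubric` is a hypothetical convenience function, not part of deepagents.

```python
def build_rubric(criteria: list[str]) -> str:
    """Join criteria into a newline-delimited "- item" checklist."""
    cleaned = [criterion.strip() for criterion in criteria if criterion.strip()]
    if not cleaned:
        raise ValueError("A rubric needs at least one criterion.")
    return "\n".join(f"- {criterion}" for criterion in cleaned)


rubric = build_rubric(
    [
        "All tests pass in run_test_suite",
        "The function is named `find_duplicates` and accepts a single list argument",
    ]
)
print(rubric)
```

The resulting string can be passed as the `rubric` value in the invocation state.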

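After the agent produces output, the grader takes over and checks the output against each criterion: for example, whether test_unhashable fails with a TypeError when the input contains unhashable types. If there are any problems, the grader provides this feedback, and the agent then revises the implementation and returns it to the grader.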
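
For reference, here is one implementation that would pass the four cases in run_test_suite above. It is an illustrative sketch, not the agent's actual output. It avoids sets so that unhashable elements such as nested lists do not raise TypeError, and it reports each duplicate at the point where its second occurrence is seen, which is the ordering the test_basic case expects.

```python
def find_duplicates(lst: list) -> list:
    """Return each repeated element once, in the order its repeat is seen.

    Uses equality comparisons instead of hashing, so unhashable elements
    are supported at the cost of O(n^2) time.
    """
    seen: list = []
    duplicates: list = []
    for item in lst:
        if item in seen:
            # Record a repeated element only the first time it repeats.
            if item not in duplicates:
                duplicates.append(item)
        else:
            seen.append(item)
    return duplicates


print(find_duplicates([1, 2, 2, 3, 1]))
print(find_duplicates([[1], [1], 2]))
```

A set-based version would be faster for hashable inputs, but it would fail the test_unhashable case with a TypeError.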
