A walkthrough of how Reforge Build uses Workflows to evaluate the performance of its chat-based agents.
An evaluation (“eval”) is a test for an AI system: give the system an input, then apply grading logic to its output to measure success.
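In code, that loop is small. Here is a minimal sketch in TypeScript; runAgent and gradeOutput are hypothetical stand-ins, not our actual implementation:

```ts
// Illustrative only: an eval is input -> output -> grade.
type Case = { input: string; expected: string };

// Hypothetical stand-in for the system under test.
async function runAgent(input: string): Promise<string> {
  return `response to: ${input}`;
}

// Hypothetical grading logic: returns a score between 0 and 1.
function gradeOutput(output: string, expected: string): number {
  return output.includes(expected) ? 1 : 0;
}

// The eval itself: run the system on the input, grade the output.
async function runEval(c: Case): Promise<number> {
  const output = await runAgent(c.input);
  return gradeOutput(output, c.expected);
}
```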
Generating evals
To generate evals that measure something meaningful, we lean on this post by Cole Hoffer: use human follow-up behavior as ground truth, and score adherence to specific instructions rather than vague relevance.
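A rough sketch of that idea, using a hypothetical turn schema: the user's follow-up becomes the label, and the thing being scored is the specific instruction they gave.

```ts
// Hypothetical shape of a logged conversation turn.
type Turn = {
  userMessage: string;   // the specific instruction we score adherence to
  agentResponse: string;
  userFollowUp?: string; // what the human did next
};

// The follow-up is the ground truth: a correction means the agent missed the instruction.
function labelFromFollowUp(turn: Turn): "accepted" | "corrected" {
  if (!turn.userFollowUp) return "accepted";
  const correction = /\b(no|actually|instead|wrong|not what I)\b/i;
  return correction.test(turn.userFollowUp) ? "corrected" : "accepted";
}
```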

A trace is step-by-step visibility into an agent's execution — reasoning, tool calls, sub-agent runs, and latency, captured as a nested timeline.
Our chat agent writes traces to Braintrust during every turn. It's how we debug why a response went wrong.
Evals read those same traces post-hoc — no re-running the agent, just scoring what already happened.
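Logging a turn looks roughly like this, assuming Braintrust's TypeScript SDK (initLogger and traced); orchestrate is a stand-in for our real agent loop:

```ts
import { initLogger, traced } from "braintrust";

// One logger per deployment; every turn logs to the same project.
initLogger({ projectName: "chat-agent" });

// Stand-in for the real agent loop; tool calls and sub-agents show up as nested spans.
async function orchestrate(userMessage: string): Promise<string> {
  return `reply to: ${userMessage}`;
}

export async function handleTurn(userMessage: string): Promise<string> {
  return traced(
    async (span) => {
      const reply = await orchestrate(userMessage);
      span.log({ input: userMessage, output: reply });
      return reply;
    },
    { name: "chat-turn" }
  );
}
```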
Each span runs through a flow of evaluation steps, some LLM-as-judge, some deterministic. Together they let us assess how well the agent did, grounded in the user's initial ask and the logs captured in the trace.
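Concretely, one deterministic check and one LLM-as-judge check might look like this. judgeModel is an assumed helper around whatever grader model you call, not a Braintrust API:

```ts
// Deterministic: did the turn include a call to the coding sub-agent?
function calledCodingAgent(spanNames: string[]): number {
  return spanNames.includes("coding-agent") ? 1 : 0;
}

// Assumed helper: send a prompt to a grader model, get text back.
declare function judgeModel(prompt: string): Promise<string>;

// LLM-as-judge: grounded in the user's initial ask and the output captured in the trace.
async function followedInstruction(ask: string, output: string): Promise<number> {
  const verdict = await judgeModel(
    `User asked: ${ask}\nAgent produced: ${output}\nDid the agent follow the specific instruction? Answer yes or no.`
  );
  return /\byes\b/i.test(verdict) ? 1 : 0;
}
```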
This example shows a customer requesting a prototype. The orchestrator agent calls the coding agent, which handles building the prototype and replies:
"Here is your prototype."

"Updated — quiz mode with multiple-choice prompts."
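The trace for a turn like that is a nested timeline, roughly this shape (names and values are illustrative):

```ts
// Roughly what the trace for that turn looks like, with illustrative values.
const exampleTrace = {
  name: "chat-turn",
  input: "Build me a prototype of a quiz app",
  children: [
    {
      name: "orchestrator",
      children: [
        { name: "coding-agent", output: "Here is your prototype.", durationMs: 42000 },
      ],
    },
  ],
};
```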
With evals in place, we head to Braintrust to see how the agent is performing. By default, though, the results just pile up there: graphs on graphs on graphs. The team watches the dashboard the first week after launch, then forgets it exists.

That's a lot of signal going nowhere. We want evals to nudge action, not wait for someone to go look.
Our existing async-jobs vendor has been unreliable. We trialed Workflows for this project and it hit the mark — the eval pipeline ran cleanly throughout.
Evals have to keep up with the product. The step primitive stays out of the way — you write a regular async function, wrap it in step(), and it's durable. Fast to ship, easy to undo.
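A minimal sketch of the eval pipeline in that style. The step() signature and the helper functions are assumptions based on the description above, not code lifted from the SDK or our repo:

```ts
// Provided by the Workflows SDK; signature assumed from "wrap it in step()".
declare function step<T extends (...args: any[]) => Promise<any>>(fn: T): T;

// Regular async functions, nothing workflow-specific inside them.
async function fetchSpanNames(traceId: string): Promise<string[]> {
  return []; // e.g. read the turn's spans back out of Braintrust
}

async function scoreSpans(spanNames: string[]): Promise<number> {
  return spanNames.includes("coding-agent") ? 1 : 0;
}

// Wrapped in step(), each call becomes a durable unit the workflow can retry.
export async function evalPipeline(traceId: string): Promise<number> {
  const spans = await step(fetchSpanNames)(traceId);
  return step(scoreSpans)(spans);
}
```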
We're a Vercel shop. Keeping ops under one roof means simpler observability and less to maintain.
It's seamless with the rest of our Vercel tooling. Local development in particular felt great; that was a hard requirement when we were shopping for a new async-jobs vendor.