Evals with agency

Powered by Workflows

A walkthrough of how Reforge Build evaluates the performance of chat-based agents using Workflows.


About me

Harley Siezar, AI Engineer

I work on Reforge Build, a prototyping tool.

First, what are evals?

An evaluation (“eval”) is a test for an AI system: give an AI an input, then apply grading logic to its output to measure success.
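
In code, that can be as small as a scorer function. The sketch below is illustrative: `runAgent` stands in for whatever AI system is under test, and the string check stands in for real grading logic, which could just as well be an LLM-as-judge call.

```typescript
// A minimal eval: one input, the system's output, and grading logic on top.
type EvalCase = { input: string; expected: string };

async function runEval(
  testCase: EvalCase,
  // Stand-in for the AI system under test.
  runAgent: (input: string) => Promise<string>
): Promise<number> {
  const output = await runAgent(testCase.input);
  // Grading logic: a trivial deterministic check here; in practice this
  // could be an LLM-as-judge or any scorer returning a value in [0, 1].
  return output.toLowerCase().includes(testCase.expected.toLowerCase()) ? 1 : 0;
}
```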

Generating evals

To generate evals that measure something meaningful, we lean on this post by Cole Hoffer: use human follow-up behavior as ground truth, and score against specific instruction adherence instead of vague relevance.

Evaluating LLM Chat Agents with Real World Signals

Traces are what we evaluate

A trace is step-by-step visibility into an agent's execution — reasoning, tool calls, sub-agent runs, and latency, captured as a nested timeline.

Our chat agent writes traces to Braintrust during every turn. It's how we debug why a response went wrong.

Evals read those same traces post-hoc — no re-running the agent, just scoring what already happened.
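
As a rough sketch of what that looks like, assuming a simplified span shape and two hypothetical helpers (a trace fetcher wrapping the trace store and a per-span scorer), not the Braintrust SDK's actual calls:

```typescript
// Post-hoc evaluation: load an already-recorded trace and score each span.
// Nothing is re-run; the spans already hold inputs, outputs, and tool calls.
interface Span {
  id: string;
  name: string;      // e.g. "discovery_tool", "plan_tool", "coding_subagent"
  input: unknown;
  output: unknown;
}

async function evalTrace(
  traceId: string,
  fetchTrace: (id: string) => Promise<Span[]>,  // hypothetical wrapper over the trace store
  scoreSpan: (span: Span) => Promise<number>    // LLM-as-judge or deterministic scorer
): Promise<Record<string, number>> {
  const spans = await fetchTrace(traceId);
  const scores: Record<string, number> = {};
  for (const span of spans) {
    scores[span.id] = await scoreSpan(span); // one score in [0, 1] per span
  }
  return scores;
}
```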

What a trace looks like

conversation trace
make a saas landing page
Span 1 · Discovery Tool
discovery_tool
What style are you after?
Bold · Minimal · Playful · Corporate
Span 2 · Plan Tool
plan_tool
  1. Hero with headline + CTA
  2. Feature grid (3 columns)
  3. Testimonials
  4. Pricing table
  5. Footer
build it
Span 3 · Coding Subagent

Here's your landing page.

Score each span, aggregate to health

conversation trace
Span 1 · Discovery Tool
turnEvalWorkflow()
Score: 0.89
Span 2 · Plan Tool
turnEvalWorkflow()
Score: 0.74
Span 3 · Coding Subagent
turnEvalWorkflow()
Score: 0.86
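
One way to read the roll-up, sketched with a plain average. The equal weighting is an illustrative choice, not necessarily how the production score is weighted.

```typescript
// Roll per-span scores up into a single conversation health score.
function conversationHealth(spanScores: number[]): number {
  if (spanScores.length === 0) return 0;
  const sum = spanScores.reduce((acc, score) => acc + score, 0);
  return sum / spanScores.length;
}

// The trace above: discovery 0.89, plan 0.74, coding 0.86 -> ~0.83
console.log(conversationHealth([0.89, 0.74, 0.86]).toFixed(2));
```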

Eval rubric Workflow

Trigger → Classify → Score plan → Score align → Roll up → Wait → Rescore → Trigger → Classify → Score plan … (the loop restarts when a follow-up message arrives)
Span by span

Each span runs through a flow of evaluation steps, some LLM-as-judge, some deterministic. Together they let us estimate how well the agent did, grounded in the user's initial ask and the logs captured in the trace.
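
A minimal sketch of what turnEvalWorkflow can look like per span, assuming a step() wrapper like the one described on the Velocity slide. The step names follow the diagram above; judgeWithLLM and the local step stand-in are hypothetical, not calls from the Workflows or Braintrust SDKs.

```typescript
// Stand-in for the Workflows step() primitive: in the real SDK, wrapping an
// async function in step() makes it durable; here it simply runs it.
const step = <T>(fn: () => Promise<T>): Promise<T> => fn();

// Hypothetical LLM-as-judge helper; in practice this sends a rubric prompt
// to a model and parses a score in [0, 1] from its answer.
async function judgeWithLLM(prompt: string): Promise<number> {
  return 0.8; // placeholder score
}

interface SpanInput {
  kind: "discovery" | "plan" | "coding";
  userAsk: string; // the user's initial ask, used to ground the judges
  output: string;  // what the agent produced in this span
}

export async function turnEvalWorkflow(span: SpanInput): Promise<number> {
  // Classify: deterministic, picks which rubric applies to this span.
  const rubric = await step(async () =>
    span.kind === "coding"
      ? "Does the built prototype match the plan and the ask?"
      : "Does this step move the user's ask forward?"
  );

  // Score plan: LLM-as-judge against the user's initial ask.
  const planScore = await step(() =>
    judgeWithLLM(`${rubric}\nAsk: ${span.userAsk}\nOutput: ${span.output}`)
  );

  // Score align: deterministic check, e.g. the span produced non-empty output.
  const alignScore = await step(async () => (span.output.trim().length > 0 ? 1 : 0));

  // Roll up: combine the sub-scores into the span's score.
  return step(async () => (planScore + alignScore) / 2);
}
```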

Why pause the Workflow

This example shows a customer requesting a prototype. The orchestrator agent calls the coding subagent that builds the prototype. The Workflow then pauses until the customer's next message arrives, because that follow-up is the ground-truth signal we rescore against.

Build me a flashcard app for studying Spanish.

Here is your prototype.

Now let's add a quiz mode.

Updated — quiz mode with multiple-choice prompts.
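
In Workflow terms, this is why the pause matters: after scoring a turn, the run waits for the customer's next message and rescores with that follow-up as ground truth. The sketch below assumes a durable wait primitive; waitForFollowUp and rescore are hypothetical and passed in rather than taken from the SDK.

```typescript
// Score a turn, pause until the user's follow-up arrives, then rescore.
interface TurnResult {
  initialScore: number;
  rescoredWithFollowUp: number;
}

async function scoreTurnWithFollowUp(
  scoreTurn: () => Promise<number>,
  waitForFollowUp: () => Promise<string>,        // resolves on the next user message
  rescore: (followUp: string) => Promise<number> // re-grades the turn given the follow-up
): Promise<TurnResult> {
  const initialScore = await scoreTurn();

  // Pause here. The gap between messages can be seconds or days; a durable
  // workflow can wait it out without holding a process open.
  const followUp = await waitForFollowUp();

  // "Now let's add a quiz mode" reads as a satisfied continuation;
  // "that's not what I asked for" would read as a miss.
  const rescoredWithFollowUp = await rescore(followUp);
  return { initialScore, rescoredWithFollowUp };
}
```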

Cool, we have evals. What do they do?

By default, eval results pile up in Braintrust — graphs on graphs on graphs. The team watches the dashboard the first week after launch, then forgets it exists.

That's a lot of signal going nowhere. We want evals to nudge action — not wait for someone to go look.

Actions based on evals

With evals in place, we head to Braintrust to see how the agent is performing.

evals → Synthetic Users · alert · Linear Agent · GitHub · HILT
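
As a sketch of what "nudge action" can mean, assuming hypothetical notify and createIssue helpers in place of the real alerting, Linear, or GitHub integrations:

```typescript
// Route a low eval score to an action instead of leaving it on a dashboard.
interface EvalResult {
  traceId: string;
  health: number; // conversation health in [0, 1]
}

async function routeEvalResult(
  result: EvalResult,
  notify: (message: string) => Promise<void>,                  // e.g. an alert channel
  createIssue: (title: string, body: string) => Promise<void>, // e.g. Linear or GitHub
  threshold = 0.7
): Promise<void> {
  if (result.health >= threshold) return; // healthy conversations stay quiet

  await notify(`Eval health ${result.health.toFixed(2)} on trace ${result.traceId}`);
  await createIssue(
    `Low eval score on trace ${result.traceId}`,
    `Conversation health ${result.health.toFixed(2)} fell below ${threshold}.`
  );
}
```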

See it in motion

By default, eval results pile up in dashboards — endless, unwatched.

Why Vercel Workflows

Reliability

Our existing async-jobs vendor has been unreliable. We trialed Workflows for this project and it hit the mark — the eval pipeline ran cleanly throughout.

Velocity

Evals have to keep up with the product. The step primitive stays out of the way — you write a regular async function, wrap it in step(), and it's durable. Fast to ship, easy to undo.
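
As a rough before-and-after, hedged since the exact import and signature come from the Workflows SDK: the scorer stays an ordinary async function, and durability is added (or removed) by wrapping the call in step().

```typescript
// An ordinary async function: nothing workflow-specific inside it.
async function scorePlanAdherence(plan: string, userAsk: string): Promise<number> {
  return plan.toLowerCase().includes(userAsk.toLowerCase()) ? 1 : 0.5;
}

// Loosely-typed stand-in for step(); the real one comes from the Workflows SDK.
const step = <T>(fn: () => Promise<T>): Promise<T> => fn();

export async function evalPlanTurn(plan: string, userAsk: string): Promise<number> {
  // Wrapping the call is the only change needed to make it a durable step,
  // and unwrapping it is just as small a change to undo.
  return step(() => scorePlanAdherence(plan, userAsk));
}
```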

One vendor

We're a Vercel shop. Keeping ops under one roof means simpler observability and less to maintain.

Dev experience

Seamless with the rest of our Vercel tooling. Local development in particular felt great — that was a hard requirement for the team when shopping for a new async-jobs vendor.

h5.codes/eval-wf