Evals with agency

Powered by Workflows

A walkthrough of how Reforge Build evaluates the performance of chat-based agents using Workflows.


About me

Harley Siezar, AI Engineer

I work on Reforge Build, a prototyping tool.

First, what are evals?

An evaluation (“eval”) is a test for an AI system: give an AI an input, then apply grading logic to its output to measure success.
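
In code, that can be as small as a scorer function. The sketch below is illustrative: `runAgent` stands in for whatever AI system is under test, and the string check stands in for real grading logic, which could just as well be an LLM-as-judge call.

```typescript
// A minimal eval: one input, the system's output, and grading logic on top.
type EvalCase = { input: string; expected: string };

async function runEval(
  testCase: EvalCase,
  // Stand-in for the AI system under test.
  runAgent: (input: string) => Promise<string>
): Promise<number> {
  const output = await runAgent(testCase.input);
  // Grading logic: a trivial deterministic check here; in practice this
  // could be an LLM-as-judge or any scorer returning a value in [0, 1].
  return output.toLowerCase().includes(testCase.expected.toLowerCase()) ? 1 : 0;
}
```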

Generating evals

To generate evals that measure something meaningful, we lean on this post by Cole Hoffer: use human follow-up behavior as ground truth, and score against specific instruction adherence instead of vague relevance.

Evaluating LLM Chat Agents with Real World Signals

Traces are what we evaluate

A trace is step-by-step visibility into an agent's execution — reasoning, tool calls, sub-agent runs, and latency, captured as a nested timeline.

Our chat agent writes traces to Braintrust during every turn. It's how we debug why a response went wrong.

Evals read those same traces post-hoc — no re-running the agent, just scoring what already happened.
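
As a rough sketch of what that looks like, assuming a simplified span shape and two hypothetical helpers (a trace fetcher wrapping the trace store and a per-span scorer), not the Braintrust SDK's actual calls:

```typescript
// Post-hoc evaluation: load an already-recorded trace and score each span.
// Nothing is re-run; the spans already hold inputs, outputs, and tool calls.
interface Span {
  id: string;
  name: string;      // e.g. "discovery_tool", "plan_tool", "coding_subagent"
  input: unknown;
  output: unknown;
}

async function evalTrace(
  traceId: string,
  fetchTrace: (id: string) => Promise<Span[]>,  // hypothetical wrapper over the trace store
  scoreSpan: (span: Span) => Promise<number>    // LLM-as-judge or deterministic scorer
): Promise<Record<string, number>> {
  const spans = await fetchTrace(traceId);
  const scores: Record<string, number> = {};
  for (const span of spans) {
    scores[span.id] = await scoreSpan(span); // one score in [0, 1] per span
  }
  return scores;
}
```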

What a trace looks like

conversation trace
make a saas landing page
Span 1 · Discovery Tool
discovery_tool
What style are you after?
Bold · Minimal · Playful · Corporate
Span 2 · Plan Tool
plan_tool
  1. Hero with headline + CTA
  2. Feature grid (3 columns)
  3. Testimonials
  4. Pricing table
  5. Footer
build it
Span 3 · Coding Subagent

Here's your landing page.

Score each span, aggregate to health

conversation trace
Span 1 · Discovery Tool
turnEvalWorkflow()
Score: 0.89
Span 2 · Plan Tool
turnEvalWorkflow()
Score: 0.74
Span 3 · Coding Subagent
turnEvalWorkflow()
Score: 0.86
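
One way to read the roll-up, sketched with a plain average. The equal weighting is an illustrative choice, not necessarily how the production score is weighted.

```typescript
// Roll per-span scores up into a single conversation health score.
function conversationHealth(spanScores: number[]): number {
  if (spanScores.length === 0) return 0;
  const sum = spanScores.reduce((acc, score) => acc + score, 0);
  return sum / spanScores.length;
}

// The trace above: discovery 0.89, plan 0.74, coding 0.86 -> ~0.83
console.log(conversationHealth([0.89, 0.74, 0.86]).toFixed(2));
```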

Eval rubric Workflow

Trigger → Classify → Score plan → Score align → Roll up → Wait → Rescore → Trigger → Classify → Score plan … (the loop restarts when a follow-up message arrives)
Span by span

Each span runs through a flow of evaluation steps, some LLM-as-judge, some deterministic. Together they let us estimate how well the agent did, grounded in the user's initial ask and the logs captured in the trace.
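
A minimal sketch of what turnEvalWorkflow can look like per span, assuming a step() wrapper like the one described on the Velocity slide. The step names follow the diagram above; judgeWithLLM and the local step stand-in are hypothetical, not calls from the Workflows or Braintrust SDKs.

```typescript
// Stand-in for the Workflows step() primitive: in the real SDK, wrapping an
// async function in step() makes it durable; here it simply runs it.
const step = <T>(fn: () => Promise<T>): Promise<T> => fn();

// Hypothetical LLM-as-judge helper; in practice this sends a rubric prompt
// to a model and parses a score in [0, 1] from its answer.
async function judgeWithLLM(prompt: string): Promise<number> {
  return 0.8; // placeholder score
}

interface SpanInput {
  kind: "discovery" | "plan" | "coding";
  userAsk: string; // the user's initial ask, used to ground the judges
  output: string;  // what the agent produced in this span
}

export async function turnEvalWorkflow(span: SpanInput): Promise<number> {
  // Classify: deterministic, picks which rubric applies to this span.
  const rubric = await step(async () =>
    span.kind === "coding"
      ? "Does the built prototype match the plan and the ask?"
      : "Does this step move the user's ask forward?"
  );

  // Score plan: LLM-as-judge against the user's initial ask.
  const planScore = await step(() =>
    judgeWithLLM(`${rubric}\nAsk: ${span.userAsk}\nOutput: ${span.output}`)
  );

  // Score align: deterministic check, e.g. the span produced non-empty output.
  const alignScore = await step(async () => (span.output.trim().length > 0 ? 1 : 0));

  // Roll up: combine the sub-scores into the span's score.
  return step(async () => (planScore + alignScore) / 2);
}
```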

Why pause the Workflow

This example shows a customer requesting a prototype. The orchestrator agent calls the coding subagent that builds the prototype. The Workflow then pauses until the customer's next message arrives, because that follow-up is the ground-truth signal we rescore against.

Build me a flashcard app for studying Spanish.

Here is your prototype.

Now let's add a quiz mode.

Updated — quiz mode with multiple-choice prompts.
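
In Workflow terms, this is why the pause matters: after scoring a turn, the run waits for the customer's next message and rescores with that follow-up as ground truth. The sketch below assumes a durable wait primitive; waitForFollowUp and rescore are hypothetical and passed in rather than taken from the SDK.

```typescript
// Score a turn, pause until the user's follow-up arrives, then rescore.
interface TurnResult {
  initialScore: number;
  rescoredWithFollowUp: number;
}

async function scoreTurnWithFollowUp(
  scoreTurn: () => Promise<number>,
  waitForFollowUp: () => Promise<string>,        // resolves on the next user message
  rescore: (followUp: string) => Promise<number> // re-grades the turn given the follow-up
): Promise<TurnResult> {
  const initialScore = await scoreTurn();

  // Pause here. The gap between messages can be seconds or days; a durable
  // workflow can wait it out without holding a process open.
  const followUp = await waitForFollowUp();

  // "Now let's add a quiz mode" reads as a satisfied continuation;
  // "that's not what I asked for" would read as a miss.
  const rescoredWithFollowUp = await rescore(followUp);
  return { initialScore, rescoredWithFollowUp };
}
```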

Cool, we have evals. What do they do?

By default, eval results pile up in Braintrust — graphs on graphs on graphs. The team watches the dashboard the first week after launch, then forgets it exists.

That's a lot of signal going nowhere. We want evals to nudge action — not wait for someone to go look.

Actions based on evals

With evals in place, we head to Braintrust to see how the agent is performing.

evals → Synthetic Users · alert · Linear Agent · GitHub · HILT
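
As a sketch of what "nudge action" can mean, assuming hypothetical notify and createIssue helpers in place of the real alerting, Linear, or GitHub integrations:

```typescript
// Route a low eval score to an action instead of leaving it on a dashboard.
interface EvalResult {
  traceId: string;
  health: number; // conversation health in [0, 1]
}

async function routeEvalResult(
  result: EvalResult,
  notify: (message: string) => Promise<void>,                  // e.g. an alert channel
  createIssue: (title: string, body: string) => Promise<void>, // e.g. Linear or GitHub
  threshold = 0.7
): Promise<void> {
  if (result.health >= threshold) return; // healthy conversations stay quiet

  await notify(`Eval health ${result.health.toFixed(2)} on trace ${result.traceId}`);
  await createIssue(
    `Low eval score on trace ${result.traceId}`,
    `Conversation health ${result.health.toFixed(2)} fell below ${threshold}.`
  );
}
```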

See it in motion

By default, eval results pile up in dashboards — endless, unwatched.

Why Vercel Workflows

Reliability

Our existing async-jobs vendor has been unreliable. We trialed Workflows for this project and it hit the mark — the eval pipeline ran cleanly throughout.

Velocity

Evals have to keep up with the product. The step primitive stays out of the way — you write a regular async function, wrap it in step(), and it's durable. Fast to ship, easy to undo.
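
As a rough before-and-after, hedged since the exact import and signature come from the Workflows SDK: the scorer stays an ordinary async function, and durability is added (or removed) by wrapping the call in step().

```typescript
// An ordinary async function: nothing workflow-specific inside it.
async function scorePlanAdherence(plan: string, userAsk: string): Promise<number> {
  return plan.toLowerCase().includes(userAsk.toLowerCase()) ? 1 : 0.5;
}

// Loosely-typed stand-in for step(); the real one comes from the Workflows SDK.
const step = <T>(fn: () => Promise<T>): Promise<T> => fn();

export async function evalPlanTurn(plan: string, userAsk: string): Promise<number> {
  // Wrapping the call is the only change needed to make it a durable step,
  // and unwrapping it is just as small a change to undo.
  return step(() => scorePlanAdherence(plan, userAsk));
}
```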

One vendor

We're a Vercel shop. Keeping ops under one roof means simpler observability and less to maintain.

Dev experience

Seamless with the rest of our Vercel tooling. Local development in particular felt great — that was a hard requirement for the team when shopping for a new async-jobs vendor.

h5.codes/eval-wf