Evals

Running an Eval

Run Evals to compare model and prompt variations across test data, then choose the best-performing configuration for your Flow.

If you are new to Evals, start with What are Evals? for the conceptual overview.

Start an Eval

  1. Open the Flow you want to evaluate.

  2. Click Run.

  3. Select the Eval tab.

  4. Choose an execution mode, set up your configurations, and click Run Eval.

Choose an execution mode

Evals support two execution modes depending on how you want to test.

Realtime

Realtime runs your Eval immediately with streaming results. Use it for quick comparisons when you want to enter test input directly and watch outputs as they stream.

  • Enter messages for chat-based Flows or variables for other Flows.

  • Review results in real time so you can compare outputs immediately.

Batch

Batch queues your Eval to run against all Records of a selected type. Use it when you want to test at scale with data you have already collected.

  • Select a Record type from your existing Records.

  • The Eval runs your Flow once per Record for each configuration (see the sketch below).

  • Batch Evals run in the background, so you can leave the page and come back later.

In test mode, Batch Evals are limited to the first 10 Records so you can validate your setup before running a full comparison. If you need to prepare test data first, see Creating and managing records.
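
Conceptually, a Batch Eval fans out into one Flow run per Record for each configuration. The sketch below illustrates that fan-out only; it is not the Runtype API, and every name in it (runFlow, EvalRecord, batchEval) is hypothetical.

```typescript
// Illustrative sketch only; these types and functions are hypothetical,
// not part of any public Runtype API.
interface EvalRecord { id: string; input: string }
interface Configuration { name: string; model: string; temperature: number }

// Stand-in for a single Flow execution.
async function runFlow(config: Configuration, record: EvalRecord): Promise<string> {
  return `output from ${config.model} for ${record.id}`;
}

// A Batch Eval expands into configurations.length * records.length runs.
async function batchEval(configurations: Configuration[], records: EvalRecord[]) {
  const results: { config: string; record: string; output: string }[] = [];
  for (const config of configurations) {
    for (const record of records) {
      const output = await runFlow(config, record);
      results.push({ config: config.name, record: record.id, output });
    }
  }
  return results; // one result per (configuration, Record) pair
}
```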

Set up configurations

Configurations are the variations you want to compare. Each configuration can override settings on any prompt step in your Flow. Runtype labels them with letter badges such as A, B, and C so you can compare results more easily.

Give each configuration a clear name such as Baseline, Budget Option, or Creative Mode.

What you can override

  • Model — Compare model choices such as claude-sonnet-4-5, gpt-5-mini, or gemini-3-flash.

  • Temperature — Test more deterministic or more creative outputs.

  • Max tokens — Control response length.

  • Response format — Switch between JSON, Markdown, XML, or HTML output.

  • Reasoning — Enable or adjust extended thinking for supported models.

  • Tools — Add, remove, or change which tools are available to the step.
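
Put together, a configuration is essentially a named bundle of per-step overrides. Here is a minimal sketch of that shape, assuming a structure like the list above; every field name is illustrative, not Runtype's actual schema.

```typescript
// Hypothetical shape of an Eval configuration; field names are illustrative.
interface PromptStepOverride {
  model?: string;          // e.g. "claude-sonnet-4-5"
  temperature?: number;    // lower = more deterministic
  maxTokens?: number;      // caps response length
  responseFormat?: "json" | "markdown" | "xml" | "html";
  reasoning?: boolean;     // extended thinking, where the model supports it
  tools?: string[];        // tool names available to the step
}

interface EvalConfiguration {
  name: string;                                        // e.g. "Baseline"
  overrides: { [stepId: string]: PromptStepOverride }; // keyed by prompt step
}

const baseline: EvalConfiguration = {
  name: "A - Baseline",
  overrides: { "draft-reply": { model: "claude-sonnet-4-5", temperature: 0.7 } },
};
```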

Example setup

| Configuration | Model             | Temperature |
| ------------- | ----------------- | ----------- |
| A - Baseline  | claude-sonnet-4-5 | 0.7         |
| B - Budget    | gpt-5-mini        | 0.7         |
| C - Premium   | claude-opus-4-5   | 0.3         |

This setup runs your Flow once per configuration for each test case so you can compare the tradeoffs directly. In a Batch Eval over 50 Records, for example, these three configurations produce 150 runs.

Understand your results

Realtime results

After a Realtime Eval completes, you will see a step-by-step comparison table with each configuration's output, model, duration, and cost.

Batch results

Batch results are available on the Evals page. Evals in the same group are linked so you can compare them together.

Click Compare to open the comparison view, which includes:

  • Winner cards — Highlight which configuration had the highest success rate, lowest cost, and fastest execution (see the sketch after this list).

  • Metrics table — Sortable by success rate, average duration, total cost, token usage, and step counts.

  • Record-level drill-down — Open individual Records to review step-by-step outputs across configurations.

  • Step analysis — Keyword analysis across step outputs to spot patterns.
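
As a rough mental model, each winner card is a simple aggregation over per-configuration metrics. The sketch below shows that selection logic with a hypothetical metrics shape; it mirrors the comparison view conceptually and is not Runtype's internal data model.

```typescript
// Hypothetical per-configuration metrics; names are illustrative.
interface ConfigMetrics {
  name: string;
  successRate: number;   // fraction of Records that succeeded
  totalCost: number;     // cost across all runs
  avgDurationMs: number; // mean run duration
}

// Pick the "winner" for one card by comparing a single metric.
function winner(
  metrics: ConfigMetrics[],
  better: (a: ConfigMetrics, b: ConfigMetrics) => boolean,
): ConfigMetrics {
  return metrics.reduce((best, m) => (better(m, best) ? m : best));
}

const winnerCards = (metrics: ConfigMetrics[]) => ({
  highestSuccess: winner(metrics, (a, b) => a.successRate > b.successRate),
  lowestCost: winner(metrics, (a, b) => a.totalCost < b.totalCost),
  fastest: winner(metrics, (a, b) => a.avgDurationMs < b.avgDurationMs),
});
```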

Tips for getting the most out of Evals

  • Start with Realtime — Use a quick Realtime Eval to validate your configurations before you run a full Batch Eval.

  • Include a baseline — Add your current production configuration so you can measure improvement.

  • Use representative data — For Batch Evals, include common cases and edge cases in your Records.

  • Name configurations clearly — Clear labels make results easier to scan later.

  • Compare one variable at a time — If you change both the model and the temperature, you cannot tell which change drove the result.

Eval limits

Each plan includes a daily Eval limit to help manage usage. If you run Evals often, check your plan limits in Settings. Batch Evals count as one Eval per configuration submitted, regardless of how many Records are processed. For example, a Batch Eval with three configurations counts as three Evals, whether it covers 10 Records or 1,000.
