Evals / Automated Tests

Define a test suite that checks your thunk's outputs against expected answers, so you have confidence it produces correct results while iterating or in production.

Automated Tests is for thunk designers who want to verify that a thunk produces the expected outputs, both while they are building and iterating and after the thunk is in production.

You define a test suite — a collection of test sets that together cover all the key scenarios your thunk should handle. The feature runs each test set and checks the thunk's outputs against your expected answers. It works best for properties with stable values that don't change over time, for example:

  • "Slide number should be 32"

  • "Amount must not exceed 10,000"

  • "Email should be X"

It is not the right fit for properties like schedules, conversations, or external dynamic webpages whose expected values change frequently, as this would cause test sets to go stale.

When should you use it?

  • After you have a stable thunk design — run your test suite to establish a baseline and confirm your thunk is producing the correct, expected outputs.

  • While iterating on a specific row — if you've modified instructions for one row, run the test suite to verify nothing regressed elsewhere across the test sets.

  • In production — use it as a periodic health check to confirm your thunk continues to produce the correct, expected outputs across the test suite.

How does it work?

A test set is defined at the individual row level: it consists of an input row to be processed and a set of assertions that must hold true for that input. The assertions are specific to that row, not generic rules applied to every row. A collection of test sets across rows forms a test suite that covers all the scenarios your thunk is expected to handle.
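As a rough mental model, the relationship between rows, test sets, and the test suite can be sketched in Python. The class and field names below (TestSet, TestSuite, row_id, input_row, assertions) and the sample rows are illustrative assumptions, not the platform's actual schema; in the product, assertions are entered as plain-language statements.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TestSet:
    """One input row plus the assertions that must hold for that row only."""
    row_id: str            # hypothetical identifier for the row
    input_row: dict        # the row the thunk will process (made-up content)
    assertions: list[str]  # plain-language statements about the output


@dataclass
class TestSuite:
    """A collection of test sets covering the scenarios the thunk should handle."""
    test_sets: list[TestSet] = field(default_factory=list)

    def assertion_count(self) -> int:
        # Total number of assertions across all rows in the suite.
        return sum(len(ts.assertions) for ts in self.test_sets)


suite = TestSuite([
    TestSet("row-1", {"deck": "example-a"}, ["Slide number should be 32"]),
    TestSet("row-2", {"deck": "example-b"},
            ["Slide number must not be empty", "Journey JSON must not be empty"]),
])
print(suite.assertion_count())  # 3
```

Keeping assertions attached to a single row, rather than to the suite as a whole, mirrors the point that each assertion describes the expected output for that specific input.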

Assertions can range from simple field checks to complex structural validations, for example:

  • Slide number should be 32; the Slide number and Journey JSON fields must not be empty

  • Journey JSON should satisfy: for Europe/France, under the brand Boostrix, managed by Nikita Karnani with 3 UJs, there were 4 emails distributed across the year, with activity in July (1), August (1), and September (2).
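To make that range concrete, the two examples above can be written out as deterministic checks. This only illustrates what each assertion pins down: the platform evaluates plain-language assertions with an LLM rather than with code, and the output field names and Journey JSON layout used here (slide_number, journey_json, uj_count, emails_by_month, and so on) are assumptions made for the sketch.

```python
import json

# Hypothetical thunk output for one test row; field names and JSON layout are assumed.
output = {
    "slide_number": 32,
    "journey_json": json.dumps({
        "region": "Europe",
        "country": "France",
        "brand": "Boostrix",
        "manager": "Nikita Karnani",
        "uj_count": 3,
        "emails_by_month": {"July": 1, "August": 1, "September": 2},
    }),
}


def check_simple(out: dict) -> bool:
    """Simple field checks: an exact value and a non-empty field."""
    return out.get("slide_number") == 32 and bool(out.get("journey_json"))


def check_journey(out: dict) -> bool:
    """Structural validation of the Journey JSON against the expected scenario."""
    journey = json.loads(out["journey_json"])
    emails = journey.get("emails_by_month", {})
    return (
        journey.get("region") == "Europe"
        and journey.get("country") == "France"
        and journey.get("brand") == "Boostrix"
        and journey.get("manager") == "Nikita Karnani"
        and journey.get("uj_count") == 3
        and emails == {"July": 1, "August": 1, "September": 2}
        and sum(emails.values()) == 4  # 4 emails distributed across the year
    )


print(check_simple(output), check_journey(output))  # True True
```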

When the test suite runs, the platform uses an LLM to evaluate each test row's output properties against the assertions defined for that row. Results are displayed as pass/fail per assertion. A failure indicates one of two things: the thunk's instructions are not producing output at the required fidelity, or the assertion itself needs to be revisited.
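The run described above can be sketched as a loop that produces one pass/fail result per assertion. The judge callable stands in for the platform's LLM-based evaluation, whose interface is not documented here; run_suite, fake_thunk, toy_judge, and the dictionary layout are hypothetical names chosen for this sketch.

```python
from __future__ import annotations

from typing import Callable

# A judge takes (output properties, assertion text) and returns True for a pass.
Judge = Callable[[dict, str], bool]


def run_suite(test_sets: list[dict], thunk: Callable[[dict], dict], judge: Judge) -> list[dict]:
    """Run the thunk on each test row and judge every assertion for that row."""
    results = []
    for ts in test_sets:
        output = thunk(ts["input_row"])
        for assertion in ts["assertions"]:
            results.append({
                "row_id": ts["row_id"],
                "assertion": assertion,
                "passed": judge(output, assertion),
            })
    return results


# Stand-ins so the sketch runs without a model: a fake thunk and a trivial judge.
def fake_thunk(row: dict) -> dict:
    return {"slide_number": row["slide"]}


def toy_judge(output: dict, assertion: str) -> bool:
    # Passes when the assertion text mentions the slide number the thunk produced.
    return str(output["slide_number"]) in assertion


test_sets = [
    {"row_id": "row-1", "input_row": {"slide": 32},
     "assertions": ["Slide number should be 32"]},
    {"row_id": "row-2", "input_row": {"slide": 7},
     "assertions": ["Slide number should be 32"]},
]

results = run_suite(test_sets, fake_thunk, toy_judge)
for r in results:
    print(r["row_id"], "PASS" if r["passed"] else "FAIL", "-", r["assertion"])
```

In this sketch row-1 passes and row-2 fails. As the text notes, a failure by itself does not say which side is wrong: either the thunk's instructions or the assertion may need revisiting.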

Where to find it?

Enable the evaluation mode in your work items.

This lets you view, enter, and run test sets for a selected set of rows that you consider important signals of the thunk's reliability.
