Reliability and Consistency Benchmark

Understand the reliability and consistency of Thunk.AI on core AI agent productivity tasks

Written by Praveen Seshadri
Updated this week

Most business workflows (invoice processing, procurement, billing, recruiting, customer service, etc.) require that a predictable high-level process be followed, with subjective judgment applied to the unique specifics of each instance. When AI agents are used to automate such business processes, reliable and consistent AI agent behavior is essential.

We have defined an AI agent task reliability "benchmark" (an agentic workload representative of the kind of work an AI agent platform would have to automate in a business workflow) and have measured the behavior of the Thunk.AI platform against it.

The Thunk.AI Task Reliability Benchmark measures the reliability of the Thunk.AI agent platform as it follows specific task instructions.
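As an illustration of what "reliability" and "consistency" mean at the task level, the sketch below repeats a single task many times and computes two simple rates: how often the agent produces the expected result, and how often its runs agree with each other. This is a generic, hypothetical example (the run_agent_task function and the expected-output comparison are placeholders), not the actual Thunk.AI benchmark harness or its metric definitions.

# Illustrative sketch only; not Thunk.AI's published methodology.
# run_agent_task is a hypothetical placeholder for one agent execution.
from collections import Counter

def run_agent_task(instructions: str) -> str:
    """Placeholder: execute one agent task and return its result as text."""
    raise NotImplementedError("Replace with a call to your agent platform.")

def measure_task(instructions: str, expected: str, trials: int = 20):
    results = [run_agent_task(instructions) for _ in range(trials)]
    # Reliability: fraction of runs that produce the expected result.
    reliability = sum(r == expected for r in results) / trials
    # Consistency: fraction of runs that agree with the most common result,
    # whether or not that result is the expected one.
    consistency = Counter(results).most_common(1)[0][1] / trials
    return reliability, consistency

Note that a task can be highly consistent (always the same answer) yet unreliable (consistently wrong), which is why both properties matter when evaluating agent behavior.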

Reliability and consistency are key aspects of the broader AI governance mechanisms designed into the Thunk.AI platform. The purpose of this benchmark is three-fold:

  1. transparency for customers of the Thunk.AI platform

  2. iteration and improvement of the Thunk.AI platform

  3. discussion and learning across the broader ecosystem of AI agent platforms


It is difficult to construct a generalized AI agent platform benchmark:

  • AI agents are a new concept

  • The scenarios and workloads are evolving

  • The work an agent performs is usually complex

  • The end-to-end reliability of outcomes depends on a complex interaction of many tasks and systems

  • Every AI agent platform differs in its abstractions and capabilities, making true comparisons across platforms difficult

For all of these reasons, the work described here does not meet the standard of an academic benchmark, nor is it intended to. Our methodology for selecting the workload is based on the use cases we have seen in our customer interactions, with no further scientific justification. We have not yet published the data sets and detailed test descriptions; this is purely a matter of time, and we hope to do so soon.

Enterprise customers evaluating the adoption of an AI agent platform naturally have concerns about AI reliability. A task-level benchmark like the one described here is essential in any such evaluation. While it is informative (if basic tasks are not reliably executed, their combination into an end-to-end workflow is unlikely to be reliable and consistent), it is not sufficient. Prototyping the end-to-end workflow and observing actual end-to-end reliability and consistency is still important.

With full acknowledgment of these caveats, here is the Thunk.AI Task Reliability Benchmark.
