Most business workflows (invoice processing, procurement, billing, recruiting, customer service, etc.) mandate that a predictable high-level process be followed, with subjective judgment applied to handle the specifics of each individual instance. When AI agents are used to automate such business processes, reliable and consistent AI agent behavior is essential.
We have defined an AI agent task reliability "benchmark" (an agentic workload representative of the kind of work an AI agent platform would have to automate in a business workflow) and have measured the behavior of the Thunk.AI platform against it.
The Thunk.AI Task Reliability Benchmark measures the reliability of the Thunk.AI agent platform as it follows specific task instructions.
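To give a concrete sense of what "task-level reliability" can mean in practice, here is a minimal sketch of one common way to quantify it: run the same task instruction repeatedly and report the fraction of runs that pass a fixed check. The function names and the simulated agent below are hypothetical illustrations, not the Thunk.AI API or its actual benchmark methodology.

```python
# Hypothetical sketch: score task-level reliability as the fraction of repeated
# runs that pass a deterministic check. The agent call below is simulated;
# none of these names are Thunk.AI APIs.

import random
from typing import Callable

def task_reliability(run_task: Callable[[], object],
                     check: Callable[[object], bool],
                     trials: int = 20) -> float:
    """Fraction of trials whose output passes the check."""
    passes = sum(1 for _ in range(trials) if check(run_task()))
    return passes / trials

# Simulated flaky task: returns the correct invoice total 90% of the time.
def simulated_agent_run() -> float:
    return 1234.56 if random.random() < 0.9 else 0.0

score = task_reliability(simulated_agent_run, check=lambda total: total == 1234.56)
print(f"Observed task reliability over {20} runs: {score:.2f}")
```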
Reliability and consistency are key aspects of the broader AI governance mechanisms designed into the Thunk.AI platform. The purpose of this benchmark is three-fold:
transparency for customers of the Thunk.AI platform
iteration and improvement of the Thunk.AI platform
discussion and learning across the broader ecosystem of AI agent platforms
It is difficult to construct a generalized AI agent platform benchmark, for several reasons:
AI agents are a new concept
The scenarios and workloads are evolving
The work an agent performs is usually complex
The end-to-end reliability of outcomes depends on a complex interaction of many tasks and systems
Every AI agent platform differs in its abstractions and capabilities, making true comparisons across platforms difficult
For all of these reasons, the work described here does not meet the standard of an academic benchmark, nor is it intended to. Our methodology for selecting the workload is based on the use cases we have seen in our customer interactions, with no further scientific justification. We have not (yet) published the data sets and test descriptions in detail; this is purely a matter of time constraints, and we hope to do so soon.
Enterprise customers evaluating the adoption of an AI agent platform naturally have concerns about AI reliability. A task-level benchmark like the one described here is essential in any such evaluation: if basic tasks are not executed reliably, their combination into an end-to-end workflow is unlikely to be reliable and consistent. It is informative, but not sufficient; prototyping the end-to-end workflow and observing its actual reliability and consistency is still important.
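To see why task-level reliability matters so much for end-to-end outcomes, consider a small illustrative calculation (a sketch assuming each step succeeds independently with the same probability, which real workflows rarely satisfy): per-task reliability compounds multiplicatively across the steps of a workflow.

```python
# Illustrative only: how per-task reliability compounds across a workflow.
# Assumes each task succeeds independently with the same probability,
# which is a simplification; real agent tasks are rarely independent.

def workflow_success_rate(per_task_rate: float, num_tasks: int) -> float:
    """Probability that every task in an n-step workflow succeeds."""
    return per_task_rate ** num_tasks

for per_task_rate in (0.90, 0.95, 0.99):
    for num_tasks in (5, 10, 20):
        rate = workflow_success_rate(per_task_rate, num_tasks)
        print(f"per-task {per_task_rate:.2f}, {num_tasks:2d} tasks "
              f"-> workflow {rate:.2f}")
```

Under this simplifying assumption, even 95% per-task reliability yields only about a 60% completion rate for a ten-step workflow, which is why task-level consistency is a prerequisite for, but not a guarantee of, end-to-end reliability.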
With full acknowledgment of these caveats, here is the Thunk.AI Task Reliability Benchmark.