Thunk.AI uses an LLM as its "AI engine". Of course, we know that LLMs are probabilistic and therefore can produce non-deterministic results. All the same, a business workflow needs reliable and consistent results. The responsibility of the AI agent platform is to harness a not-so-reliable but intelligent LLM to achieve reliable results.
In this article, we describe how the Thunk.AI platform achieves reliability and consistency. A measure of this reliability is captured by the Task Reliability Benchmark.
The assumption is that a business workflow has to run repeatedly across environments and inputs that are similar but slightly different. There are three expectations of a reliable workflow in such a context:
It achieves a desired outcome in each instance.
It follows the prescribed process in each instance.
To the extent the environments and inputs vary, it makes sensible adaptations as appropriate for each variation.
There is no silver bullet for AI agent reliability. Instead, a set of core principles needs to be applied across the planning and runtime phases, taking a "defense in depth" approach: relatively few errors are created in the first place, and the errors that are created are detected and corrected.
In each phase of AI agent activity -- planning and execution -- there are four core principles that drive reliability in the design of the Thunk.AI platform:
Thunk.AI Platform Design for Reliability: Planning
Principle of static planning: when work is explicitly planned and a plan (a sequence of steps) is articulated, it provides a process guideline for repeated, consistent execution. More detailed process guidelines lead to more reliable results.
Principle of minimal granularity: the broader the instructions given to the LLM and the broader the context it has to interpret, the more variability there will be in the results. Therefore, if reliability is the goal, it is best to provide the "tightest" (most granular) instructions and context.
Principle of maximal constraints: the broader the possible set of responses from an LLM in a particular context, the greater the variability of those responses. Therefore, if reliability is the goal, it is best to restrict the LLM to the "tightest" (most limited) set of allowed responses.
Principle of minimal capability: LLMs interact with the business environment through "tools" to read or update content. These tools are provided by the AI agent platform. If reliability is the goal, the minimal set of tools should be provided at each stage.
In every thunk, there is an explicit planning phase which produces a human-guided process flow. This reflects the Principle of Static Planning.
The plan itself has many granular components: a sequence of workflow steps and a definition of the schematized state that the workflow should maintain. Every step of work is granular, and the degree of granularity is in the control of the thunk designer. Finer granularity leads to a more tightly specified process; coarser granularity leaves more flexibility to deal with dynamic environments. This reflects the Principle of Minimal Granularity. The actual choice of workflow step granularity depends on the thunk designer and the particular business process.
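To make this concrete, the sketch below shows one way such a plan could be represented. The class and field names are illustrative only, not the internal Thunk.AI representation:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a planned workflow: an ordered sequence of granular
# steps plus a schematized state that the workflow must maintain. The names
# and fields are assumptions for illustration.

@dataclass
class WorkflowStep:
    name: str
    instructions: str            # steering: what the agent should do
    allowed_tools: list[str]     # control: minimal capability for this step
    inputs: dict[str, str]       # property name -> constrained type
    outputs: dict[str, str]

@dataclass
class WorkflowPlan:
    state_schema: dict[str, str] = field(default_factory=dict)
    steps: list[WorkflowStep] = field(default_factory=list)

plan = WorkflowPlan(
    state_schema={
        "invoice_total": "number",
        "approval_status": "enum[pending, approved, rejected]",
    },
    steps=[
        WorkflowStep(
            name="extract_invoice_total",
            instructions="Read the attached invoice and record its total amount.",
            allowed_tools=["read_document", "update_workflow_state"],
            inputs={"invoice_document": "file"},
            outputs={"invoice_total": "number"},
        ),
    ],
)
```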
Each granular step of the workflow includes detailed AI agent instructions. These instructions have two elements -- steering (what the agent should do, including examples, which are particularly useful in guiding an LLM) and control (what it is allowed to do). The element most critical to reliability is control. For example, the input and output data properties are specified, and each of them has a specific constrained type. This reflects the Principle of Maximal Constraints.
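As a rough illustration, the per-step instructions can be pictured as two blocks: steering and control. The structure and field names below are assumptions made for this sketch, not the Thunk.AI configuration format:

```python
# Illustrative shape of per-step agent instructions. "steering" captures what
# the agent should do (with examples); "control" captures what it is allowed
# to produce, with each output property given a constrained type that can be
# checked deterministically.

step_instructions = {
    "steering": {
        "goal": "Classify the incoming support email by urgency.",
        "examples": [
            {"email": "Our production site is down.", "urgency": "high"},
            {"email": "Feature request: add dark mode.", "urgency": "low"},
        ],
    },
    "control": {
        "outputs": {
            "urgency": {"type": "string", "enum": ["low", "medium", "high"]},
            "summary": {"type": "string", "maxLength": 280},
        },
    },
}
```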
The AI agent invokes the LLM in an iterative conversational loop, but only allows it to respond by invoking one of the tools provided. Free text responses (one of the greatest sources of randomization) are not allowed. The set of tools provided to the LLM in every conversational iteration is restricted by the thunk designer, reflecting the Principle of Minimal Capability.
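The loop can be sketched roughly as follows. The `llm.complete()` client and its response shape are hypothetical stand-ins for whatever LLM API is in use; the point is the control flow, which offers only the step's allowed tools and rejects free-text responses:

```python
# Minimal sketch of the constrained conversational loop. `llm` is a
# hypothetical client whose complete() call accepts a tool list and returns
# either a tool call or free text. Free text is rejected and retried rather
# than accepted.

def run_step_iteration(llm, messages, allowed_tools, max_retries=2):
    for _ in range(max_retries + 1):
        response = llm.complete(messages=messages, tools=allowed_tools)
        if response.tool_call is not None:
            return response.tool_call  # tool name + arguments, ready to execute
        messages.append({
            "role": "system",
            "content": "Respond only by invoking one of the provided tools.",
        })
    raise RuntimeError("LLM did not produce a tool call within the retry budget")
```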
Thunk.AI Platform Design for Reliability: Execution
Principle of dynamic planning: most AI agent tasks require multiple iterations and tool invocations. Dynamic micro-planning of AI agent tasks provides a guideline for reliable execution of each task.
Principle of explanation: if LLMs are required to provide reasoning for their responses, it forces alignment with the goals and plan for the task at hand.
Principle of checkpointing: if LLMs are required to update schematized state with their partial progress or results, it improves alignment with the desired outcomes, increasing consistency and reliability.
Principle of verification: if every LLM response is checked for validity, there is an opportunity to correct and refine results. There can be a variety of checks, including deterministic checks (e.g., for schema conformance), checks implemented by LLM calls (e.g., for semantic conformance to constraints), and human-in-the-loop verification.
When the platform executes an AI agent task (one step of a planned workflow), many platform design decisions influence the reliability of the results.
Since every individual task execution involves (a) potentially multiple iterations with the LLM, (b) multiple tool calls, and (c) variable environments and inputs, the Thunk.AI platform always starts with "micro-planning" the task. This reflects the Principle of Dynamic Planning. The dynamic micro-plan is itself constrained by the available tools and by the data bindings specified during the initial planning phase, so it creates a further level of detail for subsequent execution. By explicitly requiring articulation of the micro-plan, the AI agent platform steers subsequent stages of the iteration in a consistent direction.
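A rough sketch of this micro-planning step, using the same hypothetical `llm` client as above: the agent asks for a short ordered plan and rejects any planned action that falls outside the tools and state properties fixed during the planning phase:

```python
# Illustrative micro-planning sketch. llm.plan() is a hypothetical call that
# returns an ordered list of tool names; the validation keeps the micro-plan
# within the constraints established at planning time.

def micro_plan(llm, task_goal, allowed_tools, state_schema):
    prompt = (
        f"Goal: {task_goal}\n"
        f"Available tools: {', '.join(allowed_tools)}\n"
        f"Workflow state properties: {', '.join(state_schema)}\n"
        "List the ordered actions you will take, one tool per action."
    )
    planned_actions = llm.plan(prompt)
    invalid = [action for action in planned_actions if action not in allowed_tools]
    if invalid:
        raise ValueError(f"Micro-plan references unavailable tools: {invalid}")
    return planned_actions
```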
Every response from the LLM is a tool call with arguments and, importantly, an explanation. This reflects the Principle of Explanation. There are three benefits to these explanations. The first is that the explanation increases the alignment of the LLM's immediate response with the desired goal and plan; in effect, the requirement to provide a rational explanation acts as a constraint on the response of the LLM. The second is that the explanation reinforces alignment of subsequent LLM responses with the plan. Finally, the explanations are useful for human validation.
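One simple way to enforce this, shown below as an illustrative assumption rather than the actual implementation, is to wrap every tool schema so that an explanation argument becomes required alongside the tool's own parameters:

```python
# Sketch of the Principle of Explanation: every tool exposed to the LLM is
# wrapped so that an `explanation` argument is required. The JSON-Schema-like
# format and the wrapper itself are illustrative assumptions.

def require_explanation(tool_schema: dict) -> dict:
    params = tool_schema.setdefault(
        "parameters", {"type": "object", "properties": {}, "required": []}
    )
    params.setdefault("properties", {})["explanation"] = {
        "type": "string",
        "description": "Why this tool call advances the current goal and plan.",
    }
    required = params.setdefault("required", [])
    if "explanation" not in required:
        required.append("explanation")
    return tool_schema
```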
The platform steers the AI agent to checkpoint its work and update the workflow state as work progresses. Since the workflow state is schematized and structured, this imposes constraints on the output of the LLM. This reflects the Principle of Checkpointing. Just like the principle of explanation, this increases alignment of the LLM's responses with the desired outcomes.
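Sketched below, with the state schema simplified to plain Python types, is how a hypothetical `update_workflow_state` tool could accept partial progress while committing only values that conform to the schema; anything else is returned to the LLM as an error to correct:

```python
# Illustrative checkpointing tool. Updates that violate the schematized
# workflow state are not committed; the errors are returned so the LLM can
# correct them. The schema is simplified to property name -> Python type.

def update_workflow_state(state: dict, state_schema: dict, updates: dict) -> dict:
    errors = []
    for key, value in updates.items():
        if key not in state_schema:
            errors.append(f"Unknown state property: {key}")
        elif not isinstance(value, state_schema[key]):
            errors.append(f"{key} must be of type {state_schema[key].__name__}")
    if errors:
        return {"ok": False, "errors": errors}  # fed back to the LLM
    state.update(updates)
    return {"ok": True, "state": state}
```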
Finally, every response of the LLM, every tool result, and every workflow step is checked for consistency. This reflects the Principle of Verification. If the verification identifies inconsistencies or inaccuracies, these are fed back to the LLM for correction. There are many kinds of verification. Conformance with schema and structure is the most obvious and deterministic. More subjective verification is also valuable -- for example, whether an LLM response conforms to policy, or whether it satisfies the descriptions of tool arguments or of workflow state properties.
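The sketch below shows two of these layers, again with illustrative names and a schema simplified to Python types: a deterministic schema check followed by an optional LLM-implemented policy check, with failures fed back for correction:

```python
# Illustrative layered verification. check_schema is deterministic;
# llm.check_policy and llm.correct stand in for LLM-implemented semantic
# checks and correction. All names are assumptions for this sketch.

def check_schema(arguments: dict, output_schema: dict) -> list[str]:
    return [
        f"{name} is missing or not of type {expected.__name__}"
        for name, expected in output_schema.items()
        if not isinstance(arguments.get(name), expected)
    ]

def verify_and_correct(llm, response, output_schema, policy=None, max_attempts=3):
    problems = []
    for _ in range(max_attempts):
        problems = check_schema(response.arguments, output_schema)
        if not problems and policy is not None:
            problems = llm.check_policy(response, policy)  # subjective, LLM-implemented
        if not problems:
            return response
        response = llm.correct(response, problems)  # feed issues back for correction
    raise RuntimeError(f"Response failed verification after {max_attempts} attempts: {problems}")
```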
A measure of the Thunk.AI platform reliability for this task-level granular agent activity is captured by the Task Reliability Benchmark.
The platform also offers human-in-the-loop approval as a final category of verification that can optionally be required.
End-to-end workflow reliability
In practice, the end-to-end reliability of AI agent automation depends on a combination of four factors:
The nature of the workflow process -- how specific the process is and how much "intelligent" decision-making is expected from AI agents to handle variability.
How much detail is provided by the thunk designer during the planning phase.
The inherent reliability of the AI agent platform in following the plan, adhering to provided instructions, and controlling the LLM's responses towards the desired outcomes (the primary focus of this article!)
The actual degree of variability of the runtime workload.