
Understanding LLM costs and latency

Learn how the LLMs used by AI agents incur cost and latency


When AI agents are used to automate workflows, they use LLMs as the underlying engine to understand information, process instructions, and make decisions. The relationship between AI agents and LLMs is like the relationship between a software program (the logic of what needs to be done) and a CPU (the processor that executes individual instructions).

Thunk.AI supports the use of many different LLMs hosted by different providers (eg: GPT models hosted by OpenAI, Gemini models hosted by Google, Anthropic models hosted by AWS, etc). The platform uses a BYOL ("bring your own license") approach to enable access to the LLMs: all production customers provide their own API keys to access their preferred LLMs. Further, customers can choose different LLMs for different kinds of work, both across thunks and within each thunk. For example, GPT-4.1 can be chosen to do planning, while GPT-4.1-mini is used for workflow execution.

These choices lead to different tradeoffs in cost, latency, and quality.

Simplified execution model

Consider a simple thunk where planning is already done and it is ready to run workflows. Assume the workflow has three very simple steps and that 10 workitems are loaded for the thunk.

To execute each step on each workitem, an AI agent is invoked. In this case, 3 x 10 = 30 AI agents will be invoked, and each of these agents will use an LLM to do its work. Even a simple action usually requires at least 3 LLM calls (a dynamic planning call, a do-the-work call, and a conclude call). More precisely, each AI agent invokes the LLM at least 3 times, asking it to decide what to do next: its first decision may be to record a dynamic plan of action, then to record a result, and finally to tell the Thunk.AI agent runtime to conclude its work. So roughly 3 x 30 = 90 LLM calls are made.
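
To make the arithmetic concrete, here is a minimal sketch of how call volume scales in this simplified model (the step, workitem, and calls-per-agent counts are just the example values above):

```python
# Rough sketch of LLM call volume for the simplified example above.
steps_per_workflow = 3     # workflow steps in the example thunk
workitems = 10             # workitems loaded for the thunk
llm_calls_per_agent = 3    # dynamic plan, do-the-work, conclude (at minimum)

agent_invocations = steps_per_workflow * workitems            # 3 x 10 = 30
total_llm_calls = agent_invocations * llm_calls_per_agent     # ~90

print(f"{agent_invocations} agent invocations, ~{total_llm_calls} LLM calls")
```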

For each LLM call, there are inputs and there are outputs. The inputs include the instructions that are part of the workflow step, the input data that feeds into the step, and a lot of default information that is necessary for the LLM to know what to do (eg: what AI tools are available, how it should respond so that the Thunk.AI system can correctly interpret the responses, etc.). The outputs are typically the decisions made by the LLM, including results to record into the workflow state, outputs to write to a file, etc. In most thunks, the size of the outputs is much smaller than the size of the inputs.

LLMs measure their inputs and their outputs in the form of "tokens". A token roughly corresponds to a short segment of a word or a punctuation mark. A good rule of thumb for text inputs or outputs is: number of tokens ≈ 1.25 * number of words.
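
For example, a quick back-of-the-envelope estimate based on that rule of thumb (an approximation only; for exact counts you would use the tokenizer for your specific model):

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.25) -> int:
    """Rough token estimate using the ~1.25 tokens-per-word rule of thumb.
    For exact counts, use the tokenizer for your specific model."""
    return round(len(text.split()) * tokens_per_word)

print(estimate_tokens("Summarize the attached customer email in two sentences."))  # 10
```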

LLM costs

The cost of a single LLM call is (#_input_tokens * cost_per_input_token + #_output_tokens * cost_per_output_token).
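
Expressed as a small (hypothetical) helper function, using the GPT-4.1-mini prices from the worked example below:

```python
def llm_call_cost(input_tokens: int, output_tokens: int,
                  cost_per_input_token: float, cost_per_output_token: float) -> float:
    """Cost of a single LLM call, per the formula above."""
    return input_tokens * cost_per_input_token + output_tokens * cost_per_output_token

# Prices are usually quoted per 1M tokens, so divide by 1e6 to get per-token costs.
# Example: 10,000 input and 500 output tokens at $0.40 / $1.60 per 1M tokens.
print(llm_call_cost(10_000, 500, 0.40 / 1e6, 1.60 / 1e6))  # roughly $0.0048, before any caching discount
```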

Each LLM model has a different cost for processing input tokens and output tokens. For example, OpenAI lists its LLM pricing here. These costs change over time and as new models are released.

However, there are some general rules of thumb:

  • Newer LLMs (eg: gpt-5) are generally both cheaper and higher quality than older LLMs (eg: gpt-4).

  • Within the same LLM family (eg: gpt-4.1), the smaller models are cheaper but lower quality (eg: gpt-4.1-mini is 5x cheaper than gpt-4.1).

  • Input tokens are much cheaper than output tokens (usually 5x cheaper).

There are further efficiencies to be gained when many LLM calls are made in rapid succession (eg: a batch of workitems run together). This allows the LLM provider to cache and reuse common prefixes of the input tokens. Cached input tokens are much cheaper than regular input tokens, and the corresponding LLM calls are also faster. The Thunk.AI platform automatically optimizes its LLM invocations to take advantage of these efficiencies.

In the hypothetical simple example above, each LLM call might have 10,000 input tokens (of which 5,000 end up being cached) and 500 output tokens. If GPT-4.1-mini were used as the model, the cost per call would be 5000 * ($0.40/1M) + 5000 * ($0.10/1M) + 500 * ($1.60/1M) = $0.0033. With 90 LLM calls, the overall cost to process the 10 workitems would be about $0.30.

Doing the same work with the GPT-4.1 model would cost about $1.50.
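
As a minimal sketch, the worked example can be reproduced as follows. The GPT-4.1-mini prices come from the calculation above; the GPT-4.1 prices ($2.00 / $0.50 cached / $8.00 per 1M tokens) are assumed list prices consistent with the ~$1.50 total, so check your provider's current price list before relying on them.

```python
# $ per 1M tokens: (uncached input, cached input, output)
PRICES = {
    "gpt-4.1-mini": (0.40, 0.10, 1.60),
    "gpt-4.1":      (2.00, 0.50, 8.00),  # assumed; verify against current pricing
}

def call_cost(model: str, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call when part of the input hits the prompt cache."""
    uncached_price, cached_price, output_price = (p / 1_000_000 for p in PRICES[model])
    uncached_tokens = input_tokens - cached_tokens
    return (uncached_tokens * uncached_price
            + cached_tokens * cached_price
            + output_tokens * output_price)

for model in PRICES:
    per_call = call_cost(model, input_tokens=10_000, cached_tokens=5_000, output_tokens=500)
    print(f"{model}: ${per_call:.4f} per call, ~${per_call * 90:.2f} for 90 calls")
# gpt-4.1-mini: roughly $0.0033 per call, ~$0.30 for 90 calls
# gpt-4.1:      roughly $0.0165 per call, ~$1.50 for 90 calls
```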

LLM costs can add up when instructions are complex, the workflow is long, the data inputs or tool-call results are large, there are many workitems to process, or more expensive models are in use. It is not uncommon for a single workitem to incur a cost of several dollars. However, if it automates many hours of tedious human work, that cost is well worthwhile.

LLM latency

LLMs have delay (latency) in responding to requests. In an interactive system like ChatGPT, some of this delay can be hidden by streaming out results one word at a time to engage the user. However, in an automated system like a Thunk.AI workflow, the entire result is needed before it can be validated and processed.

The latency of an LLM response varies depending on a variety of factors, but there are some general rules of thumb:

  • Latency is affected mostly by the number of output tokens, not by the number of input tokens (a rough illustration follows this list).

  • Smaller models can be 10x faster than the larger and more powerful models.
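
As a rough illustration of why output length dominates, here is a minimal latency model; the time-to-first-token and tokens-per-second figures below are purely hypothetical, since real throughput varies widely by model, provider, and load:

```python
def estimate_latency_seconds(output_tokens: int,
                             time_to_first_token_s: float,
                             output_tokens_per_second: float) -> float:
    """Crude latency model: a startup delay plus sequential output-token generation.
    Input size has a comparatively small effect, which is why output tokens dominate."""
    return time_to_first_token_s + output_tokens / output_tokens_per_second

# Hypothetical throughput figures, purely for illustration:
small_model = estimate_latency_seconds(500, time_to_first_token_s=0.5, output_tokens_per_second=150)
large_model = estimate_latency_seconds(500, time_to_first_token_s=1.0, output_tokens_per_second=30)
print(f"smaller model: ~{small_model:.0f}s, larger model: ~{large_model:.0f}s")
```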

How to see the cost and latency of your LLM calls

When you run any workitem through a workflow, you can see the LLM time and cost of each AI agent call by expanding the details of the agent's work.

You can also see the total LLM time and cost across the whole workitem. This information is shown along with the output of the workflow. In the example below, for a simple workflow using GPT-4.1-mini, the workitem took 26 seconds of AI agent processing time and cost 1.7 cents in LLM token costs.

Tradeoffs between cost, latency, and quality

It is important to remember that the smaller models (eg: GPT-4.1-mini) are BOTH faster AND cheaper than the larger models (eg: GPT-4.1). They are not just a bit faster; they are typically 5x faster. However, the larger models have better quality in instruction-following and accuracy. For high-value workflows where reliability matters and the cost of error is high, it may be worth using a more expensive and slower model.

In Thunk.AI, the default model used as of Q4 2025 is GPT-4.1-mini. This model represents a good balance of speed, low cost, and high quality.

However, you can change the LLM model used at the level of your account, at the level of each thunk, and at the level of each step in the workflow. You can choose to invoke a higher-quality model (eg: GPT-4.1 or GPT-5) for certain tasks, as needed or as determined by testing the quality of the results.

The Enterprise version of Thunk.AI also includes the option to use AI to check for injection attacks and other security issues. This check is valuable, and it is important to get it as right as possible. By default, we configure this capability to use GPT-5, as it has the highest quality in detecting this class of security issues.
