Just lately, there’s been some very public (and, frankly, very humorous) AI agent and bot failures.
Like Chipotle’s assistant supporting codegen (since patched): “Cease spending cash on Claude Code. Chipotle’s assist bot is free” (r/ClaudeCode)
And in a surreal vogue, Washington state’s call-center hotline offering Spanish assist by talking English with a Spanish accent: “Washington state hotline callers hear AI voice with Spanish accent” (AP Information)
Coinciding with this, different Forrester analysts and I’ve had a spate of calls the place organizations have launched a brand new AI agent with out testing them.
Put merely, please don’t do that.
Please check your AI brokers earlier than launching them — some choices on how to do that are under.
What will we imply by this?
At minimal: Take a look at all your bot’s options (and use circumstances) your self.
For any AI agent, or new characteristic you’re introducing to it, the minimal effort you must make investments is to ensure somebody has used it as an finish consumer earlier than this goes dwell.
This may be so simple as somebody on the developer staff or as concerned as a devoted testing group. However it’s worthwhile to ensure that somebody has actively used your resolution — and all its options. This must also be accomplished on an ongoing foundation in order that when new options are launched, they’re examined, too.
This may be time-intensive, however as we see with the general public circumstances, not every little thing works as anticipated on a regular basis.
In reality, AI can go improper in additional sudden methods than earlier than. For those who can’t be certain that options are working as supposed, you then may find yourself on the information.
Please word that that is the minimal doable effort. This isn’t sufficient to make sure that one thing gained’t go improper or your software gained’t fail — this can solely catch the obvious/embarrassing outcomes. A extra sturdy testing apply is really useful.
For extra on how agentic techniques fail: Why AI Brokers Fail (And How To Repair Them)
Beneficial: Follow pink teaming.
A great way to forestall this type of sudden permutation is with pink teaming or deliberately making an attempt to interrupt the bot. We advocate this as a regular apply to your group.
There are two sides to this: One is conventional or infosec pink teaming. That is targeted on discovering safety exploits. The second is behavioral. That is targeted on getting the answer or mannequin to behave in an inappropriate or unintended vogue. It’s best to have a apply on each.
On the very least, your staff ought to kick the tires for a day and take a look at as many exploits as doable. Even when you may have a governance layer, you need to be certain that it’s holding up within the wild or, ideally, even post-launch.
For extra on the pink staff apply: Use AI Pink Teaming To Consider The Safety Posture Of AI-Enabled Purposes
For extra on customary governance approaches that must be adopted: Introducing Forrester’s AEGIS Framework: Agentic AI Enterprise Guardrails For Data Safety
For particular widespread governance failures, see AIUC-1’s web page, “The world’s first AI agent customary”
For a enjoyable instance of what employee-driven pink teaming can seem like, try Anthropic’s write-up, “Venture Vend: Can Claude run a small store? (And why does that matter?)”
Beneficial: Take a look at utilizing a testing suite and apply.
Testing an AI agent system that has agentic capabilities remains to be an rising discipline, however speedy progress is being made. To complement your testing applications (people whose job is to check your AI instruments, functions, and brokers), testing suites present extra built-in assist. There are two methods to consider testing suites right now: artificial and ongoing agentic.
Artificial exams are easy — they check your AI agent in opposition to a pattern of precreated prompts and preferrred solutions to behave as a “golden set” to check in opposition to. This lets you carry out a regression check over time to validate the query, “Does our AI agent present the right responses?”
However artificial regression exams are sometimes solely carried out for an AI agent after some noteworthy change, akin to switching out the mannequin used or introducing quite a few new use circumstances. More and more, bigger testing suites need to check robotically and constantly. Different strategies like giant language model-as-a-judge can present supplementary runtime supervision.
(Additional work is coming from Forrester on artificial testing.)
Please word that if you happen to don’t have a proper testing program for AI techniques, please both rent individuals for this or rent a testing companies firm.
For extra on constructing exams, see Anthropic’s, “Demystifying evals for AI brokers”
For extra on autonomous testing: The Forrester Wave™: Autonomous Testing Platforms, This fall 2025
For how one can make steady testing work: It’s Time To Get Actually Severe About Testing Your AI: Half Two
Beneficial: Take a look at with a consultant pattern.
The last word check of your brokers, nonetheless, will come out of your customers. They alone decide if you happen to go or fail. It’s in your greatest pursuits to make them completely satisfied.
The query is: How will we check with actual customers earlier than manufacturing? The reply is a consumer champion group (or comparable conference). These are customers who’ve both volunteered themselves or been chosen by you to check what your agent is able to.
That is simpler in internal-facing use circumstances, as worker teams are extra easy to assemble, however many customer-facing organizations can obtain the identical factor by way of voluntary check sign-ups.
The danger is that you’ve got customers who’re an overeager group who don’t make up a consultant pattern of your consumer base. In different phrases, they don’t essentially signify your common consumer. This may be prevented by way of cautious group design or, at the least, asking customers to tackle a persona when conducting the check.
If this isn’t doable, you would use a canary check/conditional rollout that may function this testbed (although it’s higher when it’s voluntary).
For extra on constructing this consumer champion group internally: Finest Practices For Inner Conversational AI Adoption


