

Let’s talk about a scary thought:
What if a chatbot is too good at sounding sure of itself?
In the world of AI, there’s a sneaky phenomenon called Automation Bias. It’s that instinct we all have to trust a computer’s answer blindly, just because it sounds “smart”.
But in medical software, that kind of blind trust can be dangerous.
Today, I want to take you under the hood of a fictional project: MediBot, an AI triage assistant that uses NLP and Machine Learning to help patients decide what to do when they feel unwell.
While MediBot isn’t real, the risks and the testing strategies absolutely are.
This post was inspired by concepts from the ISTQB AI Testing syllabus and how they can be applied to real-world scenarios like this one.
MediBot’s job is simple: you describe your symptoms, and it gives you a recommendation.
Behind the scenes, a system like this would typically rely on:
Sounds straightforward, right?
Well… not quite.
Imagine a user says:
“My chest hurts when I walk fast.”
A system like MediBot might internally estimate something like:
And then respond:
“It’s likely indigestion. Try resting and drinking water.” ❌
Wait… what?
That’s a potentially serious cardiac symptom being dismissed as something minor.
And here’s the real danger:
Because the chatbot sounds calm and confident, a user might actually trust it and stay home.
That’s Automation Bias in action.
Inspired by ISTQB principles for AI testing, here are five approaches teams should apply to make systems like MediBot safer and more reliable.
One of the first things to question is the training data.
In a system like MediBot, it’s easy to imagine a dataset where:
That imbalance creates bias.
👉 The model learns what is common, not what is critical.
What should be done:
Why this matters for QA:
Even if a condition is statistically rare, it must be treated as a high priority in testing. In medical AI, severity matters more than frequency.
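To make “severity matters more than frequency” concrete, here is a minimal sketch of a dataset check that flags conditions that are both critical and rare. The label names, counts, severity ratings, and thresholds are all made up for illustration; a real project would take them from its own data and clinical guidance.

```python
from collections import Counter

# Hypothetical training labels; the counts are invented for this example.
labels = ["indigestion"] * 900 + ["muscle_strain"] * 80 + ["cardiac_event"] * 20

# Hypothetical severity ratings (1 = minor, 5 = life-threatening).
severity = {"indigestion": 1, "muscle_strain": 2, "cardiac_event": 5}


def underrepresented_critical(labels, severity, min_share=0.05, min_severity=4):
    """Return (label, share) for critical classes that make up too little of the data."""
    counts = Counter(labels)
    total = sum(counts.values())
    flagged = []
    for label, sev in severity.items():
        share = counts.get(label, 0) / total
        # Flag only classes that are both severe and rare.
        if sev >= min_severity and share < min_share:
            flagged.append((label, round(share, 3)))
    return flagged


flagged = underrepresented_critical(labels, severity)
print(flagged)
```

A flagged class is a signal to collect or augment more examples for it and to give it extra weight in the test plan, even though it is statistically rare.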
Instead of guessing where bugs might appear, a Defect Prediction approach can help teams focus on the most fragile parts of the system.
Think of it as a “weather forecast” for bugs.
In a system like MediBot, a model could be fed with:
The output might look like this:
That insight should directly shape the testing strategy.
👉 Instead of testing everything equally, teams should prioritize the highest-risk component.
In this case, the Symptom Classifier, where misclassification could directly impact patient safety.
Testing here should go deeper:
Why this matters for QA:
Not all parts of an AI system fail equally. Defect Prediction helps teams focus on what actually matters, so they’re not just testing more, but testing smarter.
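As a rough illustration of risk-based prioritization, the sketch below ranks components by a hand-weighted score. This is a simple heuristic standing in for a trained defect-prediction model, not the model itself. Apart from the Symptom Classifier, the component names, metrics, and weights are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    recent_changes: int  # commits that touched the component recently
    past_defects: int    # defects previously traced back to it
    safety_impact: int   # 1 (cosmetic) to 5 (patient safety)


def risk_score(c: Component) -> float:
    # Illustrative weights only; a trained model would learn these from history.
    likelihood = 0.5 * c.recent_changes + 1.0 * c.past_defects
    return likelihood * c.safety_impact


components = [
    Component("symptom_classifier", recent_changes=12, past_defects=7, safety_impact=5),
    Component("chat_ui", recent_changes=20, past_defects=3, safety_impact=1),
    Component("appointment_scheduler", recent_changes=4, past_defects=2, safety_impact=2),
]

# Highest-risk component first: this is where testing effort should go deepest.
ranked = sorted(components, key=risk_score, reverse=True)
for c in ranked:
    print(f"{c.name}: {risk_score(c):.1f}")
```

Multiplying likelihood by impact keeps a frequently changed but low-stakes component (like the UI here) from outranking one where a defect could hurt a patient.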
Let’s be real: people don’t talk like medical textbooks.
A doctor might say “angina pectoris”, but a user might say:
If the AI only understands formal language, it will fail in the real world.
What should be done:
This allows teams to scale testing from a few examples… to thousands.
Why this matters for QA:
Without linguistic diversity, the model doesn’t truly understand users; it’s just matching keywords.
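One cheap way to start scaling up test inputs is template expansion: combine interchangeable fragments into every possible phrasing of the same symptom. The sketch below assumes a small, hand-written set of fragments for exertional chest pain; the fragments are hypothetical, and other generation techniques would work just as well.

```python
from itertools import product

# Hypothetical lay-language fragments for one clinical concept.
openers = ["My chest", "The middle of my chest"]
sensations = ["hurts", "feels tight", "feels like it's being squeezed"]
triggers = ["when I walk fast", "when I climb stairs", "whenever I exercise"]


def generate_variants(openers, sensations, triggers):
    """Combine fragments into every possible phrasing of the same symptom."""
    return [f"{o} {s} {t}." for o, s, t in product(openers, sensations, triggers)]


variants = generate_variants(openers, sensations, triggers)
print(len(variants))
print(variants[0])
```

Every variant should lead to the same triage outcome, so each one becomes a test case with a known expected result. Adding a few more fragments per slot multiplies the number of cases quickly.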
4. The “Trick” Questions (Adversarial Testing)
To really validate robustness, you need to think like an attacker or a biased user.
Example:
A poorly tested system might latch onto “pizza” and downplay the real issue.
👉 This is a classic adversarial scenario:
A distracting detail leads the model away from a critical symptom.
What should be done:
Why this matters:
AI should not blindly agree with the user. Safety must take priority over being “helpful”.
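A distractor test can be sketched as follows: wrap a critical symptom in irrelevant or reassuring details and check that the outcome does not change. The `triage` function here is a toy stand-in for the system under test, and the red-flag terms and distractor sentences are hypothetical.

```python
RED_FLAGS = ("chest", "numb", "can't breathe")  # hypothetical and far from complete


def triage(message: str) -> str:
    """Toy stand-in for the system under test: escalates on any red-flag term."""
    text = message.lower()
    if any(flag in text for flag in RED_FLAGS):
        return "urgent"
    return "self_care"


def with_distractors(symptom: str, distractors):
    """Prefix a critical symptom with irrelevant or reassuring details."""
    return [f"{d} {symptom}" for d in distractors]


distractors = [
    "I ate a lot of pizza last night, so it's probably nothing, but",
    "I'm sure it's just stress, still",
    "My friend says I worry too much, anyway",
]
cases = with_distractors("my chest hurts when I walk fast.", distractors)

# Any case that is not escalated is an adversarial failure.
failures = [c for c in cases if triage(c) != "urgent"]
print(f"{len(cases) - len(failures)}/{len(cases)} adversarial cases escalated")
```

In a real suite, `triage` would be replaced by a call to the actual model, and the list of failures would be reviewed case by case rather than just counted.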
This one is subtle but critical.
Compare this:
What should be done:
Why this matters:
Confidence directly influences user decisions. Reducing overconfidence helps mitigate Automation Bias.
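One way to keep the wording honest is to tie it to the model’s confidence score. The sketch below is a minimal example under stated assumptions: the function name, the 0.85 threshold, and the response wording are all hypothetical, and a real system would need calibrated probabilities for the threshold to mean anything.

```python
def phrase_recommendation(condition: str, confidence: float, threshold: float = 0.85) -> str:
    """Pick wording that matches how sure the model actually is."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= threshold:
        return f"This is most consistent with {condition}."
    # Below the threshold, say so explicitly and point to a human.
    return (
        f"I'm not certain. One possibility is {condition}, "
        "but please check with a medical professional."
    )


confident = phrase_recommendation("indigestion", 0.95)
unsure = phrase_recommendation("indigestion", 0.60)
print(confident)
print(unsure)
```

Tests can then assert that low-confidence outputs never use assertive wording, which turns “don’t sound overconfident” into something checkable.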
While MediBot is a fictional example, these challenges show up in real systems too, just in different ways.
At Xmartlabs, we built a product called Redi that tackles a similar challenge from a different angle. Unlike MediBot, it doesn’t diagnose or estimate probabilities. Instead, it helps users prepare for a medical consultation by asking about their symptoms and generating a summary with suggested questions for their doctor.
At first glance, it’s a much simpler system:
But the safety challenge remains.
👉 What happens if a user describes an emergency?
This is where a key concept comes in: hard stops.
If a user mentions symptoms that could indicate something critical, like a stroke, the system should not continue the normal flow. It should interrupt the experience and clearly direct the user to seek immediate help.
For example:
And this is not just a design decision; it’s something that must be thoroughly tested.
As part of QA, we need to validate that the hard stop is consistently triggered across different ways of expressing the same emergency.
For example, instead of testing just:
We should also test variations like:
Why this matters:
Users don’t describe emergencies in predictable ways. If the system only catches one phrasing, it’s not truly safe.
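A hard-stop check across phrasings could look like the sketch below. Everything in it is an assumption for illustration: `fake_assistant` is a regex-based stand-in so the example runs on its own (the real assistant would call an LLM), and the emergency wording and stroke phrasings are hypothetical.

```python
import re

EMERGENCY_MARKER = "call emergency services"  # hypothetical hard-stop wording

# Hypothetical patterns used only by the stand-in assistant below.
STROKE_PATTERNS = [r"face.*droop", r"slurr", r"can'?t (move|feel|lift)", r"stroke"]


def fake_assistant(message: str) -> str:
    """Stand-in for the real assistant, which would call an LLM."""
    text = message.lower()
    if any(re.search(p, text) for p in STROKE_PATTERNS):
        return "This may be an emergency. Please call emergency services now."
    return "Thanks. Can you tell me more about your symptoms?"


def missed_hard_stops(assistant, phrasings):
    """Return the phrasings for which the assistant did not trigger the hard stop."""
    return [p for p in phrasings if EMERGENCY_MARKER not in assistant(p).lower()]


stroke_phrasings = [
    "I think I'm having a stroke",
    "One side of my face is drooping",
    "My words are coming out slurred",
    "I suddenly can't lift my right arm",
]

missed = missed_hard_stops(fake_assistant, stroke_phrasings)
print(missed)
```

Because `missed_hard_stops` accepts any callable, the same list of phrasings can be run against the real assistant, and any phrasing it returns is a safety gap to fix before release.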
In this case, the “decision logic” isn’t a separate module; it’s part of the prompt and the guardrails around the LLM.
Even in simpler AI systems, safety cannot be optional. The implementation might look different (a rules engine vs. prompt design), but the responsibility is the same.
There’s no single formula for testing AI systems.
Each architecture requires a different strategy, but the goal is always consistent:
Ensure the system behaves safely when it matters most.
As QAs, we’re not just testing features anymore.
We’re testing:
In AI systems, being “smart” is not enough.
A smart chatbot is useful. But a safe chatbot is essential.
Because in the end, the real question isn’t: “Does it work?” It’s: “Does it fail safely when it matters most?”
What do you think? Have you ever trusted an AI a bit too much, just because it sounded confident?