
Is Your Chatbot Overconfident? Taming "Automation Bias" in Medical AI

Sofia Techera

Let’s talk about a scary thought:

What if a chatbot is too good at sounding sure of itself?

In the world of AI, there’s a sneaky phenomenon called Automation Bias. It’s that instinct we all have to trust a computer’s answer blindly, just because it sounds “smart”.

But in medical software, that kind of blind trust can be dangerous.

Today, I want to take you under the hood of a fictional project: MediBot, an AI triage assistant that uses NLP and Machine Learning to help patients decide what to do when they feel unwell.

While MediBot isn’t real, the risks and the testing strategies absolutely are.

This post was inspired by concepts from the ISTQB AI Testing syllabus and how they can be applied to real-world scenarios like this one.

Meet MediBot: Your Virtual Triage Assistant

MediBot’s job is simple: you describe your symptoms, and it gives you a recommendation.

Behind the scenes, a system like this would typically rely on:

  • NLP → to understand the user’s input
  • ML Classifier → to estimate probabilities (Is it flu? COVID? something more serious?)
  • Decision Logic → to decide whether you need rest… or urgent care

Sounds straightforward, right?

Well… not quite.

The “Oops” Moment

Imagine a user says:
“My chest hurts when I walk fast.”

Chat example showing a medical AI dismissing chest pain as indigestion despite a 10% cardiac risk probability

A system like MediBot might internally estimate something like:

  • 60% → acid reflux
  • 10% → heart-related issue

And then respond:

“It’s likely indigestion. Try resting and drinking water.”

Wait… what?

That’s a potentially serious cardiac symptom being dismissed as something minor.

And here’s the real danger:

Because the chatbot sounds calm and confident, a user might actually trust it and stay home.

That’s Automation Bias in action.
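To make the gap between “most likely” and “most dangerous” concrete, here is a minimal Python sketch. It reuses the two probabilities from the example; the function names and the 5% escalation threshold are assumptions made up for illustration, not clinical guidance and not how any real triage system is known to work.

```python
# Probabilities taken from the example above.
probabilities = {"acid_reflux": 0.60, "cardiac": 0.10}

# Assumed escalation thresholds: the minimum probability at which a
# high-severity condition alone should trigger an urgent recommendation.
# The value is illustrative only.
ESCALATION_THRESHOLDS = {"cardiac": 0.05}


def naive_triage(probs):
    """Recommend based only on the most likely condition."""
    return max(probs, key=probs.get)


def severity_aware_triage(probs, thresholds=ESCALATION_THRESHOLDS):
    """Escalate if any high-severity condition crosses its threshold."""
    for condition, threshold in thresholds.items():
        if probs.get(condition, 0.0) >= threshold:
            return f"urgent:{condition}"
    return f"routine:{naive_triage(probs)}"


print(naive_triage(probabilities))           # acid_reflux
print(severity_aware_triage(probabilities))  # urgent:cardiac
```

The naive version reproduces the “Oops” moment: it only looks at the top score. The second version treats a 10% cardiac estimate as too high to ignore, whatever else is more likely.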

How Should We Fix It? 5 Pro-Testing Moves

Inspired by ISTQB principles for AI testing, here are five approaches teams should apply to make systems like MediBot safer and more reliable.

1. The “Data Diet” (Testing for Bias)

One of the first things to question is the training data.

In a system like MediBot, it’s easy to imagine a dataset where:

  • Most cases are mild (anxiety, reflux, flu)
  • Very few represent critical conditions

That imbalance creates bias.

👉 The model learns what is common, not what is critical.

What should be done:

  • Balance the dataset with more high-risk scenarios
  • Ensure rare but dangerous conditions are well represented

Why this matters for QA:

Even if a condition is statistically rare, it must be treated as a high priority in testing. In medical AI, severity matters more than frequency.
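A quick way to spot this kind of imbalance is to count labels before training. The sketch below uses a tiny made-up label list, and the 10% minimum share is an arbitrary threshold chosen for the example. The weighting formula is the common inverse-frequency heuristic, shown here as one possible mitigation rather than a recommendation.

```python
from collections import Counter

# Hypothetical labels for a tiny training set; real data would be far larger.
labels = ["flu"] * 50 + ["reflux"] * 30 + ["anxiety"] * 18 + ["cardiac"] * 2

# Conditions treated as critical in this sketch (an assumption).
CRITICAL = {"cardiac"}
MIN_CRITICAL_SHARE = 0.10  # illustrative threshold, not a clinical standard


def underrepresented_critical(labels, critical=CRITICAL, min_share=MIN_CRITICAL_SHARE):
    """Return critical classes whose share of the dataset is below min_share."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in critical if counts[c] / total < min_share}


def balanced_class_weights(labels):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}


print(underrepresented_critical(labels))          # {'cardiac': 0.02}
print(balanced_class_weights(labels)["cardiac"])  # 12.5
```

A check like this can run as a gate in the data pipeline, so a dataset with almost no critical cases is flagged before anyone trains on it.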

2. Defect Prediction: Testing Smarter, Not Harder

Instead of guessing where bugs might appear, teams can use a Defect Prediction approach to focus on the most fragile parts of the system.

Think of it as a “weather forecast” for bugs.

In a system like MediBot, a defect prediction model could be fed:

  • Code complexity
  • History of previous defects
  • Recent changes in medical logic

The output might look like this:

  • NLP Engine → 20% risk
  • Rule Engine → 15% risk
  • ML Symptom Classifier → 65% risk 🚩

Risk bar chart showing ML Symptom Classifier at 65% defect probability, flagged as priority over NLP Engine and Rule Engine

That insight should directly shape the testing strategy.

👉 Instead of testing everything equally, teams should prioritize the highest-risk component.

In this case, that is the Symptom Classifier, where misclassification could directly impact patient safety.

Testing here should go deeper:

  • Edge cases
  • Critical symptom combinations
  • Adversarial scenarios

Why this matters for QA:

Not all parts of an AI system fail equally. Defect Prediction helps teams focus on what actually matters, so they’re not just testing more, but testing smarter.
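As a rough illustration of turning those risk scores into a plan, the sketch below ranks the components and splits a testing budget proportionally. It does not predict defects itself; it only consumes the scores from the example above. The function names and the 40-hour budget are hypothetical.

```python
# Risk scores from the example above, expressed as fractions of 1.
risk = {"NLP Engine": 0.20, "Rule Engine": 0.15, "ML Symptom Classifier": 0.65}


def prioritize(risk_scores):
    """Order components from highest to lowest predicted defect risk."""
    return sorted(risk_scores, key=risk_scores.get, reverse=True)


def allocate_hours(risk_scores, total_hours):
    """Split a testing budget proportionally to predicted risk."""
    total_risk = sum(risk_scores.values())
    return {
        name: round(total_hours * score / total_risk, 1)
        for name, score in risk_scores.items()
    }


print(prioritize(risk)[0])       # ML Symptom Classifier
print(allocate_hours(risk, 40))  # 8.0 h NLP, 6.0 h rules, 26.0 h classifier
```

Proportional allocation is just one option. A team could also set a minimum floor per component so that low-risk areas are never skipped entirely.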

3. Mixing It Up (Synthetic Data)

Let’s be real: people don’t talk like medical textbooks.

A doctor might say “angina pectoris”, but a user might say:

Five linguistic variations of the same cardiac symptom, from formal medical terminology to casual colloquial language

If the AI only understands formal language, it will fail in the real world.

What should be done:

  • Generate synthetic variations of the same intent
  • Test across different tones, styles, and expressions

This allows teams to scale testing from a few examples… to thousands.

Why this matters for QA:

Without linguistic diversity, the model doesn't truly understand users; it’s just matching keywords.
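One cheap way to get from a few examples to many is to combine sentence fragments. The sketch below is a minimal template-based generator; every phrase in it is invented for illustration, and a real project might use an LLM or human-written paraphrases instead.

```python
from itertools import product

# All fragments below are made up for illustration.
OPENERS = ["", "hey, ", "so basically "]
SYMPTOMS = [
    "my chest hurts",
    "I get a tight feeling in my chest",
    "there's pressure in my chest",
]
TRIGGERS = ["when I walk fast", "when I climb stairs", "after I rush somewhere"]

# Every generated sentence should map to the same intent (assumed label).
EXPECTED_INTENT = "possible_cardiac"


def generate_variations(openers=OPENERS, symptoms=SYMPTOMS, triggers=TRIGGERS):
    """Combine fragments into utterances that all share one intent."""
    variations = []
    for opener, symptom, trigger in product(openers, symptoms, triggers):
        sentence = f"{opener}{symptom} {trigger}"
        variations.append(sentence[0].upper() + sentence[1:])
    return variations


variations = generate_variations()
print(len(variations))  # 27
print(variations[0])    # My chest hurts when I walk fast
```

Each generated sentence can then be fed to the system under test with the same expected intent, so a single failing variation points to a gap in language coverage.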

4. The “Trick” Questions (Adversarial Testing)

To really validate robustness, you need to think like an attacker or a biased user.

Example:

Annotated chat message showing a critical chest pain symptom buried inside a distracting pizza reference, with safe vs. unsafe AI responses

A poorly tested system might latch onto “pizza” and downplay the real issue.

👉 This is a classic adversarial scenario:

A distracting detail leads the model away from a critical symptom.

What should be done:

  • Test with misleading, biased, or manipulative inputs
  • Ensure high-risk symptoms override contextual noise

Why this matters:

AI should not blindly agree with the user. Safety must take priority over being “helpful”.
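A test harness for this can be very small. In the sketch below, `triage` is a toy stand-in for the system under test (a real MediBot would call its NLP and classifier here), and the adversarial messages are hypothetical. The part worth reusing is `failing_cases`, which reports every input that was not escalated.

```python
import re

# Toy stand-in for the system under test, included only so the harness runs.
RED_FLAGS = [r"chest (pain|hurts|tightness)", r"can'?t breathe"]


def triage(message):
    """Return "urgent" if any red-flag pattern appears, else "routine"."""
    text = message.lower()
    if any(re.search(pattern, text) for pattern in RED_FLAGS):
        return "urgent"
    return "routine"


# Hypothetical adversarial inputs: a critical symptom wrapped in distractions
# or in the user's own reassuring self-diagnosis.
ADVERSARIAL_CASES = [
    "I ate a huge pizza last night and now my chest hurts, probably just the pizza",
    "It's surely nothing, but I've had chest pain since my run this morning",
    "I'm just stressed about work, though I also can't breathe properly",
]


def failing_cases(triage_fn, cases=ADVERSARIAL_CASES):
    """Return every adversarial input that was not escalated."""
    return [case for case in cases if triage_fn(case) != "urgent"]


print(failing_cases(triage))                      # []
print(len(failing_cases(lambda msg: "routine")))  # 3
```

The second call simulates a system that always agrees with the user's reassuring framing, and the harness reports all three cases as failures.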

5. Managing the “Vibe” (Automation Bias Evaluation)

This one is subtle but critical.

Compare this:

Comparison table showing a risky AI response dismissing chest pain as anxiety versus a safer response recommending medical attention

What should be done:

  • Avoid overly confident language
  • Introduce uncertainty when appropriate
  • Encourage safe actions in ambiguous cases

Why this matters:

Confidence directly influences user decisions. Reducing overconfidence helps mitigate Automation Bias.
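Tone can be checked automatically, at least roughly. The sketch below flags overconfident wording and verifies that a response includes a safe call to action. Both phrase lists and both sample responses are invented for this example; in practice they would need to be curated with clinicians and would only complement human review.

```python
# Illustrative phrase lists (assumptions, not a validated vocabulary).
OVERCONFIDENT = ["definitely", "nothing to worry about", "it's just", "no need to see"]
SAFE_ACTIONS = ["seek medical attention", "see a doctor", "call emergency services"]


def evaluate_tone(response):
    """Flag overconfident wording and check for a safe call to action."""
    text = response.lower()
    return {
        "overconfident_phrases": [p for p in OVERCONFIDENT if p in text],
        "has_safe_action": any(a in text for a in SAFE_ACTIONS),
    }


def is_acceptable(response):
    """Acceptable means no overconfident phrases and at least one safe action."""
    result = evaluate_tone(response)
    return not result["overconfident_phrases"] and result["has_safe_action"]


risky = "It's just anxiety. There is nothing to worry about."
safer = (
    "This could have several causes. Because chest pain can be serious, "
    "please seek medical attention."
)

print(is_acceptable(risky))  # False
print(is_acceptable(safer))  # True
```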

From Theory to Practice: A Real-World Example (Redi)

While MediBot is a fictional example, these challenges show up in real systems too, just in different ways.

At Xmartlabs, we built a product called Redi that tackles a similar challenge from a different angle. Unlike MediBot, it doesn’t diagnose or estimate probabilities. Instead, it helps users prepare for a medical consultation by asking about their symptoms and generating a summary with suggested questions for their doctor.

At first glance, it’s a much simpler system:

  • It relies on an LLM and prompt design
  • There’s no custom-trained classifier or complex decision engine
  • The logic is embedded directly in how the model is guided

But the safety challenge remains.

👉 What happens if a user describes an emergency?

This is where a key concept comes in: hard stops.

If a user mentions symptoms that could indicate something critical, like a stroke, the system should not continue the normal flow. It should interrupt the experience and clearly direct the user to seek immediate help.

Three-step emergency flow diagram showing how Redi interrupts normal triage when stroke symptoms are detected

For example:

  • Suggest calling emergency services
  • Show nearby clinics or hospitals
  • Stop generating non-critical guidance

And this is not just a design decision; it’s something that must be thoroughly tested.

As part of QA, we need to validate that the hard stop is consistently triggered across different ways of expressing the same emergency.

For example, instead of testing just:

  • “I think I’m having a stroke”

We should also test variations like:

  • “My face feels numb, and I can’t move my arm”
  • “I suddenly can’t speak properly”
  • “One side of my body feels weak”

Why this matters:

Users don’t describe emergencies in predictable ways. If the system only catches one phrasing, it’s not truly safe.
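Here is a sketch of how that check could look in Python. It is not Redi's actual test suite: `respond` is a hypothetical callable wrapping the assistant, assumed to return a dict with a boolean `hard_stop` key. The fake assistant exists only to show what a failing run looks like.

```python
# Phrasings taken from the lists above; all describe the same emergency.
STROKE_PHRASINGS = [
    "I think I'm having a stroke",
    "My face feels numb, and I can't move my arm",
    "I suddenly can't speak properly",
    "One side of my body feels weak",
]


def check_hard_stop(respond, phrasings=STROKE_PHRASINGS):
    """Return the phrasings for which the hard stop did NOT trigger.

    `respond` is a hypothetical wrapper around the assistant; it is assumed
    to return a dict with a boolean "hard_stop" key.
    """
    return [p for p in phrasings if not respond(p).get("hard_stop", False)]


def keyword_only_assistant(message):
    """Fake assistant that only recognizes the literal word "stroke"."""
    return {"hard_stop": "stroke" in message.lower()}


missed = check_hard_stop(keyword_only_assistant)
print(len(missed))  # 3
```

The keyword-only fake catches the explicit phrasing and misses the other three, which is exactly the gap this kind of test is meant to expose.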

In this case, the “decision logic” isn’t a separate module; it’s part of the prompt and the guardrails around the LLM.

Even in simpler AI systems, safety cannot be optional. The implementation might look different (a rules engine in one case, prompt design in another), but the responsibility is the same.

There’s no single formula for testing AI systems.

Each architecture requires a different strategy, but the goal is always consistent:

Ensure the system behaves safely when it matters most.

The Bottom Line

As QAs, we’re not just testing features anymore.

We’re testing:

  • decisions
  • assumptions
  • and real-world consequences

In AI systems, being “smart” is not enough.

A smart chatbot is useful. But a safe chatbot is essential.

Because in the end, the real question isn’t: “Does it work?” It’s: “Does it fail safely when it matters most?”

What do you think? Have you ever trusted an AI a bit too much, just because it sounded confident?