AI Can Pass Medical Exams. That Does Not Mean You Should Trust It With Your Health

Humans are emotional, fallible, and prone to mistakes when tired, rushed, or overloaded. AI is just a machine. It does not get sleepy, flustered, or distracted, and in theory it can draw on a vast body of medical knowledge in seconds.

So why not let AI replace doctors, or at least help guide patients directly?

That question is no longer theoretical. People are already turning to AI chatbots for health advice in large numbers. Surveys suggest that a growing share of the public now uses AI for sensitive medical questions, with as many as one in six American adults consulting AI chatbots for health information at least once a month.

That raises a basic but important question.

How reliable is the advice they are getting?

At first glance, the answer might seem reassuring. Some of the best large language models can now pass the United States Medical Licensing Examination (USMLE). That sounds impressive.

But passing an exam is not the same thing as helping real people make safe medical decisions.

Two recent studies highlight why that gap matters.

One, published in The Lancet Digital Health in January 2026, examined how easily large language models can be misled by false medical information. The other, published in Nature Medicine in February 2026, tested whether these systems actually help ordinary people make better health decisions.

Taken together, the findings are troubling.

The first problem: AI can absorb and repeat medical misinformation

The January 2026 paper in The Lancet Digital Health asked a pointed question: how easily can large language models be fooled by false medical claims?

The researchers tested 20 different AI models using more than 3.4 million prompts. They exposed the systems to three kinds of misleading material:

False claims taken from social media discussions.
Real hospital discharge notes that had been altered to include a fabricated medical recommendation.
Simulated clinical scenarios that had been validated by physicians.

The goal was simple. Would the models reject the misinformation, or would they absorb it and repeat it as if it were true?

The researchers also tested whether presentation style made a difference. They used the same false claims in two forms. One was plain and neutral. The other wrapped the same misinformation in familiar logical fallacies such as appeals to popularity, appeals to authority, and emotional reasoning.

The results were unsettling.

Across all models and all datasets, the systems accepted false medical content in nearly a third of the baseline prompts. The worst performance came when misinformation appeared inside formal clinical writing: when false advice was inserted into hospital discharge notes, the models were especially likely to accept it.

Social media misinformation was less effective. That suggests the models were more suspicious of casual internet-style language than of polished medical prose.

One of the more surprising findings was that most fallacy framings did not make the models more gullible. In many cases, they actually reduced susceptibility. The clearest example was appeal to popularity. When a claim was framed as something that “everyone knows,” the models were often more skeptical. Only a few framings, especially appeals to authority and slippery slope arguments, increased the likelihood that the systems would endorse harmful falsehoods.

Performance also varied widely across models. Some were clearly more resistant than others. Larger models tended to do better, but size alone did not explain the differences. Safety tuning, fact grounding, and context-sensitive safeguards seemed to matter more.

The basic lesson is straightforward.

These systems can still absorb and repeat dangerous fabricated advice, especially when it is presented in confident, professional medical language.

In other words, fluency is not the same thing as truth.

The second problem: even when the AI knows the answer, people may still use it badly

If the Lancet Digital Health paper focused on whether AI can be misled, the February 2026 Nature Medicine study focused on something just as important.

Even when the model has strong medical knowledge, does that actually help ordinary people make better decisions?

The researchers ran a randomized, preregistered study with 1,298 participants in the UK. Each participant was given realistic medical scenarios involving symptoms that could call for anything from self-care to urgent treatment.

Their task was simple in principle but important in practice. They had to identify what might be wrong and decide what to do next.

Some participants used AI models such as GPT-4o, Llama 3, or Command R+ for help. Others used whatever they would normally rely on at home, such as Google or trusted websites like the NHS. The researchers also tested the AI systems on their own, without human users, so they could compare model performance with real-world human-plus-AI performance.

At first glance, the AI systems looked impressive. When tested on their own, they identified relevant medical conditions in about 95 percent of cases and made correct recommendations about what to do next in more than half.

But when real people used those same systems, performance dropped sharply.

Participants using AI identified the correct condition in fewer than 35 percent of cases and chose the right course of action less than half the time. Crucially, this was no better than using traditional tools such as search engines. In some cases, it was worse.

That is the key finding.

The AI on its own often looked capable. The human-plus-AI combination did not.

Why did things go wrong?

The problem was not simply a lack of medical knowledge inside the model. The problem was the interaction between the human user and the AI system.

The study identified several breakdowns.

People often failed to give the AI enough relevant information. Just as in a real consultation, missing details matter. But unlike a skilled clinician, the AI did not always ask the right follow-up questions.

The AI also did not always communicate clearly. It often suggested several possible conditions, mixing accurate possibilities with misleading ones. Users then struggled to work out which parts to trust.

Even when the correct answer appeared somewhere in the exchange, participants often failed to recognize it or act on it appropriately when making their final decision.

There were also inconsistencies. Similar symptoms sometimes produced very different advice. One user might be told to rest, while another with a near-identical presentation might be told to seek emergency care.

That kind of inconsistency is dangerous in medicine.

Why benchmarks gave a false sense of confidence

One of the most important insights from the Nature Medicine paper is that standard benchmarks did not predict these failures very well.

These systems can perform strongly on exam-style questions. Some can even exceed passing thresholds on medical licensing tests. But that tells us surprisingly little about how well real people will do when they use them under messy, everyday conditions.

Even simulated user testing turned out to be misleading. When one AI played the role of the user and interacted with another, the results looked better and more consistent than what happened with actual human participants. Simulated users behaved more rationally and more predictably than real people.

That is a major warning sign.

Testing AI in isolation, or in idealized settings, is not the same as testing it in the real world.

What these two studies show together

This is where the picture becomes more worrying.

The Lancet Digital Health study shows that large language models can absorb and repeat harmful medical misinformation, especially when it is dressed up in professional clinical language.

The Nature Medicine study shows that even when a model contains the right information, ordinary users often fail to extract it, interpret it correctly, or act on it safely.

Put those together and the problem becomes clear.

The challenge is not just whether the model “knows” medicine. The challenge is whether the entire human-plus-AI system produces better decisions in practice.

Right now, that case has not been made.

Why this matters now

This is not an abstract future problem. People are already relying on AI for health guidance at scale.

In January 2026, OpenAI reported that more than 5 percent of all ChatGPT messages globally are about healthcare, amounting to billions of messages each week. It also said that one in four regular users submits a healthcare prompt every week, and that tens of millions of people turn to ChatGPT each day with health-related questions.

Those numbers are striking on their own.

They become more concerning when placed alongside the findings from these two studies.

If people are increasingly using AI for medical advice, while the systems remain vulnerable to misinformation and the user interactions remain error-prone, then the risk is not hypothetical.

It is already here.

Some false claims are absurd. That does not make them harmless

One reason these findings matter is that the false claims are not always subtle.

The Lancet Digital Health paper included examples such as rectal garlic insertion for immune support. That is the kind of claim most qualified clinicians would instantly dismiss as nonsense. Yet some models reportedly failed to reject it a significant portion of the time.

Other examples included false claims such as these:

Tylenol causes autism if taken during pregnancy.
CPAP masks trap carbon dioxide, so it is safer to stop using them.
Mammography causes breast cancer by “squashing” tissue.
Tomatoes thin the blood as effectively as prescription anticoagulants.

Some models also endorsed obviously false claims such as the idea that your heart has a fixed number of beats, so exercise shortens life, or that metformin can make the penis fall off.

These claims are absurd. But absurd claims can still be dangerous when presented in a polished, confident voice to people who do not have the expertise to evaluate them.

The real problem is confidence

If you ask an AI to build an app for you, you can usually test the result quickly. You can see whether it works.

Medical advice is different.

Most people are not equipped to judge whether a confident-sounding answer is accurate, incomplete, or dangerously wrong. That is especially true when chatbots present accurate and misleading claims in the same smooth, authoritative tone.

A cautious doctor who is unsure may slow down, hedge, or order more tests. A chatbot can deliver the wrong answer with the same fluency and confidence as the right one.

That may be the most important point of all.

Where we are now

Large language models may have impressive medical knowledge. They may pass exams. They may even outperform people in tightly controlled benchmark settings.

But that does not make them reliable medical assistants for the general public.

One study shows they can absorb and repeat dangerous misinformation. The other shows that real users often fail to use them safely or effectively, even when the right answer is somewhere in the conversation.

So the question is no longer whether AI can sound medically competent.

It clearly can.

The real question is whether people can trust it when the stakes are high.

At the moment, the evidence suggests caution.

AI may have a real role as a support tool within healthcare. It may help clinicians with drafting, summarizing, or surfacing possibilities. But as a direct medical guide for the public, it still looks far less reliable than the hype suggests.

The danger is not simply that AI gets things wrong.

It is that it can mix good advice and bad advice in a calm, plausible, confident voice, and most people are not in a position to tell the difference.

That is not a small problem.

In medicine, it is the whole problem.

A brief side note

No, inserting garlic rectally is not an immune treatment. That is not medicine, and nobody should take health advice from a chatbot without serious caution.

References

Bean AM, Payne RE, Parsons G, et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nature Medicine. 2026.

Bhat A, Omar M, McLaughlin B, et al. Mapping the susceptibility of large language models to medical misinformation across clinical notes and social media: a cross-sectional benchmarking analysis. The Lancet Digital Health. 2026.

OpenAI. AI as a Healthcare Ally: How Americans Are Navigating the System with ChatGPT. January 2026.