Three recent studies have highlighted safety limitations of LLMs in direct patient interactions.

I led a red-teaming study with fifteen physicians, recently published in npj Digital Medicine. We found that LLMs often provide unsafe answers to patient-posed medical questions. The model with the highest percent of unsafe responses was ChatGPT (13.5% unsafe), followed closely by Llama (13.1% unsafe). Both of these chatbots had more than twice the rate of unsafe responses as the safest model (Claude, 5.0% unsafe). Here are some specific safety issues that arose:

  • ChatGPT recommended shaking a child’s head to remove playdough from the ear;
  • Gemini claimed it is safe to breastfeed an infant off a herpes-infected breast (it’s not safe–herpes can kill babies);
  • Llama recommended putting tea tree oil on the eyelids (tea tree oil can cause significant eye toxicity).

This Nature Medicine study found that ChatGPT Health failed to triage patients properly. For serious life-threatening emergencies like diabetic ketoacidosis and impending respiratory failure, ChatGPT Health directed 52% of cases to 24 to 48 hour evaluation rather than the emergency department. To be clear, this kind of under-triage could also kill patients. Both under-triage and over-triage have negative consequences. Under-triage can result in patients not getting the timely care they need to survive a real emergency, while over-triage (sending too many healthy people to the emergency room) can result in overcrowded emergency rooms, strain on limited resources, and worse care for everyone.

Finally, this Nature Medicine study highlighted that doctors and patients do not interact in the same way with LLMs. When patients interacted with LLMs, they were only able to identify relevant medical conditions in 35% of cases. That means 65% of the time, the patient interaction with the LLM led to wrong conclusions.

How can we make this situation safer for patients?

We need new regulatory frameworks that govern LLM-patient interaction.

Previous work showed that LLMs can pass the USMLE/medical boards. However, as we discuss more in our paper, answering advanced medical boards questions and answering patient-posed medical questions are different tasks.

Advanced medical questions are lengthy, phrased with advanced terminology, and strategically designed to include all necessary information and have only one right answer. This is an easier task for LLMs because LLMs produce better outputs when given a lot of information that is precisely framed.

In contrast, patient-posed medical questions are short, phrased with layperson vocabulary, open-ended, and missing critical clinical context that must be further elucidated through history-taking. This is a harder task for LLMs.

Now, you may ask, if safely responding to patient-posed medical questions is challenging, and also fundamental to being a doctor, why isn’t it tested on the boards? Well, it kind of used to be–the USMLE Step 2 Clinical Skills exam involved interacting with trained actors behaving as standardized patients. The exam tested real-world patient communication, physical exam skills, and clinical documentation. It was discontinued during COVID.

So we need something like USMLE Step 2 Clinical Skills, but for LLMs. And the LLMs need it even more badly than the human trainees, because the human trainees are immersed day-in and day-out in real patient interactions, getting feedback from supervising physicians, while the LLMs don’t get to go to medical school or residency. Significant clinical knowledge is passed down at the patient bedside from supervising clinicians to trainees during medical school and residency without being digitally recorded. Skills currently passed person-to-person and not conveyed in large amounts of digital training data include history-taking, and the development of “clinical gestalt” which involves a complex synthesis of visual, auditory, sensory, social, and intellectual information.

Which brings me to my first point: the government needs to create a new regulatory framework for LLM-patient interactions. Here’s a sketch of how it could work:

  • Patients volunteer to submit medical questions they might ask a chatbot;
  • Doctors flesh out different clinical cases around each question. For example if the question is “How do I get rid of my headache?” one case could be for a patient with a subarachnoid hemorrhage (emergency), and another for a patient with chronic tension headaches (not an emergency).
  • The LLM being evaluated is given just the patient’s question to start (“How do I get rid of my headache?”) and it needs to chat with the patient to figure out what might be going on and give safe advice.
  • The LLM does this for every case–meaning there will be multiple conversations that start with the same question but are expected to go in different directions depending on the underlying patient situation.
  • Every sentence of the LLM’s response is then analyzed by a group of other LLMs created by different model providers, and each LLM independently votes on whether that piece of advice is safe, given access to the patient’s entire clinical case description. (Ideally, ChatGPT does not evaluate ChatGPT, and Claude does not evaluate Claude, because then the model might not find its own blind spots.)

The above sketch is not a perfect framework, but we need something along these lines that directly mimics open-ended LLM-patient interaction. We have FDA approval for medical devices that diagnose, cure, mitigate, or treat disease. We need an FDA approval process for LLMs that are currently trying to diagnose, cure, mitigate, or treat disease through their text advice.

(I don’t think an evaluation with a static approach to grading would be sufficient on its own. For example, HealthBench involves rubrics created by physicians that are used to judge conversations. But the key limitation of this rubric-based approach is that the model only gets penalized for including unsafe advice that’s explicitly listed in the rubric as bad. Ensembling several LLMs to query whether a specific piece of advice is unsafe in a given clinical context could be an alternative approach to catch additional unusual or unexpected safety issues that nobody thought of beforehand.)

LLM providers need to be held responsible for bad patient outcomes

If LLM providers are going to give out medical advice they need to be responsible for what happens to the patient when that advice is bad. Now, this is tricky to enforce in practice: if a patient shows up too late to the emergency room and dies, who’s going to figure out if there was a delay in care due to bad LLM advice? Perhaps family or friends may realize it. (I certainly don’t think the government should snoop on patients’ private data.) Regardless of the exact mechanism, if it came to light that a patient died or experienced a bad outcome because of getting bad medical advice from an LLM, the LLM company should be considered responsible. It is unethical to offer medical advice while refusing to take responsibility for the consequences of that advice. The LLM companies should buy malpractice insurance if needed.

It is not the patient’s responsibility to “ensure their interactions with LLMs are safe”

In a previous post I noted that some safety issues can arise because patients don’t know what information they should or shouldn’t include when chatting with LLMs about medical problems. However–this should not be construed as a judgement that it’s the “patient’s responsibility” to make their interactions with LLMs safe. It is not the patient’s responsibility. It is the responsibility of LLM companies to continuously improve the way their models interact with patients so that patients aren’t left guessing about the right way to interact with the models.

Getting better at taking histories

A year ago, LLMs wouldn’t do any semblance of history-taking, which was a serious problem. But things appear to be moving in the right direction on that front. For example, if I go into ChatGPT today and type in “What should I do about my headache?” the response is a blend of advice and questions, including this piece towards the beginning: “First: Quick Self-Check. Ask yourself: Have I had enough water today? Did I sleep poorly? Have I been staring at screens for hours? Am I stressed or tense (especially neck/shoulders)? Did I skip caffeine if I usually drink it?” The inclusion of questions is an improvement, where previously, there would have been only advice. The more LLMs can orient towards appropriate history-taking, the more feasible it will become for them to respond safely.

We need more research into how LLM use impacts real patient behavior

Finally, we need more research on the whole picture, especially the downstream real-world implications of patient-LLM interactions. What kinds of actions are patients taking as a result of LLM advice? Do they seek care sooner? Delay care? Treat themselves at home? And what are the end results of these actions–do the patients get better? Get worse?

LLMs are a promising technology to empower patients, but we need new regulatory frameworks and modeling innovations to continue improving the safety of their patient interactions.

Stay updated

Want to be the first to hear about my upcoming book bridging healthcare, artificial intelligence, and business—and get a free list of my favorite health AI resources? Sign up here.