Three recent studies have highlighted safety limitations of LLMs in direct patient interactions.
I led a red-teaming study with fifteen physicians, recently published in npj Digital Medicine. We found that LLMs often provide unsafe answers to patient-posed medical questions. The model with the highest percentage of unsafe responses was ChatGPT (13.5% unsafe), followed closely by Llama (13.1% unsafe). Both of these chatbots had more than twice the unsafe-response rate of the safest model, Claude (5.0% unsafe). Here are some specific safety issues that arose:
- ChatGPT recommended shaking a child’s head to remove playdough from the ear;
- Gemini claimed it is safe to breastfeed an infant from a herpes-infected breast (it is not safe; herpes can kill babies);
- Llama recommended putting tea tree oil on the eyelids (tea tree oil can cause significant eye toxicity).
A second study, published in Nature Medicine, found that ChatGPT Health failed to triage patients properly. For serious life-threatening emergencies like diabetic ketoacidosis and impending respiratory failure, ChatGPT Health directed 52% of cases to evaluation within 24 to 48 hours rather than to the emergency department. To be clear, this kind of under-triage could also kill patients. Both under-triage and over-triage have negative consequences. Under-triage can result in patients not getting the timely care they need to survive a real emergency, while over-triage (sending too many healthy people to the emergency room) can result in overcrowded emergency rooms, strain on limited resources, and worse care for everyone.
Finally, a third Nature Medicine study highlighted that doctors and patients do not interact with LLMs in the same way. When patients interacted with LLMs, they identified relevant medical conditions in only 35% of cases. That means 65% of the time, the patient's interaction with the LLM led to the wrong conclusion.
How can we make this situation safer for patients?
We need new regulatory frameworks that govern LLM-patient interaction.
Advanced medical question benchmarks are irrelevant
Previous work showing that LLMs can pass the USMLE/medical boards has sometimes been used to claim “LLMs can be doctors.” This conclusion is wrong. As we discuss more in our paper, answering advanced medical questions and answering patient-posed medical questions are different tasks.

Advanced medical questions are lengthy, phrased with advanced terminology, and strategically designed to include all necessary information and have only one right answer. This is an easier task for LLMs because LLMs produce better outputs when given a lot of information that is precisely framed.
In contrast, patient-posed medical questions are short, phrased with layperson vocabulary, open-ended, and missing critical clinical context that must be further elucidated through history-taking. This is a harder task for LLMs.
Now, you may ask: if safely responding to patient-posed medical questions is challenging, and also fundamental to being a doctor, why isn't it tested on the boards? Well, it used to be, in a way: the USMLE Step 2 Clinical Skills exam involved interacting with trained actors playing standardized patients. The exam tested real-world patient communication, physical exam skills, and clinical documentation. It was discontinued during COVID.
So we need something like USMLE Step 2 Clinical Skills, but for LLMs. And the LLMs need it even more than human trainees do, because trainees are immersed day in and day out in real patient interactions, getting feedback from supervising physicians, while LLMs don't get to go to medical school or residency. Significant clinical knowledge is passed down at the patient bedside from supervising clinicians to trainees during medical school and residency without ever being digitally recorded. Skills currently passed person-to-person, and not conveyed in large amounts of digital training data, include history-taking and the development of "clinical gestalt," a complex synthesis of visual, auditory, sensory, social, and intellectual information.
Which brings me to my first point: the government needs to step in and create a new regulatory framework for LLM-patient interactions. Here’s a sketch of how it could work:
- Patients volunteer to submit medical questions they might ask a chatbot;
- Doctors flesh out different clinical cases around each question. For example, if the question is "How do I get rid of my headache?" one case could be a patient with a subarachnoid hemorrhage (an emergency), and another a patient with chronic tension headaches (not an emergency);
- The LLM being evaluated is given just the patient’s question to start (“How do I get rid of my headache?”) and it needs to chat with the patient to figure out what might be going on and give safe advice.
- The LLM does this for every case–meaning there will be multiple conversations that start with the same question but are expected to go in different directions depending on the underlying patient situation.
- Every sentence of the LLM’s response is then analyzed by a group of other LLMs created by different model providers, and each LLM independently votes on whether that piece of advice is safe, given access to the patient’s entire clinical case description. (Ideally, ChatGPT does not evaluate ChatGPT, and Claude does not evaluate Claude, because then the model might not find its own blind spots.)
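The per-sentence voting step can be sketched in code. This is a toy illustration, not a real evaluation pipeline: `vote_fn` is a hypothetical stand-in for an API call to a judge model (which would also see the full clinical case description), and the provider names are placeholders.

```python
from collections import Counter

def judge_response(sentences, judges, evaluated_provider, vote_fn):
    """Have a jury of LLM judges vote on the safety of each sentence.

    judges: list of (provider, model_name) pairs.
    vote_fn: hypothetical callable(model_name, sentence) -> "safe" or "unsafe",
             standing in for an API call to a judge model.
    Judges from the same provider as the evaluated model are excluded,
    so a model is never asked to find its own blind spots.
    """
    eligible = [model for provider, model in judges if provider != evaluated_provider]
    results = []
    for sentence in sentences:
        votes = Counter(vote_fn(model, sentence) for model in eligible)
        results.append({
            "sentence": sentence,
            "votes": dict(votes),
            # Ties count as unsafe, erring on the side of caution.
            "verdict": "unsafe" if votes["unsafe"] >= votes["safe"] else "safe",
        })
    return results
```

A real implementation would need to decide how many judges constitute a quorum and how to handle judges that refuse to answer, but the cross-provider exclusion is the key design choice.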
It’s not a perfect framework, but we need something along these lines that directly mimics open-ended LLM-patient interaction. We have FDA approval for medical devices that diagnose, cure, mitigate, or treat disease. We need an FDA approval process for LLMs that are currently trying to diagnose, cure, mitigate, or treat disease through their text advice.
LLM providers need to be held responsible for bad patient outcomes
If LLM providers are going to give out medical advice, they need to be responsible for what happens to the patient when that advice is bad. Now, this is tricky to enforce in practice: if a patient shows up too late to the emergency room and dies, who's going to figure out whether there was a delay in care due to bad LLM advice? Perhaps family or friends may realize it. (I certainly don't think the government should be snooping into patients' private data.) Regardless of the exact mechanism, if it came to light that a patient died or experienced a bad outcome because of bad medical advice from an LLM, the LLM company should be responsible. It is unethical to offer medical advice while refusing to take responsibility for the consequences of that advice. The LLM companies can buy malpractice insurance if needed.
LLMs need better training data specific to triage and patient question-answering
A lot of LLM training data is not optimized for safe patient interactions. LLMs are trained on vast quantities of Internet-scraped text. This includes peer-reviewed literature, but also health advice written by laypeople, medical misinformation, racist and sexist content, and outdated guidelines.
LLMs as models also have several known issues that complicate answering patient questions safely:
- Hallucination and fabrication of falsehoods at random;
- Sycophantic tendencies;
- Answering immediately rather than taking a history first;
- Overconfidence;
- Worse-quality outputs when there are typos or misspellings.
It would be helpful to have a dataset that includes doctor-patient chats like this:
- Patient asks a question;
- Doctor responds and does the history-taking that’s needed in order to safely answer the question;
- Once the doctor knows enough to answer the patient’s question, the doctor provides an answer, which may include a “triage recommendation” (e.g., if the patient could be having an emergency, the doctor directs them to the emergency room).
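As a concrete illustration, one record in such a dataset might look like the sketch below. The field names are hypothetical, not an established schema; the check simply enforces the core property described above, that history-taking happens before advice.

```python
# Hypothetical schema for one doctor-patient chat record; field names
# are illustrative, not an established standard.
EXAMPLE_RECORD = {
    "patient_question": "How do I get rid of my headache?",
    "turns": [
        {"role": "doctor", "type": "history_taking",
         "text": "How long have you had it, and did it start suddenly?"},
        {"role": "patient", "type": "reply",
         "text": "It hit suddenly an hour ago, the worst headache of my life."},
        {"role": "doctor", "type": "advice",
         "text": "Please go to the emergency room right away."},
    ],
    "triage_recommendation": "emergency_department",
}

def history_precedes_advice(record):
    """Return True if at least one history-taking turn comes before the
    first piece of advice (the safety property this dataset should encode)."""
    for turn in record["turns"]:
        if turn["type"] == "history_taking":
            return True
        if turn["type"] == "advice":
            return False
    return False
```

Tagging each doctor turn as history-taking versus advice is what would let a model learn, from supervision, that questions come before answers.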
Logistically, where could companies or researchers get a dataset like this?
Idea 1: Researchers could set up a third-party website where patients can chat in real time with doctors about a concern. Perhaps the patient agrees it’s OK to use the data for making LLMs better, in exchange for instant access to a doctor for free. The problem: it would be expensive to pay for the doctors’ time, and expensive to pay for the marketing needed to make patients aware of this and interested in it. Also, why would a patient do this when they can just privately talk to ChatGPT?
Idea 2: A triage company could offer to provide hospitals a service where there’s a human doctor-based emergency room chat function. So the patient has a medical concern, they “chat the emergency room,” and it’s really some emergency medicine doctor working for this triage company who chats with the patient and figures out if they really should go to the emergency room or not. The problem: as soon as you put a real human doctor in the loop, there’s actual medical malpractice to worry about! Apparently, only LLM providers are allowed to give bad triage advice without taking any responsibility. Also, this skews towards emergency medicine/the triage problem, and doesn’t really tackle the more general problem of safely answering any kind of patient-posed medical question–including ones patients don’t think are emergencies (but might be actual emergencies).
Idea 3: Would it be possible to use chats between doctors and patients that take place within the electronic health record (e.g., MyChart)? They are not the right data distribution: patients know they're not going to get an instant answer from their doctor inside the EHR, so they don't use these messages for questions they want answered right away. Even if this somehow were the right data distribution, there's no way it would be ethically or legally acceptable for an EHR to sell plain-text versions of private doctor-patient conversations to LLM providers for training. That is creepy and a massive HIPAA violation.
Idea 4: AI scribes record and transcribe verbal chats between patients and doctors. Is this useful? Well, nobody would want their private conversation with their doctor to be used for training LLMs. Also, verbal interactions are different from text interactions because patients and doctors don’t speak the way they write. The context is different and the behavior is different: the patient is already physically in a doctor’s office which changes what the conversation is about relative to a 2 am LLM chat session.
Idea 5: The LLM providers could insert real doctors behind the scenes into chats with patients in real time. A patient asks a medical question, and the LLM provider asks if they're willing to talk to a real doctor instead. If yes, the LLM provider seamlessly switches the chat over to a real doctor behind the scenes. This might be the best data option of the ideas here, but it still has some limitations. For one, the data distribution is still going to be off. Some patients will say no and want to talk to an LLM instead of a human, maybe because they're embarrassed about the problem, and that's going to be non-random behavior. But this idea is likely the closest one could get to creating a dataset of real patient-doctor interactions that also matches the kinds of medical questions patients ask LLMs.
Idea 6: Rather than using patient-physician data, use LLMs to refine themselves. Rejection sampling is a form of supervised learning on filtered data: LLMs generate massive numbers of reasoning traces, and only those that arrive at correct solutions are kept for further training. Something like this could be extended to clinical reasoning. An LLM could be prompted to act as both doctor and patient to create a simulated conversation. It could be explicitly prompted to ask the history-taking questions needed to elucidate foundational details about the patient before providing advice. The conversation as a whole could then be checked by another LLM or group of LLMs to see whether any elements of it are unsafe, contain problematic information, are missing important information, or have other issues. Conversations that pass these checks could then be used for additional LLM training.
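A minimal sketch of that filtering loop, assuming conversation generation and safety checking are black-box callables (here replaced by a toy check rather than real LLM calls):

```python
def rejection_sample(generate_fn, check_fns, n_samples):
    """Generate simulated conversations and keep only those that pass
    every safety check: the filtering step of rejection sampling."""
    kept = []
    for _ in range(n_samples):
        # Stand-in for an LLM role-playing both doctor and patient.
        conversation = generate_fn()
        if all(check(conversation) for check in check_fns):
            kept.append(conversation)
    return kept

def asks_history_first(conversation):
    """Toy safety check: the doctor's first turn must be a question,
    i.e., history-taking before advice. A real check would be another
    LLM (ideally from a different provider) grading the transcript."""
    first_doctor_turn = next(t for t in conversation if t["role"] == "doctor")
    return first_doctor_turn["text"].rstrip().endswith("?")
```

The filtered conversations would then feed back into supervised fine-tuning; the quality of the whole loop rests on how good the checks are.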
The patient’s geographic location matters
Safely answering a patient-posed medical question depends not only on the details of the individual patient but also on their care setting. Living 5 minutes away from a major US academic medical center is different from living 3 hours away from a tiny rural hospital, which is different from living in a low-income country where the patient has to walk 10 miles to reach an understaffed hospital without cutting-edge medical technology.
It is not the patient’s responsibility to “ensure their interactions with LLMs are safe”
In a previous post I noted that some safety issues can arise because patients don't know what information they should or shouldn't include when chatting with LLMs about medical problems. However, this should not be construed as a judgment that it's the "patient's responsibility" to make their interactions with LLMs safe. It is not the patient's responsibility. It is 100% the responsibility of LLM companies to continuously improve the way their models interact with patients, so that patients aren't left guessing about the right way to interact with the models.
Getting better at taking histories
A year ago, LLMs wouldn’t do any semblance of history-taking, which was a serious problem. But things appear to be moving in the right direction on that front. For example, if I go into ChatGPT today and type in “What should I do about my headache?” the response is a blend of advice and questions, including this piece towards the beginning: “First: Quick Self-Check. Ask yourself: Have I had enough water today? Did I sleep poorly? Have I been staring at screens for hours? Am I stressed or tense (especially neck/shoulders)? Did I skip caffeine if I usually drink it?” The inclusion of questions is an improvement, where previously, there would have been only advice. The more LLMs can orient towards appropriate history-taking, the more feasible it will become for them to respond safely.
We need more research into how LLM use impacts real patient behavior
Finally, we need more research on the whole picture, especially the downstream real-world implications of patient-LLM interactions. What kinds of actions are patients taking as a result of LLM advice? Do they seek care sooner? Delay care? Treat themselves at home? And what are the end results of these actions–do the patients get better? Get worse?
LLMs are a promising technology, but we need new regulatory frameworks and dataset and modeling innovations to continue improving the safety of their interactions with patients.