A recent New York Times article titled “AI Chatbots Defeated Doctors at Diagnosing Illness” covers a study published in JAMA Network Open. The main point of the study was to determine whether access to an LLM could help doctors diagnose more effectively. The outcome? Physicians using an LLM did not perform any better than physicians working alone. But this isn’t the part of the study that generated the most buzz. Instead, the news has focused on a secondary finding: that an LLM alone outperformed physicians at the diagnosis step. What’s going on here?
What’s going on is that even in the cases where the LLM “did the diagnosis,” it was not, in fact, doing the diagnosis by itself. What do I mean? The LLM was fed case reports, carefully crafted by doctors to include all of the information needed to make the diagnosis, and the LLM just had to do the very final step of outputting what that diagnosis was, without doing any of the diagnostic work that went into creating the case report in the first place. This aspect of the study has been swept under the rug, but it is essential for properly contextualizing what the study implies about LLM capabilities.
The study itself states that the cases provided to the LLM were “based on actual patients” and “included information available on initial diagnostic evaluation, such as history, physical examination, and laboratory test results”. This phrasing is misleading unless you are familiar with how medicine works. It’s actually referring to the information available at the end of the initial diagnostic evaluation, not the beginning. At the beginning – the true beginning, where a doctor starts – nothing is known about a patient.
A thin, stooped man limps into the emergency department with shortness of breath. What do we know about him? Nothing! The administrative staff at the front desk have to ask his name and figure out if he’s in the electronic health record system (EHR) or not. If he is in the EHR, a doctor needs to parse through the incredibly disorganized mess of records to locate potentially relevant information about the patient – for example, a diagnosis of COPD or a previous hospital admission for pneumonia. If he’s not in the EHR, then the doctor still only knows his name and birthday. A triage nurse will begin taking a history using her clinical expertise, and a doctor will then go into the patient’s room and take an even more comprehensive history, to figure out what is going on with him this time, today. The history is the patient’s story. It matters if he just woke up in the morning suddenly short of breath vs. if he visited his niece who’s sick with a cold vs. if he spent an hour inhaling paint thinner. A doctor also does a physical exam and listens to his heart and lungs, and synthesizes all the complex auditory information coming through the stethoscope into a few descriptive words: maybe “wheezing in the right upper lobe” or “CTAB” (clear to auscultation bilaterally, i.e. everything sounds normal). A doctor needs to order lab tests and imaging based on the diagnostic process already active in her mind. Does he need a chest x-ray? A CT scan? A CBC? The doctor will decide based on her differential diagnosis.
Then, after the results come in, doctors make the diagnosis, and all of these pieces that human beings gathered up in an intelligent way can be written down into a case report. A case report is not some neutral document that regurgitates every random medical fact about a patient in no particular order. It is a persuasive document, specifically designed to handhold the reader toward the same correct diagnosis that the medical team ultimately made based on their care of the patient. As part of this, the case report leaves out anything useless. It isn’t going to mention that the patient with shortness of breath stubbed his toe on the dresser last week, even though he might mention this and wave his toe around. The case report represents 99% of the diagnostic process, because the entire way it was constructed was based around a diagnostic process happening in a doctor’s mind.
So basically, in the study, doctors still did 99% of the work for the LLM, because doctors wrote the case reports. The study was not actually comparing the diagnostic ability of an LLM to the diagnostic ability of doctors, because a huge part of diagnostic ability is doing all the other critical doctor-y things described above. The New York Times article glosses over this, and the study itself does too, admitting only that “this does not capture competence in many other areas important to clinical reasoning, including patient interviewing and data collection” – as if patient interviewing and data collection were somehow separate from the diagnostic process, when they are in fact an integral part of it.
A few other issues:
The total number of cases considered was small: only 6. This doesn’t cover a particularly broad or deep diagnostic scope.
The clinical vignettes used were adapted from “a landmark study” (this one) which happens to have been published in 1994. That was thirty years ago. Medicine has changed a lot in thirty years. In 1994, COVID didn’t exist, the chicken pox vaccine didn’t exist, and the human genome hadn’t been sequenced (ref). I can understand why the authors would want to use classic cases from previously published research, because researchers in general like to cite past research, but a key part of a doctor’s job is to be able to engage in a modern diagnostic process, not a 30-year-old diagnostic process.
Finally, the authors also point out in their discussion that they did a significant amount of work on prompt engineering, providing details on the “task, context, and instructions”. It’s not clear on what cases they did this prompt engineering. If they did the prompt engineering on some other cases that were not the 6 considered, then that’s fair. But if they did any prompt engineering specifically on the 6 cases they ultimately considered, that would be analogous to training on the test set.
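To make the “task, context, and instructions” idea concrete for readers who haven’t worked with LLM prompts, here is a minimal sketch of what a structured diagnostic prompt might look like. This is purely illustrative: the vignette text, the model name, and the exact wording are hypothetical placeholders, not the prompt the authors actually used.

```python
# Illustrative only: a prompt structured into a task, context, and instructions,
# in the general spirit the study authors describe. All specifics are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

case_vignette = """
A 45-year-old man presents with two weeks of progressive dyspnea on exertion.
History, physical exam findings, and initial laboratory results would appear
here, exactly as written up by the clinical team.
"""

prompt = (
    "Task: You are assisting with differential diagnosis.\n"
    "Context: The following is a case vignette summarizing the initial "
    "diagnostic evaluation of a patient.\n"
    f"{case_vignette}\n"
    "Instructions: List the three most likely diagnoses in order of "
    "likelihood, and state which findings support or argue against each.\n"
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder for whichever model is being evaluated
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

The test-set concern above is simply about where any iteration on a template like this happens: if the wording was tuned against the same 6 cases used in the final comparison, the evaluation is no longer clean.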
Overall, I do commend the authors for the study’s original purpose – formally evaluating whether LLMs can help doctors with diagnosis. There has been a lot of discussion around giving doctors access to LLMs during their clinical workflows, and we need to know whether this will help or not. But I wish the news coverage of research like this would stop pretending that diagnosis is a single step. If doctors nicely arrange 999 pieces of a jigsaw puzzle and an LLM puts in the 1000th piece, that reveals more about how well doctors can write case reports than about how well LLMs can diagnose.
[Featured Image by Ryoji Iwata on Unsplash]
Want to be the first to hear about my upcoming book bridging healthcare, artificial intelligence, and business—and get a free list of my favorite health AI resources? Sign up here.
