A new study examines how large language models perform in a variety of medical contexts, including real-world emergency room settings—where at least one model appeared to be more accurate than human doctors.
The study was published this week in Science and comes from a research team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers said they conducted a variety of experiments to measure how OpenAI’s models compare to human doctors.
In one experiment, researchers focused on 76 patients who entered Beth Israel’s emergency room, comparing diagnoses offered by two internal medicine physicians to those generated by OpenAI’s o1 and 4o models. These diagnoses were evaluated by two other attending physicians, who were unaware of which were human-derived and which were AI-derived.
“At each diagnostic point of contact, o1 either nominally performed better or equivalently to both attending physicians and 4o,” the study said, adding that the differences “were particularly pronounced at the first diagnostic point of contact (initial ER triage), where there is the least available patient information and the most urgent need for the correct decision.”
In a Harvard Medical School press release about the study, the researchers emphasized that they did not “pre-process the data at all”; the AI models were presented with the same information that was available in the electronic medical records at the time of each diagnosis.
With this information, the o1 model delivered “the exact or very close diagnosis” 67% of the time at the screening stage, compared with one doctor who had the exact or close diagnosis 55% of the time and the other who hit the mark 50% of the time.
“We tested the AI model on nearly every benchmark, and it outperformed both previous models and our physician baselines,” Arjun Manrai, head of the artificial intelligence lab at Harvard Medical School and one of the study’s lead authors, said in the press release.
To be clear, the study did not claim that AI is ready to make real life-or-death decisions in the emergency room. Instead, Manrai said the findings show an “urgent need for future trials to evaluate these technologies in real-world patient care settings.”
The researchers also noted that they only studied how the models performed when provided with text-based information, and that “existing studies suggest that current baseline models are more limited in reasoning against non-textual inputs.”
Adam Rodman, a Beth Israel physician who is also one of the study’s lead authors, warned the Guardian that there is “no formal accountability framework right now” around AI diagnoses and that patients still “want people to guide them in life or death decisions [and] to guide them through difficult treatment decisions.”
In a post about the study, Kristen Panthagani, an emergency physician, said this is “an interesting AI study that has gotten some very hyped headlines,” especially because it compared AI diagnoses to those by internal medicine doctors, not ER doctors.
“If we’re going to compare AI tools to the clinical abilities of doctors, we should start by comparing to doctors who actually practice that specialty,” Panthagani said. “I wouldn’t be surprised if an LLM could beat a dermatologist on a neurosurgery exam, [but] that’s not a particularly useful thing to know.”
Panthagani also asserted, “As an ER doctor seeing a patient for the first time, my primary goal is not to guess your final diagnosis. My primary goal is to determine if you have a condition that could kill you.”
This post and title have been updated to reflect the fact that the diagnoses in the study came from attending internal medicine physicians and to include comments from Kristen Panthagani.
