Stanford benchmarks AI agents in healthcare

September 16, 2025

A team of Stanford University researchers developed benchmarks for measuring the accuracy and effectiveness of AI agents designed to assist physicians, and published their findings in the New England Journal of Medicine AI.

The team tested how well large language models could process medical information and complete administrative tasks for physicians, such as ordering tests, prescribing medication and retrieving patient information.

“Chatbots say things. AI agents can do things,” said Jonathan Chen, associate professor of medicine and biomedical data science and the paper’s senior author, in a Stanford news release. “This means they could theoretically directly retrieve patient information from the electronic medical record, reason about that information, and take action by directly entering in orders for tests and medications. This is a much higher bar for autonomy in the high-stakes world of medical care. We need a benchmark to establish the current state of AI capability on reproducible tasks that we can optimize toward.”

The researchers built an EHR environment containing 100 realistic patient profiles and 785,000 records to test how well the LLMs could act as AI agents and complete 300 clinical tasks. They measured the type and frequency of errors to understand how the AI agents would behave in real-world situations, and found that many struggled with the nuanced reasoning that complex workflows require. Interoperability also presented a challenge when records came from multiple health systems.

The models’ success rates were:

Claude 3.5 Sonnet v2: 69.67%

GPT-4o: 64%

DeepSeek-V3 (685B, open): 62.67%

Gemini-1.5 Pro: 62%

GPT-4o-mini: 56.33%

o3-mini: 51.67%

Qwen2.5 (72B, open): 51.33%

Llama 3.3 (70B, open): 46.33%

Gemini 2.0 Flash: 38.33%

Gemma2 (27B, open): 19.33%

Gemini 2.0 Pro: 18%

Mistral v0.3 (7B, open): 4%
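
To make the percentages more concrete, the short Python sketch below converts each reported success rate into an approximate number of completed tasks. It assumes each rate is the share of the 300 clinical tasks a model completed, which the article implies but does not state outright; the names `success_rates`, `TOTAL_TASKS` and `tasks_completed` are illustrative only and do not come from the study.

```python
# Success rates (percent) as reported in the list above.
success_rates = {
    "Claude 3.5 Sonnet v2": 69.67,
    "GPT-4o": 64.0,
    "DeepSeek-V3 (685B, open)": 62.67,
    "Gemini-1.5 Pro": 62.0,
    "GPT-4o-mini": 56.33,
    "o3-mini": 51.67,
    "Qwen2.5 (72B, open)": 51.33,
    "Llama 3.3 (70B, open)": 46.33,
    "Gemini 2.0 Flash": 38.33,
    "Gemma2 (27B, open)": 19.33,
    "Gemini 2.0 Pro": 18.0,
    "Mistral v0.3 (7B, open)": 4.0,
}

# Task count stated in the article. Treating every rate as a share of
# these 300 tasks is an assumption made for this sketch.
TOTAL_TASKS = 300


def tasks_completed(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Convert a percentage into the nearest whole number of tasks."""
    return round(rate_percent / 100 * total)


counts = {model: tasks_completed(rate) for model, rate in success_rates.items()}

for model, n in counts.items():
    print(f"{model}: about {n} of {TOTAL_TASKS} tasks")
```

Under that assumption, the top-ranked model completed roughly 209 of the 300 tasks and the lowest-ranked about 12.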

The newer LLMs showed improvement over older models, and the researchers see a path toward testing the tools in real-world pilots.

The post Stanford benchmarks AI agents in healthcare appeared first on Becker’s Hospital Review | Healthcare News & Analysis.
