OpenAI Releases HealthBench Dataset to Test AI in Health Care

Medically reviewed by Carmen Pope, BPharm. Last updated on May 13, 2025.

By I. Edwards HealthDay Reporter

TUESDAY, May 13, 2025 — OpenAI has unveiled a large dataset to help test how well artificial intelligence (AI) models answer health care questions.

Experts call it a major step forward, but they also say more work is needed to ensure safety.

The dataset — called HealthBench — is OpenAI's first major independent health care project. It includes 5,000 “realistic health conversations,” each with detailed grading tools to evaluate AI responses, STAT News reported.

“Our mission as OpenAI is to ensure AGI is beneficial to humanity,” Karan Singhal, head of the San Francisco-based company's health AI team, said. AGI is shorthand for artificial general intelligence.

“One part of that is building and deploying technology,” Singhal said. “Another part of it is ensuring that positive applications like health care have a place to flourish and that we do the right work to ensure that the models are safe and reliable in these settings.”

The dataset was created with help from 262 doctors who have worked in 60 countries. They provided more than 57,000 unique criteria to judge how well AI models answer health questions.
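To make the idea of grading a response against physician-written criteria concrete, here is a minimal Python sketch. It is an illustration only, not OpenAI's published method: the `Criterion` class, the point values, the example criteria and the `rubric_score` function are all hypothetical, and the scoring rule (points earned divided by the maximum positive points, clipped to the range 0 to 1) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One grading criterion (hypothetical structure for illustration)."""
    description: str
    points: int  # positive = desirable behavior, negative = penalized behavior


def rubric_score(criteria, met):
    """Score one response against its rubric.

    `met` is a list of booleans parallel to `criteria`, saying whether a
    grader judged each criterion to apply to the response.
    Assumed rule: earned points / maximum positive points, clipped to [0, 1].
    """
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c, is_met in zip(criteria, met) if is_met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for a single conversation.
criteria = [
    Criterion("Advises seeking emergency care for the described symptoms", 5),
    Criterion("Asks about relevant medical history", 3),
    Criterion("Recommends a specific prescription dose without context", -4),
]

# The grader found the first and third criteria applied to the response.
score = rubric_score(criteria, [True, False, True])
print(score)
```

Under these assumptions, a response that meets a desirable criterion but also triggers a penalized one loses most of its credit, which is how a rubric can reward good advice and punish unsafe advice in the same score.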

HealthBench aims to fix a common problem: the difficulty of comparing different AI models fairly.

“What OpenAI has done is they have provided this in a scalable way from a really big, reputable brand that’s going to enable people to use this very easily,” Raj Ratwani, a health AI researcher at MedStar Health, said.

The 5,000 examples in HealthBench are synthetic conversations designed by physicians.

“We wanted to balance the benefits of being able to release the data with, of course, the privacy constraints of using realistic data,” Singhal told STAT News.

The dataset also includes a special group of 1,000 hard examples where AI models struggled. OpenAI hopes this group “provides a worthy target for model improvements for months to come,” STAT News reported.

OpenAI tested its own models alongside models from Google, Meta, Anthropic and xAI. Its o3 model scored best, particularly on communication quality, STAT News reported.

But models performed poorly in areas like context awareness and completeness, experts said.

Some warned about OpenAI grading its own models.

“In sensitive contexts like healthcare, where we are discussing life and death, that level of opacity is unacceptable,” Hao explained.

Others noted that AI itself was used to grade some of the responses, which could result in errors being overlooked.

It “may hide errors shared by both model and grader,” Girish Nadkarni, head of artificial intelligence and human health at the Icahn School of Medicine at Mount Sinai in New York City, told STAT News.

He and others called for more reviews to ensure models work well in different countries and among different demographics.

“HealthBench improves LLM healthcare evaluation but still needs subgroup analysis and wider human review before it can support safety claims,” Nadkarni said.

Sources

  • STAT News, May 12, 2025

Disclaimer: Statistical data in medical articles provide general trends and do not pertain to individuals. Individual factors can vary greatly. Always seek personalized medical advice for individual healthcare decisions.

© 2025 HealthDay. All rights reserved.
