OpenAI Releases HealthBench Dataset to Test AI in Health Care

By I. Edwards, HealthDay Reporter

Medically reviewed by Carmen Pope, BPharm. Last updated on May 13, 2025.

TUESDAY, May 13, 2025 — OpenAI has unveiled a large dataset to help test how well artificial intelligence (AI) models answer health care questions.

Experts call it a major step forward, but they also say more work is needed to ensure safety.

The dataset — called HealthBench — is OpenAI's first major independent health care project. It includes 5,000 “realistic health conversations,” each with detailed grading tools to evaluate AI responses, STAT News reported.

“Our mission as OpenAI is to ensure AGI is beneficial to humanity,” Karan Singhal, head of the San Francisco-based company's health AI team, said. AGI is shorthand for artificial general intelligence.

“One part of that is building and deploying technology,” Singhal said. “Another part of it is ensuring that positive applications like health care have a place to flourish and that we do the right work to ensure that the models are safe and reliable in these settings.”

The dataset was created with help from 262 doctors who have worked in 60 countries. They provided more than 57,000 unique criteria to judge how well AI models answer health questions.

HealthBench aims to fix a common problem: the difficulty of comparing different AI models fairly.

“What OpenAI has done is they have provided this in a scalable way from a really big, reputable brand that’s going to enable people to use this very easily,” Raj Ratwani, a health AI researcher at MedStar Health, said.

The 5,000 examples in HealthBench were made using synthesized conversations designed by physicians.

“We wanted to balance the benefits of being able to release the data with, of course, the privacy constraints of using realistic data,” Singhal told STAT News.

The dataset also includes a special group of 1,000 hard examples where AI models struggled. OpenAI hopes this group “provides a worthy target for model improvements for months to come,” STAT News reported.

OpenAI also tested its own models as well as models from Google, Meta, Anthropic and xAI. OpenAI’s o3 model scored the best, especially in communication quality, STAT News reported.

But models performed poorly in areas like context awareness and completeness, experts said.

Some warned about OpenAI grading its own models.

“In sensitive contexts like healthcare, where we are discussing life and death, that level of opacity is unacceptable,” Hao explained.

Others noted that AI itself was used to grade some of the responses, which could result in errors being overlooked.

It “may hide errors shared by both model and grader,” Girish Nadkarni, head of artificial intelligence and human health at the Icahn School of Medicine at Mount Sinai in New York City, told STAT News.

He and others called for more reviews to ensure models work well in different countries and among different demographics.

“HealthBench improves LLM healthcare evaluation but still needs subgroup analysis and wider human review before it can support safety claims,” Nadkarni said.

Sources

STAT News, May 12, 2025

Disclaimer: Statistical data in medical articles provide general trends and do not pertain to individuals. Individual factors can vary greatly. Always seek personalized medical advice for individual healthcare decisions.

Posted May 2025
