OpenAI Releases HealthBench Dataset to Test AI in Health Care

By I. Edwards HealthDay Reporter

Medically reviewed by Carmen Pope, BPharm. Last updated on May 13, 2025.

via HealthDay

TUESDAY, May 13, 2025 — OpenAI has unveiled a large dataset to help test how well artificial intelligence (AI) models answer health care questions.

Experts call it a major step forward, but they also say more work is needed to ensure safety.

The dataset — called HealthBench — is OpenAI's first major independent health care project. It includes 5,000 “realistic health conversations,” each with detailed grading tools to evaluate AI responses, STAT News reported.

“Our mission as OpenAI is to ensure AGI is beneficial to humanity,” Karan Singhal, head of the San Francisco-based company's health AI team, said. AGI is shorthand for artificial general intelligence.

“One part of that is building and deploying technology,” Singhal said. “Another part of it is ensuring that positive applications like health care have a place to flourish and that we do the right work to ensure that the models are safe and reliable in these settings.”

The dataset was created with help from 262 doctors who have worked in 60 countries. They provided more than 57,000 unique criteria to judge how well AI models answer health questions.

HealthBench aims to address a common problem: the difficulty of comparing different AI models fairly.

“What OpenAI has done is they have provided this in a scalable way from a really big, reputable brand that’s going to enable people to use this very easily,” Raj Ratwani, a health AI researcher at MedStar Health, said.

The 5,000 examples in HealthBench were made using synthesized conversations designed by physicians.

“We wanted to balance the benefits of being able to release the data with, of course, the privacy constraints of using realistic data,” Singhal told STAT News.

The dataset also includes a special group of 1,000 hard examples where AI models struggled. OpenAI hopes this group “provides a worthy target for model improvements for months to come,” STAT News reported.

OpenAI also tested its own models alongside models from Google, Meta, Anthropic and xAI. OpenAI’s o3 model scored the best, especially in communication quality, STAT News reported.

But models performed poorly in areas like context awareness and completeness, experts said.

Some warned about OpenAI grading its own models.

“In sensitive contexts like healthcare, where we are discussing life and death, that level of opacity is unacceptable,” Hao explained.

Others noted that AI itself was used to grade some of the responses, which could result in errors being overlooked.

It “may hide errors shared by both model and grader,” Girish Nadkarni, head of artificial intelligence and human health at the Icahn School of Medicine at Mount Sinai in New York City, told STAT News.

He and others called for more reviews to ensure models work well in different countries and among different demographics.

“HealthBench improves LLM healthcare evaluation but still needs subgroup analysis and wider human review before it can support safety claims,” Nadkarni said.

Sources

  • STAT News, May 12, 2025

Disclaimer: Statistical data in medical articles provide general trends and do not pertain to individuals. Individual factors can vary greatly. Always seek personalized medical advice for individual healthcare decisions.

© 2025 HealthDay. All rights reserved.
