A toolbox for surfacing health equity harms and biases in large language models

Cited: 0
Authors
Pfohl, Stephen R. [1]
Cole-Lewis, Heather [1]
Sayres, Rory [1]
Neal, Darlene [1]
Asiedu, Mercy [1]
Dieng, Awa [2]
Tomasev, Nenad [2]
Rashid, Qazi Mamunur [1]
Azizi, Shekoofeh [2]
Rostamzadeh, Negar [1]
McCoy, Liam G. [3]
Celi, Leo Anthony [4,5,6]
Liu, Yun [1]
Schaekermann, Mike [1]
Walton, Alanna [2]
Parrish, Alicia [2]
Nagpal, Chirag [1]
Singh, Preeti [1]
Dewitt, Akeiylah [1]
Mansfield, Philip [2]
Prakash, Sushant [1]
Heller, Katherine [1]
Karthikesalingam, Alan [1]
Semturs, Christopher [1]
Barral, Joelle [2]
Corrado, Greg [1]
Matias, Yossi [1]
Smith-Loud, Jamila [1]
Horn, Ivor [1]
Singhal, Karan [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Google DeepMind, Mountain View, CA USA
[3] Univ Alberta, Edmonton, AB, Canada
[4] MIT, Lab Computat Physiol, Cambridge, MA USA
[5] Beth Israel Deaconess Med Ctr, Div Pulm Crit Care & Sleep Med, Boston, MA USA
[6] Harvard TH Chan Sch Publ Hlth, Dept Biostat, Boston, MA USA
Funding
US National Science Foundation
Keywords
DOI
10.1038/s41591-024-03258-2
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010; 081704
Abstract
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.

Identifying a multifaceted panel of bias dimensions to evaluate, this work proposes a framework for assessing how prone large language models are to biased reasoning, with possible consequences for equity-related harms, and applies it to Med-PaLM 2 in a large-scale study with diverse raters.
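The abstract describes a multifactorial rubric in which human raters of varying backgrounds independently judge LLM answers along several bias dimensions, with results aggregated per dimension. The Python sketch below illustrates one way such rating records could be structured and aggregated; it is a minimal sketch, not the authors' released tooling, and the dimension names, rater identifiers and schema are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of a rubric-based bias assessment pipeline, loosely
# inspired by the multifactorial framework described in the abstract.
# Dimension names and the record schema are hypothetical assumptions.
from dataclasses import dataclass, field
from collections import Counter
from typing import Dict, List

# Hypothetical bias dimensions; the paper defines its own rubric.
DIMENSIONS = [
    "inaccuracy_across_identity_axes",  # answer quality varies by identity
    "stereotypical_language",           # stereotyping or demeaning content
    "omission_of_context",              # missing equity-relevant context
]

@dataclass
class Rating:
    question_id: str
    rater_id: str
    dimension: str
    bias_present: bool  # one independent judgment per rubric dimension

@dataclass
class EvalRun:
    ratings: List[Rating] = field(default_factory=list)

    def add(self, rating: Rating) -> None:
        self.ratings.append(rating)

    def bias_rate_by_dimension(self) -> Dict[str, float]:
        """Fraction of ratings that flag bias, per rubric dimension."""
        flagged: Counter = Counter()
        total: Counter = Counter()
        for r in self.ratings:
            total[r.dimension] += 1
            flagged[r.dimension] += int(r.bias_present)
        return {d: flagged[d] / total[d] for d in total}

# Usage: collect independent ratings from raters of differing expertise
# (e.g., clinicians and health equity experts), then aggregate.
run = EvalRun()
run.add(Rating("q1", "physician_a", "stereotypical_language", False))
run.add(Rating("q1", "equity_expert_b", "stereotypical_language", True))
run.add(Rating("q1", "physician_a", "omission_of_context", True))
print(run.bias_rate_by_dimension())
```

Keeping each dimension as an independent boolean judgment, rather than a single overall bias score, mirrors the abstract's point that narrower, single-axis evaluations can miss biases that a multifactorial rubric surfaces.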
Pages: 3590-3600
Volume: 30