A toolbox for surfacing health equity harms and biases in large language models

Cited: 0
Authors
Pfohl, Stephen R. [1 ]
Cole-Lewis, Heather [1 ]
Sayres, Rory [1 ]
Neal, Darlene [1 ]
Asiedu, Mercy [1 ]
Dieng, Awa [2 ]
Tomasev, Nenad [2 ]
Rashid, Qazi Mamunur [1 ]
Azizi, Shekoofeh [2 ]
Rostamzadeh, Negar [1 ]
McCoy, Liam G. [3 ]
Celi, Leo Anthony [4 ,5 ,6 ]
Liu, Yun [1 ]
Schaekermann, Mike [1 ]
Walton, Alanna [2 ]
Parrish, Alicia [2 ]
Nagpal, Chirag [1 ]
Singh, Preeti [1 ]
Dewitt, Akeiylah [1 ]
Mansfield, Philip [2 ]
Prakash, Sushant [1 ]
Heller, Katherine [1 ]
Karthikesalingam, Alan [1 ]
Semturs, Christopher [1 ]
Barral, Joelle [2 ]
Corrado, Greg [1 ]
Matias, Yossi [1 ]
Smith-Loud, Jamila [1 ]
Horn, Ivor [1 ]
Singhal, Karan [1 ]
Affiliations
[1] Google Research, Mountain View, CA 94043, USA
[2] Google DeepMind, Mountain View, CA, USA
[3] University of Alberta, Edmonton, AB, Canada
[4] MIT, Laboratory for Computational Physiology, Cambridge, MA, USA
[5] Beth Israel Deaconess Medical Center, Division of Pulmonary, Critical Care and Sleep Medicine, Boston, MA, USA
[6] Harvard T.H. Chan School of Public Health, Department of Biostatistics, Boston, MA, USA
Funding
US National Science Foundation
DOI
10.1038/s41591-024-03258-2
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline codes
071010; 081704
Abstract
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with the potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions, and we conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.

Identifying a panel of bias dimensions to evaluate, the authors propose a framework for assessing how prone large language models are to biased reasoning, with possible consequences for equity-related harms, and apply it in a large-scale empirical study of Med-PaLM 2 with a diverse pool of raters.
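At its core, the assessment framework described above amounts to collecting per-dimension bias judgments from multiple raters over adversarial medical questions and aggregating them. The Python sketch below illustrates that shape under stated assumptions: the dimension names loosely paraphrase the paper's rubric, and the question IDs, rater IDs, and simple flag-rate aggregation are illustrative inventions, not the authors' released tooling.

    from collections import Counter
    from dataclasses import dataclass

    # Assumed dimension names, loosely paraphrasing the multifactorial rubric;
    # the exact instrument and labels are defined in the published framework.
    DIMENSIONS = (
        "inaccuracy_for_some_identity_axes",
        "not_inclusive",
        "stereotypical_language",
        "failure_to_challenge_biased_premise",
        "omission_of_structural_factors",
        "potential_for_disproportionate_harm",
    )

    @dataclass
    class Rating:
        question_id: str   # e.g., one EquityMedQA item; IDs here are made up
        rater_id: str
        flagged: dict      # dimension name -> True if the rater reported bias

    def bias_report(ratings):
        """Fraction of ratings that flag each dimension (a simple aggregation;
        the paper reports richer analyses, including rater-group comparisons)."""
        counts = Counter()
        for r in ratings:
            for dim in DIMENSIONS:
                counts[dim] += bool(r.flagged.get(dim, False))
        n = max(len(ratings), 1)
        return {dim: counts[dim] / n for dim in DIMENSIONS}

    # Toy usage: two raters of different backgrounds review one model answer.
    ratings = [
        Rating("omaq-001", "physician-1",
               {"omission_of_structural_factors": True}),
        Rating("omaq-001", "equity-expert-1",
               {"omission_of_structural_factors": True, "not_inclusive": True}),
    ]
    print(bias_report(ratings))

The paper's counterfactual datasets additionally pair questions that differ only in identity markers, so per-dimension judgments like these can also be compared across paired answers rather than aggregated in isolation.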
Pages: 3590-3600
Page count: 30