A General Methodology to Quantify Biases in Natural Language Data

Cited by: 2
Authors
Chen, Jiawei [1]
Xu, Anbang [2]
Liu, Zhe [2]
Guo, Yufan [2]
Liu, Xiaotong [2]
Tong, Yingbei [2]
Akkiraju, Rama [2]
Carroll, John M. [3]
Affiliations
[1] Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 USA
[2] IBM Corp, Almaden Res Ctr, San Jose, CA USA
[3] Penn State Univ, Coll Informat Sci & Technol, University Pk, PA 16802 USA
Keywords
Quantify bias; natural language data;
DOI
10.1145/3334480.3382949
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Biases in data, such as gender and racial stereotypes, are propagated through intelligent systems and amplified in end-user applications. Existing studies detect and quantify biases based on pre-defined attributes. In practice, however, it is difficult to gather a comprehensive list of sensitive concepts for the various categories of bias. We propose a general methodology to quantify dataset biases by measuring the difference between a dataset's distribution and that of a reference dataset using Maximum Mean Discrepancy. For natural language data, we show that lexicon-based features quantify explicit stereotypes, while deep learning-based features further capture implicit stereotypes represented by complex semantics. Our method provides a more flexible way to detect potential biases.
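To make the idea of comparing a dataset with a reference dataset concrete, the following is a minimal sketch of a Maximum Mean Discrepancy estimate. It is not the authors' implementation: the RBF kernel, the `gamma` value, the biased (V-statistic) estimator, and the synthetic two-dimensional vectors standing in for text features are all assumptions made for illustration, and the names `rbf_kernel`, `mmd_squared`, and `sample` are hypothetical.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2); the kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(X, Y, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples X and Y."""
    def mean_kernel(A, B):
        # Average kernel value over all pairs (a, b) with a in A and b in B.
        return sum(rbf_kernel(a, b, gamma) for a in A for b in B) / (len(A) * len(B))

    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2.0 * mean_kernel(X, Y)


def sample(rng, n, dim, shift):
    """Synthetic feature vectors (stand-ins for lexicon or embedding features)."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
reference = sample(rng, 100, 2, 0.0)  # hypothetical reference dataset
similar = sample(rng, 100, 2, 0.0)    # drawn from the same distribution
shifted = sample(rng, 100, 2, 1.5)    # drawn from a shifted distribution

mmd_similar = mmd_squared(reference, similar, gamma=0.5)
mmd_shifted = mmd_squared(reference, shifted, gamma=0.5)
print(f"MMD^2 vs. similar sample: {mmd_similar:.4f}")
print(f"MMD^2 vs. shifted sample: {mmd_shifted:.4f}")
```

Under these assumptions, the estimate for the shifted sample should come out clearly larger than the one for the similar sample, which is the sense in which a larger discrepancy from the reference data signals a potential bias. In a text setting, the vectors would be replaced by whatever features are extracted from each document.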
Pages: 9