Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics

Authors
Valentini, Francisco [1, 2]
Sosa, Juan Cruz [3]
Slezak, Diego Fernandez [1, 3]
Altszyler, Edgar [1, 2]
Affiliations
[1] CONICET UBA, Inst Invest Ciencias Computac, Buenos Aires, DF, Argentina
[2] Univ Buenos Aires UBA, Maestria Data Min, Buenos Aires, DF, Argentina
[3] UBA, Dept Computac, FCEyN, Buenos Aires, DF, Argentina
Keywords
LANGUAGE;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent research has shown that static word embeddings can encode words' frequencies. However, this behavior has received little study. In the present work, we study how frequency and semantic similarity relate to one another in static word embeddings, and we assess the impact of this relationship on embedding-based bias metrics. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled, and holds for different hyperparameter settings. This proves that the patterns we find are due neither to real semantic associations nor to specific parameter choices, but are an artifact produced by the word embeddings. To illustrate how frequencies can affect the measurement of biases related to gender, ethnicity, and affluence, we carry out a controlled experiment that shows that biases can even change sign or reverse their order when word frequencies change.
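The kind of frequency-versus-similarity comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mean_similarity_by_frequency_bin`, the use of log-frequency quantile bins, and the random toy data are all assumptions made here for the example. With real embeddings, a top-right or bottom-right cell that is clearly larger than the others would correspond to the reported pattern of higher similarity between high-frequency words.

```python
import numpy as np


def mean_similarity_by_frequency_bin(vectors, counts, n_bins=3):
    """Return an (n_bins, n_bins) matrix of mean cosine similarities.

    vectors: (n_words, dim) array of word embeddings (hypothetical input).
    counts:  (n_words,) array of corpus frequencies, all > 0.
    Bin 0 holds the rarest words, bin n_bins - 1 the most frequent.
    """
    vectors = np.asarray(vectors, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # Normalise rows so that dot products are cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T

    # Assign each word to a quantile bin of log10 frequency.
    log_counts = np.log10(counts)
    edges = np.quantile(log_counts, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(log_counts, edges)

    # Average similarity over all word pairs in each pair of bins,
    # excluding each word's similarity with itself.
    off_diagonal = ~np.eye(len(counts), dtype=bool)
    result = np.full((n_bins, n_bins), np.nan)
    for i in range(n_bins):
        for j in range(n_bins):
            mask = np.outer(bins == i, bins == j) & off_diagonal
            if mask.any():
                result[i, j] = sims[mask].mean()
    return result


# Toy data only: random vectors carry no frequency signal by construction.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(300, 50))
toy_counts = rng.integers(1, 100_000, size=300)
matrix = mean_similarity_by_frequency_bin(toy_vectors, toy_counts)
```

Running the same function on embeddings trained on a randomly shuffled corpus would serve as a control in the spirit of the abstract: shuffling removes real semantic associations, so any remaining structure in the matrix would be attributable to frequency.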
Pages: 113-126
Page count: 14