The price of debiasing automatic metrics in natural language evaluation

Cited: 0
Authors
Chaganty, Arun Tejasvi [1 ]
Mussmann, Stephen [1 ]
Liang, Percy [1 ]
Affiliations
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Keywords
DOI
Not available
CLC number
TP39 [Computer Applications];
Subject classification code
081203 ; 0835 ;
Abstract
For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7-13% cost reduction on evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights two fundamental bottlenecks, the automatic metric and the prompt shown to human evaluators, both of which need to be improved to obtain greater cost savings.
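The abstract's core idea, using an automatic metric as a control variate for human judgments, can be illustrated with a minimal sketch of the standard control-variates estimator. This is not the paper's implementation; the function and variable names are illustrative, and it assumes the metric's population mean is known or cheaply computable over the full (unlabeled) dataset.

```python
import numpy as np

def control_variate_estimate(human, metric, metric_mean_all):
    """Unbiased estimate of the mean human judgment, using an
    automatic metric as a control variate.

    human           : human scores on the labeled subsample
    metric          : automatic-metric scores on the same subsample
    metric_mean_all : metric mean over the full dataset (cheap to compute)
    """
    human = np.asarray(human, dtype=float)
    metric = np.asarray(metric, dtype=float)
    # Optimal coefficient alpha = Cov(Y, g) / Var(g); the variance of the
    # adjusted estimator shrinks by a factor of (1 - rho^2), where rho is
    # the human-metric correlation.
    cov = np.cov(human, metric, ddof=1)
    alpha = cov[0, 1] / cov[1, 1]
    # Unbiased because E[metric - metric_mean_all] = 0.
    return np.mean(human - alpha * (metric - metric_mean_all))
```

The (1 - rho^2) factor is what makes the paper's finding intuitive: with a weakly correlated metric like BLEU, rho is small and the variance reduction, hence the cost saving, is modest.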
Pages: 643-653
Page count: 11
Related Papers
50 results in total
  • [1] A Study of Automatic Metrics for the Evaluation of Natural Language Explanations
    Clinciu, Miruna-Adriana
    Eshghi, Arash
    Hastie, Helen
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 2376 - 2387
  • [2] Natural Language Generation, its Evaluation and Metrics
    Gehrmann, Sebastian
    Adewumi, Tosin
    Aggarwal, Karmanya
    Ammanamanchi, Pawan Sasanka
    Anuoluwapo, Aremu
    Bosselut, Antoine
    Chandu, Khyathi Raghavi
    Clinciu, Miruna
    Das, Dipanjan
    Dhole, Kaustubh D.
    Du, Wanyu
    Durmus, Esin
    Gangal, Varun
    Garbacea, Cristina
    Hashimoto, Tatsunori
    Hou, Yufang
    Jernite, Yacine
    Jhamtani, Harsh
    Ji, Yangfeng
    Jolly, Shailza
    Kale, Mihir
    Kumar, Dhruv
    Ladhak, Faisal
    Madaan, Aman
    Maddela, Mounica
    Mahajan, Khyati
    Mahamood, Saad
    Majumder, Bodhisattwa Prasad
    Martins, Pedro Henrique
    McMillan-Major, Angelina
    Mille, Simon
    van Miltenburg, Emiel
    Nadeem, Moin
    Narayan, Shashi
    Nikolaev, Vitaly
    Niyongabo, Rubungo Andre
    Osei, Salomey
    Parikh, Ankur
    Perez-Beltrachini, Laura
    Rao, Niranjan Ramesh
    Raunak, Vikas
    Rodriguez, Juan Diego
    Santhanam, Sashank
    Sedoc, Joao
    Sellam, Thibault
    Shaikh, Samira
    Shimorina, Anastasia
    Sobrevilla Cabezudo, Marco Antonio
    Strobelt, Hendrik
    Subramani, Nishant
    [J]. 1ST WORKSHOP ON NATURAL LANGUAGE GENERATION, EVALUATION, AND METRICS (GEM 2021), 2021, : 96 - 120
  • [3] Towards the Necessity for Debiasing Natural Language Inference Datasets
    Panenghat, Mithun Paul
    Suntwal, Sandeep
    Rafique, Faiz
    Sharp, Rebecca
    Surdeanu, Mihai
    [J]. PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 6883 - 6888
  • [4] Automatic recognition and evaluation of natural language commands
    Majewski, Maciej
    Kacalak, Wojciech
    [J]. ADVANCES IN NEURAL NETWORKS - ISNN 2006, PT 3, PROCEEDINGS, 2006, 3973 : 1155 - 1160
  • [5] MENLI: Robust Evaluation Metrics from Natural Language Inference
    Chen, Yanran
    Eger, Steffen
    [J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2023, 11 : 804 - 825
  • [6] The Glass Ceiling of Automatic Evaluation in Natural Language Generation
    Colombo, Pierre
    Peyrard, Maxime
    Noiry, Nathan
    West, Robert
    Piantanida, Pablo
    [J]. 13TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING AND THE 3RD CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, IJCNLP-AACL 2023, 2023, : 178 - 183
  • [7] LINGO : Visually Debiasing Natural Language Instructions to Support Task Diversity
    Arunkumar, A.
    Sharma, S.
    Agrawal, R.
    Chandrasekaran, S.
    Bryan, C.
    [J]. COMPUTER GRAPHICS FORUM, 2023, 42 (03) : 409 - 421
  • [8] Debiasing Methods in Natural Language Understanding Make Bias More Accessible
    Mendelson, Michael
    Belinkov, Yonatan
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 1545 - 1557
  • [9] Automatic Extraction of Legal Norms: Evaluation of Natural Language Processing Tools
    Ferraro, Gabriela
    Lam, Ho-Pun
    Tosatto, Silvano Colombo
    Olivieri, Francesco
    Islam, Mohammad Badiul
    van Beest, Nick
    Governatori, Guido
    [J]. NEW FRONTIERS IN ARTIFICIAL INTELLIGENCE, JSAI-ISAI 2019, 2020, 12331 : 64 - 81
  • [10] Towards Stable Natural Language Understanding via Information Entropy Guided Debiasing
    Du, Li
    Ding, Xiao
    Sun, Zhouhao
    Liu, Ting
    Qin, Bing
    Liu, Jingshuo
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 2868 - 2882