Perturbation CheckLists for Evaluating NLG Evaluation Metrics

Cited by: 0
Authors
Sai, Ananya B. [1 ,2 ]
Dixit, Tanay [1 ]
Sheth, Dev Yashpal [1 ,2 ]
Mohan, Sreyas [3 ]
Khapra, Mitesh M. [1 ,2 ,4 ]
Affiliations
[1] Indian Inst Technol, Madras, Tamil Nadu, India
[2] IIT Madras, Robert Bosch Ctr Data Sci & Artificial Intelligen, Madras, Tamil Nadu, India
[3] NYU, Ctr Data Sci, New York, NY 10003 USA
[4] AI4Bharat, Chennai, Tamil Nadu, India
DOI: Not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well with human scores on all desirable criteria, for most NLG tasks. Given this situation, we propose CheckLists for better design and evaluation of automatic metrics. We design templates which target a specific criterion (e.g., coverage) and perturb the output such that the quality gets affected only along that specific criterion (e.g., the coverage drops). We show that existing evaluation metrics are not robust against even such simple perturbations and disagree with scores assigned by humans to the perturbed output. The proposed templates thus allow for a fine-grained assessment of automatic evaluation metrics, exposing their limitations, and will facilitate better design, analysis and evaluation of such metrics.
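To make the checklist idea concrete, below is a minimal, hypothetical sketch and not the paper's actual templates or metrics: a toy coverage perturbation drops the final clause of an output, and a toy unigram-overlap scorer stands in for an automatic metric. A metric that is sensitive to coverage should assign the perturbed output a strictly lower score; one whose score barely moves fails that checklist item.

    def unigram_overlap(hypothesis: str, reference: str) -> float:
        """Toy stand-in for an automatic NLG metric: the fraction of reference
        tokens that also appear in the hypothesis (a crude coverage proxy)."""
        hyp_tokens = set(hypothesis.lower().split())
        ref_tokens = reference.lower().split()
        if not ref_tokens:
            return 0.0
        return sum(tok in hyp_tokens for tok in ref_tokens) / len(ref_tokens)

    def drop_last_clause(text: str) -> str:
        """Hypothetical coverage perturbation: delete the final clause so that
        fluency stays intact but part of the content is no longer covered."""
        clauses = text.split(", ")
        return ", ".join(clauses[:-1]) if len(clauses) > 1 else text

    reference = "The hotel is in central Paris, has 120 rooms, and offers free breakfast"
    original = "The hotel is in central Paris, has 120 rooms, and offers free breakfast"
    perturbed = drop_last_clause(original)  # coverage drops; fluency does not

    score_before = unigram_overlap(original, reference)
    score_after = unigram_overlap(perturbed, reference)

    # A coverage-sensitive metric should score the perturbed output strictly lower.
    print(f"original : {score_before:.2f}")
    print(f"perturbed: {score_after:.2f}")
    print("passes coverage check:", score_after < score_before)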
Pages: 7219-7234
Page count: 16