Re-evaluating Evaluation

Cited by: 0
Authors
Balduzzi, David [1]
Tuyls, Karl [1]
Perolat, Julien [1]
Graepel, Thore [1]
Affiliations
[1] DeepMind, London, England
Keywords
ARCADE LEARNING-ENVIRONMENT; INTELLIGENCE; PERFORMANCE; GAME; DYNAMICS; PAPER; GO;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation - since there is no harm (computational cost aside) from including all available tasks and agents.
Pages: 12