Re-evaluating Evaluation

Cited by: 0

Authors:

Balduzzi, David [1]

Tuyls, Karl [1]

Perolat, Julien [1]

Graepel, Thore [1]

Affiliations:

[1] DeepMind, London, England

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018) | 2018 / Volume 31

Keywords:

ARCADE LEARNING-ENVIRONMENT; INTELLIGENCE; PERFORMANCE; GAME; DYNAMICS; PAPER; GO;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation - since there is no harm (computational cost aside) from including all available tasks and agents.
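The redundancy claim in the abstract can be illustrated with a small sketch. It rests on assumptions not stated in this record: the agent-vs-agent data is an antisymmetric payoff matrix, and agents are scored against a Nash equilibrium mixture of opponents rather than a uniform one. The paper specifies the maximum-entropy Nash equilibrium; this sketch does not enforce that selection and only approximates some equilibrium by Hedge (multiplicative weights) self-play. The toy win/loss matrix and the names `approx_nash` and `scores` are hypothetical.

```python
import math


def approx_nash(payoff, steps=20000):
    """Approximate a symmetric equilibrium of an antisymmetric zero-sum game.

    Uses Hedge (multiplicative weights) self-play and returns the time-averaged
    mixed strategy. Assumes payoff entries lie in [-1, 1].
    """
    n = len(payoff)
    eta = math.sqrt(2.0 * math.log(n) / steps)  # standard fixed step size
    cumulative = [0.0] * n  # cumulative payoff of each pure strategy
    strategy_sum = [0.0] * n  # running sum of the mixed strategies played
    for _ in range(steps):
        top = max(cumulative)  # subtract the max for numerical stability
        weights = [math.exp(eta * (c - top)) for c in cumulative]
        total = sum(weights)
        p = [w / total for w in weights]
        for i in range(n):
            strategy_sum[i] += p[i]
            cumulative[i] += sum(payoff[i][j] * p[j] for j in range(n))
    return [s / steps for s in strategy_sum]


def scores(payoff, opponent_mix):
    """Expected payoff of each agent against a mixture of opponents."""
    n = len(payoff)
    return [sum(payoff[i][j] * opponent_mix[j] for j in range(n)) for i in range(n)]


# Hypothetical toy data: rock, a redundant copy of rock, paper, scissors.
# Entry [i][j] is +1 if agent i beats agent j, -1 if it loses, 0 for a draw.
payoff = [
    [0, 0, -1, 1],
    [0, 0, -1, 1],
    [1, 1, 0, -1],
    [-1, -1, 1, 0],
]

uniform_scores = scores(payoff, [0.25] * 4)  # plain averaging over opponents
nash_mix = approx_nash(payoff)
nash_scores = scores(payoff, nash_mix)  # averaging against the equilibrium mix

print("uniform:", uniform_scores)
print("nash   :", [round(s, 3) for s in nash_scores])
```

Under uniform averaging the duplicated rock makes paper look strong (0.25) and scissors look weak (-0.25), although the underlying game is still rock-paper-scissors. Against the approximate equilibrium mixture all four scores stay close to zero, because the two copies of rock share the weight a single rock would receive.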


Pages: 12
