DataPerf: Benchmarks for Data-Centric AI Development

Cited by: 0
Authors
Mazumder, Mark [1]
Banbury, Colby [1]
Yao, Xiaozhe [2]
Karlas, Bojan [2]
Rojas, William Gaviria [3]
Diamos, Sudnya [3]
Diamos, Greg [4]
He, Lynn [5]
Parrish, Alicia [8]
Kirk, Hannah Rose [16]
Quaye, Jessica [1]
Rastogi, Charvi [11]
Kiela, Douwe [9, 20]
Jurado, David [6, 19]
Kanter, David [6]
Mosquera, Rafael [6, 19]
Ciro, Juan [6, 19]
Aroyo, Lora [8]
Acun, Bilge [7]
Chen, Lingjiao [9]
Raje, Mehul Smriti [3]
Bartolo, Max [15, 18]
Eyuboglu, Sabri [9]
Ghorbani, Amirata [9]
Goodman, Emmett [9]
Inel, Oana [17]
Kane, Tariq [3, 8]
Kirkpatrick, Christine R. [10]
Kuo, Tzu-Sheng [11]
Mueller, Jonas [12]
Thrush, Tristan [9]
Vanschoren, Joaquin [13]
Warren, Margaret [14]
Williams, Adina [7]
Yeung, Serena [9]
Ardalani, Newsha [7]
Paritosh, Praveen [6]
Zhang, Ce [2]
Zou, James [9]
Wu, Carole-Jean [7]
Coleman, Cody [3]
Ng, Andrew [4, 5, 9]
Mattson, Peter [8]
Reddi, Vijay Janapa [1]
Affiliations
[1] Harvard Univ, Cambridge, MA 02138 USA
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] Coactive AI, San Jose, CA USA
[4] Landing AI, Palo Alto, CA USA
[5] DeepLearning AI, Palo Alto, CA USA
[6] MLCommons, San Francisco, CA USA
[7] Meta, Menlo Pk, CA USA
[8] Google, Mountain View, CA 94043 USA
[9] Stanford Univ, Stanford, CA 94305 USA
[10] Univ Calif San Diego, San Diego Supercomp Ctr, La Jolla, CA USA
[11] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[12] Cleanlab, San Francisco, CA USA
[13] Eindhoven Univ Technol, Eindhoven, Netherlands
[14] Inst Human & Machine Cognit, Pensacola, FL USA
[15] Cohere, Toronto, ON, Canada
[16] Univ Oxford, Oxford, England
[17] Univ Zurich, Zurich, Switzerland
[18] UCL, London, England
[19] Factored, Palo Alto, CA USA
[20] Contextual AI, Mountain View, CA USA
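Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract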
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
Pages: 28
Related papers
50 in total
  • [1] Data-Centric AI
    Malerba, Donato
    Pasquadibisceglie, Vincenzo
    [J]. JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2024,
  • [2] The Principles of Data-Centric AI
    Jarrahi, Mohammad Hossein
    Memariani, Ali
    Guha, Shion
    [J]. COMMUNICATIONS OF THE ACM, 2023, 66 (08) : 84 - 92
  • [3] Data-centric AI: Perspectives and Challenges
    Zha, Daochen
    Bhat, Zaid Pervaiz
    Lai, Kwei-Herng
    Yang, Fan
    Hu, Xia
    [J]. PROCEEDINGS OF THE 2023 SIAM INTERNATIONAL CONFERENCE ON DATA MINING, SDM, 2023, : 945 - 948
  • [4] Opportunities and Challenges in Data-Centric AI
    Kumar, Sushant
    Datta, Sumit
    Singh, Vishakha
    Singh, Sanjay Kumar
    Sharma, Ritesh
    [J]. IEEE ACCESS, 2024, 12 : 33173 - 33189
  • [5] From Concept to Implementation: The Data-Centric Development Process for AI in Industry
    Luley, Paul-Philipp
    Deriu, Jan M.
    Yan, Peng
    Schatte, Gerrit A.
    Stadelmann, Thilo
    [J]. 2023 10TH IEEE SWISS CONFERENCE ON DATA SCIENCE, SDS, 2023, : 73 - 76
  • [6] dcbench: A Benchmark for Data-Centric AI Systems
    Eyuboglu, Sabri
    Karlas, Bojan
    Re, Christopher
    Zhang, Ce
    Zou, James
    [J]. PROCEEDINGS OF THE 6TH WORKSHOP ON DATA MANAGEMENT FOR END-TO-END MACHINE LEARNING, DEEM 2022, 2022,
  • [7] Potential Impact of Data-Centric AI on Society
    Kumar, Sushant
    Sharma, Ritesh
    Singh, Vishakha
    Tiwari, Shrikant
    Singh, Sanjay Kumar
    Datta, Sumit
    [J]. IEEE TECHNOLOGY AND SOCIETY MAGAZINE, 2023, 42 (03) : 98 - 107
  • [8] Data-Centric AI for Healthcare Fraud Detection
    Johnson J.M.
    Khoshgoftaar T.M.
    [J]. SN Computer Science, 4 (4)
  • [9] Data-centric AI: Techniques and Future Perspectives
    Zha, Daochen
    Lai, Kwei-Herng
    Yang, Fan
    Zou, Na
    Gao, Huiji
    Hu, Xia
    [J]. PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023, 2023, : 5839 - 5840
  • [10] GitWorkflow for Active Learning: A Development Methodology Proposal for Data-Centric AI Projects
    Stieler, Fabian
    Bauer, Bernhard
    [J]. PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON EVALUATION OF NOVEL APPROACHES TO SOFTWARE ENGINEERING, ENASE 2023, 2023, : 202 - 213