Understanding the Dataset Practitioners Behind Large Language Models

Authors
Qian, Crystal [1]
Reif, Emily [2]
Kahng, Minsuk [3]
Affiliations
[1] Google Research, New York, NY 10011 USA
[2] Google Research, Seattle, WA USA
[3] Google Research, Atlanta, GA USA
DOI
10.1145/3613905.3651007
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
As large language models (LLMs) become more advanced and impactful, it is increasingly important to scrutinize the data that they rely upon and produce. What is it to be a dataset practitioner doing this work? We approach this in two parts: first, we define the role of "dataset practitioners" by performing a retrospective analysis on the responsibilities of teams contributing to LLM development at a technology company, Google. Then, we conduct semi-structured interviews with a cross-section of these practitioners (N=10). We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it. Consequently, practitioners either rely on their own intuition or write custom code to evaluate their data. We discuss potential reasons for this phenomenon and opportunities for alignment.
Pages: 7