Natural Language to Code Generation in Interactive Data Science Notebooks

Citations: 0
Authors
Yin, Pengcheng [1]
Li, Wen-Ding [1]
Xiao, Kefan [1]
Rao, Abhishek [1]
Wen, Yeming [1]
Shi, Kensen [1]
Howland, Joshua [1]
Bailey, Paige [1]
Catasta, Michele [1]
Michalewski, Henryk [1]
Polozov, Alex [1]
Sutton, Charles [1]
Affiliation
[1] Google Inc, Mountain View, CA 94043 USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1,078 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PACHINCO, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanations, showing the potential to improve the diversity and explainability of model predictions. ARCADE is publicly available at https://github.com/google-research/arcade-nl2code/.
Pages: 126 - 173
Page count: 48