Natural Language to Code Generation in Interactive Data Science Notebooks

Authors
Yin, Pengcheng [1]
Li, Wen-Ding [1]
Xiao, Kefan [1]
Rao, Abhishek [1]
Wen, Yeming [1]
Shi, Kensen [1]
Howland, Joshua [1]
Bailey, Paige [1]
Catasta, Michele [1]
Michalewski, Henryk [1]
Polozov, Alex [1]
Sutton, Charles [1]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1,078 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PACHINCO, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanations, showing the potential to improve the diversity and explainability of model predictions. ARCADE is publicly available at https://github.com/google-research/arcade-nl2code/.
Pages: 126-173
Page count: 48