Natural Language to Code Generation in Interactive Data Science Notebooks

Cited by: 0
Authors
Yin, Pengcheng [1]
Li, Wen-Ding [1]
Xiao, Kefan [1]
Rao, Abhishek [1]
Wen, Yeming [1]
Shi, Kensen [1]
Howland, Joshua [1]
Bailey, Paige [1]
Catasta, Michele [1]
Michalewski, Henryk [1]
Polozov, Alex [1]
Sutton, Charles [1]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1,078 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PACHINCO, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanations, showing the potential to improve the diversity and explainability of model predictions. ARCADE is publicly available at https://github.com/google-research/arcade-nl2code/.
Pages: 126 - 173
Page count: 48
Related papers
50 in total
  • [31] Agents for Data Science: From Raw Data to AI-Generated Notebooks using LLMs and Code Execution (Invited Talk)
    Cai, Jiahao
    PROCEEDINGS OF THE 1ST ACM INTERNATIONAL CONFERENCE ON AI-POWERED SOFTWARE, AIWARE 2024, 2024, : 181 - 181
  • [32] Computational Intelligence Model for Code Generation from Natural Language Problem Statement
    Kulkarni, A. B.
    Karandikar, S. S.
    Bamhore, P. A.
    Gawade, S. R.
    Medhane, D. V.
    2018 FOURTH INTERNATIONAL CONFERENCE ON COMPUTING COMMUNICATION CONTROL AND AUTOMATION (ICCUBEA), 2018,
  • [33] Documentation Matters: Human-Centered AI System to Assist Data Science Code Documentation in Computational Notebooks
    Wang, April Yi
    Wang, Dakuo
    Drozdal, Jaimie
    Muller, Michael
    Park, Soya
    Weisz, Justin D.
    Liu, Xuye
    Wu, Lingfei
    Dugan, Casey
    ACM TRANSACTIONS ON COMPUTER-HUMAN INTERACTION, 2022, 29 (05)
  • [34] Documentation Matters: Human-Centered AI System to Assist Data Science Code Documentation in Computational Notebooks
    Wang, April Yi
    Wang, Dakuo
    Drozdal, Jaimie
    Muller, Michael
    Park, Soya
    Weisz, Justin D.
    Liu, Xuye
    Wu, Lingfei
    Dugan, Casey
    ACM TRANSACTIONS ON COMPUTER-HUMAN INTERACTION, 2022, 29 (02)
  • [35] SemFORMS: Automatic Generation of Semantic Transforms By Mining Data Science Code
    Abdelaziz, Ibrahim
    Dolby, Julian
    Khurana, Udayan
    Samulowitz, Horst
    Srinivas, Kavitha
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 7106 - 7109
  • [36] Language, Thinking, Code: Interactive Essays with Twine
    Tirto, Darren
    Hamme, Alexander
    O'Hara, Keith J.
    Anderson, Sven
    SIGCSE'18: PROCEEDINGS OF THE 49TH ACM TECHNICAL SYMPOSIUM ON COMPUTER SCIENCE EDUCATION, 2018, : 1078 - 1078
  • [37] Interactive Earth system data cube visualization in Jupyter notebooks
    Sochting, Maximilian
    Scheuermann, Gerik
    Montero, David
    Mahecha, Miguel D.
    BIG EARTH DATA, 2025,
  • [38] A Natural Language Interface for Dissemination of Reproducible Biomedical Data Science
    John, Rogers Jeffrey Leo
    Patel, Jignesh M.
    Alexander, Andrew L.
    Singh, Vikas
    Adluru, Nagesh
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2018, PT IV, 2018, 11073 : 197 - 205
  • [39] Exploiting Inactive Examples for Natural Language Generation With Data Rejuvenation
    Jiao, Wenxiang
    Wang, Xing
    He, Shilin
    Tu, Zhaopeng
    King, Irwin
    Lyu, Michael R.
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 931 - 943
  • [40] Natural Language Processing Model Compiling Natural Language into Byte Code
    Trifan, Alexandru
    Anghelus, Marilena
    Constantinescu, Rodica
    2017 INTERNATIONAL CONFERENCE ON SPEECH TECHNOLOGY AND HUMAN-COMPUTER DIALOGUE (SPED), 2017,