Unified Pre-training for Program Understanding and Generation

Cited by: 0
Authors
Ahmad, Wasi Uddin [1 ]
Chakraborty, Saikat [2 ]
Ray, Baishakhi [2 ]
Chang, Kai-Wei [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA USA
[2] Columbia Univ, New York, NY 10027 USA
Funding
U.S. National Science Foundation (NSF)
Keywords
(none listed)
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation facilitates the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in English, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART's effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming conventions), and logical flow (e.g., an if block inside an else block is equivalent to an else if block), all of which are crucial to program semantics, and thus excels even with limited annotations.
Pages: 2655-2668
Page count: 14
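
The denoising-autoencoding objective mentioned in the abstract is easy to illustrate: the input sequence is corrupted with a noising strategy (the paper uses token masking, token deletion, and token infilling), and the model is trained to reconstruct the original sequence. Below is a minimal, self-contained Python sketch of those corruptions under stated assumptions: the 35% corruption rate follows the paper, but the mask symbol, the span-length distribution, and the function names here are illustrative, not the authors' implementation.

    import random

    MASK = "<mask>"  # placeholder mask symbol; the real tokenizer defines its own

    def token_masking(tokens, p=0.35):
        """Replace each token with the mask symbol with probability p."""
        return [MASK if random.random() < p else t for t in tokens]

    def token_deletion(tokens, p=0.35):
        """Delete each token with probability p; unlike masking, the
        decoder must also recover WHERE tokens are missing."""
        return [t for t in tokens if random.random() >= p]

    def token_infilling(tokens, p=0.35, max_span=5):
        """Replace whole spans with a single mask symbol (BART-style
        infilling). Span starts fire with probability scaled so that
        roughly a fraction p of tokens is masked on average."""
        mean_span = (1 + max_span) / 2
        out, i = [], 0
        while i < len(tokens):
            if random.random() < p / mean_span:
                out.append(MASK)
                i += random.randint(1, max_span)  # consume the whole span
            else:
                out.append(tokens[i])
                i += 1
        return out

    if __name__ == "__main__":
        src = "def add ( a , b ) : return a + b".split()
        random.seed(0)
        print(token_masking(src))
        print(token_deletion(src))
        print(token_infilling(src))

Running the script on a short tokenized function shows how each strategy degrades the input differently: masking hides token identities at known positions, deletion forces the model to infer positions as well, and infilling additionally forces it to predict span lengths.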