CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Cited by: 0
Authors
Wang, Yue [1 ]
Le, Hung [1 ]
Gotmare, Akhilesh Deepak [1 ]
Bui, Nghi D. Q. [1 ]
Li, Junnan [1 ]
Hoi, Steven C. H. [1 ]
Affiliations
[1] Salesforce AI Res, San Francisco, CA 94105 USA
Keywords: (none listed)
DOI: (not available)
CLC Number: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks, lacking the flexibility to operate in the optimal architecture for a specific task. Second, they often employ a limited set of pretraining objectives that might not be relevant to some tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives, which cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ from frozen off-the-shelf LLMs, rather than training from scratch, to efficiently scale up our models, and we explore instruction tuning to align the models with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) performance on various code-related tasks, and our instruction-tuned CodeT5+ 16B achieves new SoTA results of 35.0% pass@1 and 54.5% pass@10 on the HumanEval code generation task against other open code LLMs, even surpassing the OpenAI code-cushman-001 model.
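The abstract lists span denoising among the mixture of pretraining objectives. As a rough illustration of that objective, the following is a minimal, self-contained sketch of T5-style span corruption applied to code; the whitespace tokenizer, sentinel-token format, span count, and span length below are simplifying assumptions for illustration, not the paper's exact masking recipe.

```python
# Minimal sketch of T5-style span denoising on source code (assumed setup,
# not the paper's exact recipe): random non-overlapping spans are replaced
# with <extra_id_N> sentinels; the decoder learns to emit the masked spans.
import random

def span_denoise(tokens, n_spans=2, span_len=2, seed=0):
    """Return (input_tokens, target_tokens) for one denoising example."""
    rng = random.Random(seed)
    # Candidate span start positions, shuffled for random selection.
    candidates = list(range(len(tokens) - span_len + 1))
    rng.shuffle(candidates)
    starts = []
    for c in candidates:
        # Keep only starts that do not overlap an already chosen span.
        if all(abs(c - s) >= span_len for s in starts):
            starts.append(c)
        if len(starts) == n_spans:
            break
    starts.sort()

    inputs, targets = [], []
    i = 0
    for sid, s in enumerate(starts):
        inputs.extend(tokens[i:s])           # keep text before the span
        inputs.append(f"<extra_id_{sid}>")   # sentinel replaces the span
        targets.append(f"<extra_id_{sid}>")  # target pairs sentinel ...
        targets.extend(tokens[s:s + span_len])  # ... with the masked span
        i = s + span_len
    inputs.extend(tokens[i:])
    targets.append(f"<extra_id_{len(starts)}>")  # closing sentinel, per T5
    return inputs, targets

code = "def add ( a , b ) : return a + b".split()
src, tgt = span_denoise(code)
print("input: ", " ".join(src))
print("target:", " ".join(tgt))
```

The encoder sees the corrupted input and the decoder reconstructs only the masked spans, which is what makes this objective cheap per example compared with full-sequence causal LM pretraining, the other generative objective the abstract mentions.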
Pages: 1069-1088
Page count: 20