CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Cited by: 0
Authors
Wang, Yue [1 ]
Le, Hung [1 ]
Gotmare, Akhilesh Deepak [1 ]
Bui, Nghi D. Q. [1 ]
Li, Junnan [1 ]
Hoi, Steven C. H. [1 ]
Affiliations
[1] Salesforce AI Res, San Francisco, CA 94105 USA
Keywords: (none listed)
DOI: (not available)
CLC Number: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks, lacking the flexibility to operate in the optimal architecture for a specific task. Second, they often employ a limited set of pretraining objectives that might not be relevant to some tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives, which cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ from frozen off-the-shelf LLMs, rather than training from scratch, to efficiently scale up our models, and we explore instruction tuning to align the models with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) performance on various code-related tasks, and our instruction-tuned CodeT5+ 16B achieves new SoTA results of 35.0% pass@1 and 54.5% pass@10 on the HumanEval code generation task against other open code LLMs, even surpassing the OpenAI code-cushman-001 model.
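The abstract lists span denoising among the mixture of pretraining objectives. As a rough illustration of that objective, the following is a minimal, self-contained sketch of T5-style span corruption applied to code; the whitespace tokenizer, sentinel-token format, span count, and span length below are simplifying assumptions for illustration, not the paper's exact masking recipe.

```python
# Minimal sketch of T5-style span denoising on source code (assumed setup,
# not the paper's exact recipe): random non-overlapping spans are replaced
# with <extra_id_N> sentinels; the decoder learns to emit the masked spans.
import random

def span_denoise(tokens, n_spans=2, span_len=2, seed=0):
    """Return (input_tokens, target_tokens) for one denoising example."""
    rng = random.Random(seed)
    # Candidate span start positions, shuffled for random selection.
    candidates = list(range(len(tokens) - span_len + 1))
    rng.shuffle(candidates)
    starts = []
    for c in candidates:
        # Keep only starts that do not overlap an already chosen span.
        if all(abs(c - s) >= span_len for s in starts):
            starts.append(c)
        if len(starts) == n_spans:
            break
    starts.sort()

    inputs, targets = [], []
    i = 0
    for sid, s in enumerate(starts):
        inputs.extend(tokens[i:s])           # keep text before the span
        inputs.append(f"<extra_id_{sid}>")   # sentinel replaces the span
        targets.append(f"<extra_id_{sid}>")  # target pairs sentinel ...
        targets.extend(tokens[s:s + span_len])  # ... with the masked span
        i = s + span_len
    inputs.extend(tokens[i:])
    targets.append(f"<extra_id_{len(starts)}>")  # closing sentinel, per T5
    return inputs, targets

code = "def add ( a , b ) : return a + b".split()
src, tgt = span_denoise(code)
print("input: ", " ".join(src))
print("target:", " ".join(tgt))
```

The encoder sees the corrupted input and the decoder reconstructs only the masked spans, which is what makes this objective cheap per example compared with full-sequence causal LM pretraining, the other generative objective the abstract mentions.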
Pages: 1069-1088
Page count: 20