The (ab)use of Open Source Code to Train Large Language Models

Cited by: 4
Authors: Al-Kaswan, Ali [1]; Izadi, Maliheh [1]
Affiliations: [1] Delft Univ Technol, Delft, Netherlands
DOI: 10.1109/NLBSE59153.2023.00008
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
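The abstract's central claim is that models emit memorized training data verbatim. A minimal sketch of how such emission can be detected: flag a model completion that reproduces a long run of consecutive tokens from a known source file. The function names and the threshold `k=6` are illustrative assumptions, not from the paper; real memorization audits use far longer thresholds over deduplicated corpora.

```python
def longest_verbatim_overlap(completion: str, source: str) -> int:
    """Length of the longest run of whitespace-delimited tokens that
    appears contiguously in both the completion and the source file."""
    a, b = completion.split(), source.split()
    best = 0
    # Classic longest-common-substring dynamic programming over tokens.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def looks_memorized(completion: str, source: str, k: int = 6) -> bool:
    """Flag a completion that reproduces >= k consecutive source tokens."""
    return longest_verbatim_overlap(completion, source) >= k
```

The quadratic DP is only meant to show the idea; at corpus scale, suffix-array or n-gram-index lookups serve the same purpose. A verbatim match against a copyleft-licensed file is exactly the licensing dilemma the paper discusses.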
Pages: 9-10
Page count: 2
Related papers (50 total)
  • [21] SGL: A domain-specific language for large-scale analysis of open-source code
    Foo, Darius
    Yi, Ang Ming
    Yeo, Jason
    Sharma, Asankhaya
    2018 IEEE CYBERSECURITY DEVELOPMENT CONFERENCE (SECDEV 2018), 2018, : 61 - 68
  • [22] Can large language models generate geospatial code?
    State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China (authors not listed)
    arXiv preprint
  • [23] Code Soliloquies for Accurate Calculations in Large Language Models
    Sonkar, Shashank
    Chen, Xinghe
    Le, MyCo
    Liu, Naiming
    Mallick, Debshila Basu
    Baraniuk, Richard G.
    FOURTEENTH INTERNATIONAL CONFERENCE ON LEARNING ANALYTICS & KNOWLEDGE, LAK 2024, 2024, : 828 - 835
  • [24] Analyzing Declarative Deployment Code with Large Language Models
    Lanciano, Giacomo
    Stein, Manuel
    Hilt, Volker
    Cucinotta, Tommaso
    PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, CLOSER 2023, 2023, : 289 - 296
  • [25] OCTOPACK: INSTRUCTION TUNING CODE LARGE LANGUAGE MODELS
    Muennighoff, Niklas
    Liu, Qian
    Zebaze, Armel
    Zheng, Qinkai
    Hui, Binyuan
    Zhuo, Terry Yue
    Singh, Swayam
    Tang, Xiangru
    von Werra, Leandro
    Longpre, Shayne
    arXiv, 2023,
  • [26] Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
    Karampatsis, Rafael-Michael
    Babii, Hlib
    Robbes, Romain
    Sutton, Charles
    Janes, Andrea
    2020 ACM/IEEE 42ND INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2020), 2020, : 1073 - 1085
  • [27] Large Language Models for Code Obfuscation: Evaluation of the Obfuscation Capabilities of OpenAI's GPT-3.5 on C Source Code
    Kochberger, Patrick
    Gramberger, Maximilian
    Schrittwieser, Sebastian
    Lawitschka, Caroline
    Weippl, Edgar R.
    PROCEEDINGS OF THE 20TH INTERNATIONAL CONFERENCE ON SECURITY AND CRYPTOGRAPHY, SECRYPT 2023, 2023, : 7 - 19
  • [28] Iterative Refactoring of Real-World Open-Source Programs with Large Language Models
    Choi, Jinsu
    An, Gabin
    Yoo, Shin
    SEARCH-BASED SOFTWARE ENGINEERING, SSBSE 2024, 2024, 14767 : 49 - 55
  • [29] Closing the gap between open source and commercial large language models for medical evidence summarization
    Zhang, Gongbo
    Jin, Qiao
    Zhou, Yiliang
    Wang, Song
    Idnay, Betina
    Luo, Yiming
    Park, Elizabeth
    Nestor, Jordan G.
    Spotnitz, Matthew E.
    Soroush, Ali
    Campion Jr, Thomas R.
    Lu, Zhiyong
    Weng, Chunhua
    Peng, Yifan
    NPJ DIGITAL MEDICINE, 2024, 7 (01):
  • [30] Evaluation of Open-Source Large Language Models for Metal-Organic Frameworks Research
    Bai, Xuefeng
    Xie, Yabo
    Zhang, Xin
    Han, Honggui
    Li, Jian-Rong
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2024, 64 (13) : 4958 - 4965