The (ab)use of Open Source Code to Train Large Language Models

Cited by: 4

Authors:

Al-Kaswan, Ali [1]

Izadi, Maliheh [1]

Affiliation:

[1] Delft Univ Technol, Delft, Netherlands

Source:

2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE) | 2023

DOI:

10.1109/NLBSE59153.2023.00008

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.

Pages: 9-10

Page count: 2
