The (ab)use of Open Source Code to Train Large Language Models

Cited by: 4
Authors
Al-Kaswan, Ali [1]
Izadi, Maliheh [1]
Affiliations
[1] Delft Univ Technol, Delft, Netherlands
DOI
10.1109/NLBSE59153.2023.00008
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
Pages: 9-10
Page count: 2