The (ab)use of Open Source Code to Train Large Language Models

Cited by: 4
Authors: Al-Kaswan, Ali [1]; Izadi, Maliheh [1]
Affiliations: [1] Delft Univ Technol, Delft, Netherlands
DOI: 10.1109/NLBSE59153.2023.00008
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
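The abstract's central claim is that models emit memorized training data verbatim. A minimal sketch of how such emission can be detected: flag a model completion that reproduces a long run of consecutive tokens from a known source file. The function names and the threshold `k=6` are illustrative assumptions, not from the paper; real memorization audits use far longer thresholds over deduplicated corpora.

```python
def longest_verbatim_overlap(completion: str, source: str) -> int:
    """Length of the longest run of whitespace-delimited tokens that
    appears contiguously in both the completion and the source file."""
    a, b = completion.split(), source.split()
    best = 0
    # Classic longest-common-substring dynamic programming over tokens.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def looks_memorized(completion: str, source: str, k: int = 6) -> bool:
    """Flag a completion that reproduces >= k consecutive source tokens."""
    return longest_verbatim_overlap(completion, source) >= k
```

The quadratic DP is only meant to show the idea; at corpus scale, suffix-array or n-gram-index lookups serve the same purpose. A verbatim match against a copyleft-licensed file is exactly the licensing dilemma the paper discusses.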
Pages: 9-10
Page count: 2
Related papers (50 total)
  • [21] SGL: A domain-specific language for large-scale analysis of open-source code
    Foo, Darius
    Yi, Ang Ming
    Yeo, Jason
    Sharma, Asankhaya
    2018 IEEE CYBERSECURITY DEVELOPMENT CONFERENCE (SECDEV 2018), 2018, : 61 - 68
  • [22] Can large language models generate geospatial code?
    State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China (authors not listed)
    arXiv preprint
  • [23] Code Soliloquies for Accurate Calculations in Large Language Models
    Sonkar, Shashank
    Chen, Xinghe
    Le, MyCo
    Liu, Naiming
    Mallick, Debshila Basu
    Baraniuk, Richard G.
    FOURTEENTH INTERNATIONAL CONFERENCE ON LEARNING ANALYTICS & KNOWLEDGE, LAK 2024, 2024, : 828 - 835
  • [24] Analyzing Declarative Deployment Code with Large Language Models
    Lanciano, Giacomo
    Stein, Manuel
    Hilt, Volker
    Cucinotta, Tommaso
    PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, CLOSER 2023, 2023, : 289 - 296
  • [25] OCTOPACK: INSTRUCTION TUNING CODE LARGE LANGUAGE MODELS
    Muennighoff, Niklas
    Liu, Qian
    Zebaze, Armel
    Zheng, Qinkai
    Hui, Binyuan
    Zhuo, Terry Yue
    Singh, Swayam
    Tang, Xiangru
    von Werra, Leandro
    Longpre, Shayne
    arXiv, 2023,
  • [26] Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
    Karampatsis, Rafael-Michael
    Babii, Hlib
    Robbes, Romain
    Sutton, Charles
    Janes, Andrea
    2020 ACM/IEEE 42ND INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2020), 2020, : 1073 - 1085
  • [27] Large Language Models for Code Obfuscation: Evaluation of the Obfuscation Capabilities of OpenAI's GPT-3.5 on C Source Code
    Kochberger, Patrick
    Gramberger, Maximilian
    Schrittwieser, Sebastian
    Lawitschka, Caroline
    Weippl, Edgar R.
    PROCEEDINGS OF THE 20TH INTERNATIONAL CONFERENCE ON SECURITY AND CRYPTOGRAPHY, SECRYPT 2023, 2023, : 7 - 19
  • [28] Iterative Refactoring of Real-World Open-Source Programs with Large Language Models
    Choi, Jinsu
    An, Gabin
    Yoo, Shin
    SEARCH-BASED SOFTWARE ENGINEERING, SSBSE 2024, 2024, 14767 : 49 - 55
  • [29] Closing the gap between open source and commercial large language models for medical evidence summarization
    Zhang, Gongbo
    Jin, Qiao
    Zhou, Yiliang
    Wang, Song
    Idnay, Betina
    Luo, Yiming
    Park, Elizabeth
    Nestor, Jordan G.
    Spotnitz, Matthew E.
    Soroush, Ali
    Campion Jr, Thomas R.
    Lu, Zhiyong
    Weng, Chunhua
    Peng, Yifan
    NPJ DIGITAL MEDICINE, 2024, 7 (01):
  • [30] Evaluation of Open-Source Large Language Models for Metal-Organic Frameworks Research
    Bai, Xuefeng
    Xie, Yabo
    Zhang, Xin
    Han, Honggui
    Li, Jian-Rong
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2024, 64 (13) : 4958 - 4965