How to Protect Copyright Data in Optimization of Large Language Models?

Cited by: 0
Authors
Chu, Timothy [1]
Song, Zhao [2]
Yang, Chiwun [3]
Affiliations
[1] Google, Mountain View, CA 94043, USA
[2] Adobe Research, San Jose, CA, USA
[3] Sun Yat-sen University, Guangzhou, People's Republic of China
Keywords
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Large language models (LLMs) and generative AI have played a transformative role in computer research and applications. Controversy has arisen over whether these models output copyrighted data, which can occur when the data they are trained on is itself copyrighted. LLMs are built on the transformer neural network architecture, which in turn relies on a mathematical computation called Attention that uses the softmax function. In this paper, we observe that large language model training and optimization can be viewed as a softmax regression problem. We then give a method for performing softmax regression efficiently, in a way that prevents the regression function from generating copyrighted data. This yields a theoretical approach to training large language models that avoids generating copyrighted data.
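The abstract's central reduction (training viewed as softmax regression) can be made concrete with a small numerical sketch. The snippet below assumes the standard softmax regression objective min_x ||softmax(Ax) - b||^2; the copyright-avoidance behavior is modeled as an illustrative repulsion penalty away from a protected target distribution b_protected. That penalty, the name b_protected, and the weight gamma are assumptions for exposition only, not the authors' exact formulation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: shift by the max before exponentiating.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def loss(x, A, b, b_protected, gamma):
    # ||softmax(Ax) - b||^2 pulls the output toward the target distribution b.
    # The subtracted term (an illustrative assumption, not the paper's exact
    # formulation) pushes the output away from a protected distribution.
    p = softmax(A @ x)
    return np.sum((p - b) ** 2) - gamma * np.sum((p - b_protected) ** 2)

def numgrad(f, x, eps=1e-6):
    # Central finite differences; adequate for this low-dimensional demo.
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n, d = 8, 4
A = rng.normal(size=(n, d))                # design matrix
b = softmax(rng.normal(size=n))            # desired output distribution
b_protected = softmax(rng.normal(size=n))  # stand-in for a copyrighted output
x = np.zeros(d)
for _ in range(500):                       # plain gradient descent
    x -= 0.1 * numgrad(lambda v: loss(v, A, b, b_protected, 0.2), x)

print("distance to target:   ", np.linalg.norm(softmax(A @ x) - b))
print("distance to protected:", np.linalg.norm(softmax(A @ x) - b_protected))
```

The finite-difference gradient keeps the sketch dependency-free; a real implementation would use automatic differentiation and the specific constraints and guarantees developed in the paper.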
Pages: 17871-17879
Page count: 9
Related papers
(50 in total)
  • [31] How to harness natural language processing tools and large language models for psychological research
    Fischer, Ronald
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2024, 59 : 19 - 20
  • [32] Large Language Models for Tabular Data: Progresses and Future Directions
    Dong, Haoyu
    Wang, Zhiruo
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 2997 - 3000
  • [33] Incorporating Citizen-Generated Data into Large Language Models
    Vadapalli, Jagadeesh
    Gupta, Srishti
    Karki, Bishwa
    Tsai, Chun-Hua
    PROCEEDINGS OF THE 25TH ANNUAL INTERNATIONAL CONFERENCE ON DIGITAL GOVERNMENT RESEARCH, DGO 2024, 2024, : 1023 - 1025
  • [34] IterClean: An Iterative Data Cleaning Framework with Large Language Models
    Ni, Wei
    Zhang, Kaihang
    Miao, Xiaoye
    Zhao, Xiangyu
    Wu, Yangyang
    Yin, Jianwei
    PROCEEDINGS OF THE ACM TURING AWARD CELEBRATION CONFERENCE-CHINA 2024, ACM-TURC 2024, 2024, : 100 - 105
  • [35] Data science opportunities of large language models for neuroscience and biomedicine
    Bzdok, Danilo
    Thieme, Andrew
    Levkovskyy, Oleksiy
    Wren, Paul
    Ray, Thomas
    Reddy, Siva
    NEURON, 2024, 112 (05) : 698 - 717
  • [36] Large language models and synthetic health data: progress and prospects
    Smolyak, Daniel
    Bjarnadottir, Margret V.
    Crowley, Kenyon
    Agarwal, Ritu
    JAMIA OPEN, 2024, 7 (04)
  • [37] Bridging the data gap between children and large language models
    Frank, Michael C.
    TRENDS IN COGNITIVE SCIENCES, 2023, 27 (11) : 990 - 992
  • [38] QueryMintAI: Multipurpose Multimodal Large Language Models for Personal Data
    Ghosh, Ananya
    Deepa, K.
    IEEE ACCESS, 2024, 12 : 144631 - 144651
  • [39] A Method for Efficient Structured Data Generation with Large Language Models
    Hou, Zongzhi
    Zhao, Ruohan
    Li, Zhongyang
    Wang, Zheng
    Wu, Yizhen
    Gou, Junwei
    Zhu, Zhifeng
    PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM³A 2024, 2024, : 36 - 44
  • [40] Using Large Language Models to Enhance the Reusability of Sensor Data
    Berenguer, Alberto
    Morejon, Adriana
    Tomas, David
    Mazon, Jose-Norberto
    SENSORS, 2024, 24 (02)