A Method for Efficient Structured Data Generation with Large Language Models

Cited by: 0
Authors
Hou, Zongzhi [1]
Zhao, Ruohan [1]
Li, Zhongyang [1]
Wang, Zheng [1]
Wu, Yizhen [1]
Gou, Junwei [1]
Zhu, Zhifeng [1]
Affiliations
[1] Huawei, Shanghai, People's Republic of China
Keywords
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model
DOI
10.1145/3688866.3689127
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the rapid advancement of large language models, the data used to train them has become increasingly important. Text samples produced by large unsupervised models are often of inadequate quality, which leads to unsatisfactory results. This inadequacy arises from the model's limited capacity to capture the underlying structure of the data without direct supervision, so its outputs may lack fidelity and relevance to the true data distribution. To address these shortcomings in training-data generation for specific language generation tasks, this paper proposes the Fast Data Generation System (FDGS), which handles multi-modal and structured data generation. FDGS uses clustering abstraction to handle multiple input data types through templates, which allows fast data generation and reduces resource consumption. FDGS is robust, delivering stable and reliable performance under various conditions, and it uses fewer tokens than traditional methods that work on a per-instance basis and do not use templates. By abstracting and clustering different input types, FDGS can efficiently generate data from large models. The system is highly adaptable, making it well suited to multi-modal data generation tasks. It relies on the basic functions of general-purpose large language models and employs a bidirectional query-answer generation mechanism to achieve fast data amplification.
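The token saving claimed for template-based generation can be illustrated with a minimal sketch. It is not the FDGS implementation: the clustering key (`signature`, which groups records by their field names), the stubbed model call (`stub_llm_template`), and the sample records are all hypothetical, and the bidirectional query-answer step is not modeled. The sketch only shows why one model call per cluster, followed by local template filling, needs fewer calls than one call per instance.

```python
from collections import defaultdict
from string import Template


def signature(record):
    """Hypothetical clustering key: the sorted field names of a record."""
    return tuple(sorted(record))


def cluster_records(records):
    """Group records that share the same field structure."""
    clusters = defaultdict(list)
    for record in records:
        clusters[signature(record)].append(record)
    return dict(clusters)


def stub_llm_template(fields):
    """Stand-in for the single model call that would write one template per cluster."""
    body = ", ".join(f"{name} is ${name}" for name in fields)
    return Template(f"Q: Describe the item. A: {body}.")


def generate(records):
    """Return the generated samples and the number of (stubbed) model calls."""
    clusters = cluster_records(records)
    samples = []
    for fields, members in clusters.items():
        template = stub_llm_template(fields)  # one model call per cluster
        for record in members:
            # Filling the template is local string substitution, no model call.
            samples.append(template.substitute(record))
    return samples, len(clusters)


# Assumed toy inputs: two records share a structure, one differs.
records = [
    {"name": "lamp", "price": 12},
    {"name": "desk", "price": 80},
    {"city": "Shanghai", "population": 24000000},
]
samples, model_calls = generate(records)
```

With these three records, a per-instance approach would issue three model calls, while the sketch issues two, one for each distinct structure. The gap widens as more records share a template.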
Pages: 36-44
Number of pages: 9