KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

Citations: 0
Authors
Li, Zixuan [1 ]
Zeng, Yutao [1 ]
Zuo, Yuxin [1 ]
Ren, Weicheng [1 ]
Liu, Wenxuan [1 ]
Su, Miao [1 ]
Guo, Yucan [1 ]
Liu, Yantao [1 ]
Li, Xiang [1 ]
Hu, Zhilei [1 ]
Bai, Long [1 ]
Li, Wei [1 ]
Liu, Yidan [1 ]
Yang, Pan [1 ]
Jin, Xiaolong [1 ]
Guo, Jiafeng [1 ]
Cheng, Xueqi [1 ]
Affiliations
[1] Chinese Acad Sci, Key Lab Network Data Sci & Technol, Inst Comp Technol, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | 2024
Funding
National Natural Science Foundation of China;
Keywords
CORPUS;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we propose KnowCoder, a Large Language Model (LLM) that conducts Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over 30,000 types of knowledge, which is, to the best of our knowledge, the largest one for UIE. To ease the learning process of LLMs, KnowCoder contains a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around 1.5B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves relative improvements of 49.8% F1 over LLaMA2 under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas and achieves improvements of up to 12.5% and 21.9% over state-of-the-art baselines under the zero-shot setting and the low-resource setting, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can simultaneously be utilized to refine KnowCoder, which achieves significant improvements of up to 7.5% under the supervised setting.
Pages: 8758-8779
Page count: 22
Related Papers
50 records in total
  • [21] On coding with a partial knowledge of the state information
    Zaidi, Abdellatif
    Duhamel, Pierre
    2005 39th Asilomar Conference on Signals, Systems and Computers, Vols 1 and 2, 2005, : 657 - 661
  • [22] Knowledge Extraction for Information Retrieval
    Corcoglioniti, Francesco
    Dragoni, Mauro
    Rospocher, Marco
    Aprosio, Alessio Palmero
    SEMANTIC WEB: LATEST ADVANCES AND NEW DOMAINS, 2016, 9678 : 317 - 333
  • [23] Knowledge Extraction from Structured Engineering Drawings
    Lu, Tong
    Yang, Yubin
    Yang, Ruoyu
    Cai, Shijie
    FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2008, : 415 - 419
  • [24] Empowering LLMs for Long-Text Information Extraction in Chinese Legal Documents
    Shen, Chenchen
    Ji, Chengwei
    Yue, Shengbin
    Shen, Xiaoyu
    Song, Yun
    Huang, Xuanjing
    Wei, Zhongyu
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT I, NLPCC 2024, 2025, 15359 : 457 - 469
  • [25] Study on Product Information Coding in the Context of Universal Design
    Shan, Hongxiang
    Wang, Xingsong
    Tian, Mengqian
    Mao, Yuliang
    HUMAN SYSTEMS ENGINEERING AND DESIGN II, 2020, 1026 : 501 - 507
  • [26] Universal Source Coding for Multiple Decoders with Side Information
    Kuzuoka, Shigeaki
    Kimura, Akisato
    Uyematsu, Tomohiko
    2010 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY, 2010, : 1 - 5
  • [27] Coding FRBR-Structured Bibliographic Information in MARC
    Aalberg, Trond
    Mercun, Tanja
    Zumer, Maja
    DIGITAL LIBRARIES: FOR CULTURAL HERITAGE, KNOWLEDGE DISSEMINATION, AND FUTURE CREATION: ICADL 2011, 2011, 7008 : 128 - +
  • [28] Information extraction using the structured language model
    Chelba, C
    Mahajan, M
    PROCEEDINGS OF THE 2001 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, 2001, : 74 - 81
  • [29] Unified Structure Generation for Universal Information Extraction
    Lu, Yaojie
    Liu, Qing
    Dai, Dai
    Xiao, Xinyan
    Lin, Hongyu
    Han, Xianpei
    Sun, Le
    Wu, Hua
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 5755 - 5772
  • [30] Universal Information Extraction as Unified Semantic Matching
    Lou, Jie
    Lu, Yaojie
    Dai, Dai
    Jia, Wei
    Lin, Hongyu
    Han, Xianpei
    Sun, Le
    Wu, Hua
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 13318 - 13326