KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

Cited by: 0
Authors
Li, Zixuan [1]
Zeng, Yutao [1]
Zuo, Yuxin [1]
Ren, Weicheng [1]
Liu, Wenxuan [1]
Su, Miao [1]
Guo, Yucan [1]
Liu, Yantao [1]
Li, Xiang [1]
Hu, Zhilei [1]
Bai, Long [1]
Li, Wei [1]
Liu, Yidan [1]
Yang, Pan [1]
Jin, Xiaolong [1]
Guo, Jiafeng [1]
Cheng, Xueqi [1]
Affiliations
[1] Chinese Acad Sci, Key Lab Network Data Sci & Technol, Inst Comp Technol, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | 2024
Funding
National Natural Science Foundation of China;
Keywords
CORPUS;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we propose KnowCoder, a Large Language Model (LLM) that conducts Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation that LLMs can easily understand, together with an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, so that complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over 30,000 types of knowledge, which is, to the best of our knowledge, the largest one for UIE. To ease the learning process of LLMs, KnowCoder contains a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around 1.5B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves a relative improvement of 49.8% in F1 over LLaMA2 under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas and achieves improvements of up to 12.5% and 21.9% over state-of-the-art (SOTA) baselines under the zero-shot setting and the low-resource setting, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can be utilized simultaneously to refine KnowCoder, which yields significant improvements of up to 7.5% under the supervised setting.
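As a rough illustration of what a code-style schema representation might look like, the sketch below encodes two entity types and one relation type as Python classes and expresses an extraction result as object instantiation. It is a minimal sketch only: the class names (`Entity`, `Person`, `Organization`, `Relation`, `WorksFor`), the field names, and the example sentence are hypothetical and are not taken from the paper or its schema library, whose actual definitions are not reproduced in this record.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Base class for all entity types; `name` holds the extracted text span."""
    name: str


@dataclass
class Person(Entity):
    """A human being (hypothetical type definition)."""


@dataclass
class Organization(Entity):
    """A company, institution, or similar body (hypothetical type definition)."""


@dataclass
class Relation:
    """Base class for relation types linking a head entity to a tail entity."""
    head: Entity
    tail: Entity


@dataclass
class WorksFor(Relation):
    """Employment relation; the narrowed type hints state the argument constraint."""
    head: Person
    tail: Organization

    def __post_init__(self):
        # Enforce the schema constraint when an instance is constructed.
        if not isinstance(self.head, Person):
            raise TypeError("WorksFor.head must be a Person")
        if not isinstance(self.tail, Organization):
            raise TypeError("WorksFor.tail must be an Organization")


# Hypothetical extraction result for the sentence "Ada works at Acme."
results = [WorksFor(Person("Ada"), Organization("Acme"))]
```

In this sketch, class inheritance carries the type hierarchy, docstrings carry type descriptions, and type hints plus the `__post_init__` check carry a cross-task constraint (a relation restricting the entity types of its arguments), which is the kind of schema information the abstract says a class-based representation can capture.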
Pages: 8758-8779
Page count: 22