KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

Cited by: 0

Authors:

Li, Zixuan [1]
Zeng, Yutao [1]
Zuo, Yuxin [1]
Ren, Weicheng [1]
Liu, Wenxuan [1]
Su, Miao [1]
Guo, Yucan [1]
Liu, Yantao [1]
Li, Xiang [1]
Hu, Zhilei [1]
Bai, Long [1]
Li, Wei [1]
Liu, Yidan [1]
Yang, Pan [1]
Jin, Xiaolong [1]
Guo, Jiafeng [1]
Cheng, Xueqi [1]

Affiliations:

[1] Chinese Academy of Sciences, Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Beijing, People's Republic of China

Source:

PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | 2024

Funding:

National Natural Science Foundation of China;

Keywords:

CORPUS;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we propose KnowCoder, a Large Language Model (LLM) that conducts Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over 30,000 types of knowledge, which is, to the best of our knowledge, the largest one for UIE. To ease the learning process of LLMs, KnowCoder contains a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around 1.5B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves a relative improvement of 49.8% F1 over LLaMA2 under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas and achieves improvements of up to 12.5% and 21.9% over state-of-the-art (SOTA) baselines under the zero-shot setting and the low-resource setting, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can be used simultaneously to refine KnowCoder, which achieves significant improvements of up to 7.5% under the supervised setting.
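
The idea of expressing extraction schemas as Python classes can be illustrated with a minimal sketch. Everything named here (`Entity`, `Person`, `Organization`, `Relation`, `WorksFor`, `extract_example`) is hypothetical and chosen for illustration; it is not taken from the KnowCoder schema library, and the runtime type check is only one possible way to encode a constraint between a relation type and its argument entity types.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Base class for all entity types; `name` is the extracted text span."""
    name: str


class Person(Entity):
    """A human being (hypothetical concept description)."""


class Organization(Entity):
    """A company, institution, or similar group (hypothetical)."""


@dataclass
class Relation:
    """Base class for relation types linking a head and a tail entity."""
    head_entity: Entity
    tail_entity: Entity


@dataclass
class WorksFor(Relation):
    """Hypothetical relation type: a Person works for an Organization."""
    head_entity: Person
    tail_entity: Organization

    def __post_init__(self):
        # Assumed constraint: the annotations above are enforced at runtime,
        # so an ill-typed extraction fails instead of passing silently.
        if not isinstance(self.head_entity, Person):
            raise TypeError("WorksFor.head_entity must be a Person")
        if not isinstance(self.tail_entity, Organization):
            raise TypeError("WorksFor.tail_entity must be an Organization")


def extract_example():
    """Return what a model might emit for the sentence 'Alice works at Acme.'"""
    alice = Person("Alice")
    acme = Organization("Acme")
    return [alice, acme, WorksFor(alice, acme)]
```

Under this sketch, an extraction result is simply a list of instantiated objects, so the schema (class definitions) and the output (instantiation code) share one representation, and inheritance gives a natural place to express a type hierarchy.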

Pages: 8758-8779

Page count: 22
