KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

Citations: 0
Authors
Li, Zixuan [1 ]
Zeng, Yutao [1 ]
Zuo, Yuxin [1 ]
Ren, Weicheng [1 ]
Liu, Wenxuan [1 ]
Su, Miao [1 ]
Guo, Yucan [1 ]
Liu, Yantao [1 ]
Li, Xiang [1 ]
Hu, Zhilei [1 ]
Bai, Long [1 ]
Li, Wei [1 ]
Liu, Yidan [1 ]
Yang, Pan [1 ]
Jin, Xiaolong [1 ]
Guo, Jiafeng [1 ]
Cheng, Xueqi [1 ]
Affiliations
[1] Chinese Acad Sci, Key Lab Network Data Sci & Technol, Inst Comp Technol, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | 2024
Funding
National Natural Science Foundation of China;
Keywords
CORPUS;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we propose KnowCoder, a Large Language Model (LLM) that conducts Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over 30,000 types of knowledge, which is, to the best of our knowledge, the largest one for UIE. To ease the learning process of LLMs, KnowCoder contains a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around 1.5B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves relative improvements of 49.8% F1 over LLaMA2 under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas and achieves improvements of up to 12.5% and 21.9% over state-of-the-art baselines under the zero-shot setting and the low-resource setting, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can simultaneously be utilized to refine KnowCoder, which achieves significant improvements of up to 7.5% under the supervised setting.
Pages: 8758-8779
Page count: 22
Related Papers
50 records in total
  • [21] On coding with a partial knowledge of the state information
    Zaidi, Abdellatif
    Duhamel, Pierre
    2005 39th Asilomar Conference on Signals, Systems and Computers, Vols 1 and 2, 2005, : 657 - 661
  • [22] Knowledge Extraction for Information Retrieval
    Corcoglioniti, Francesco
    Dragoni, Mauro
    Rospocher, Marco
    Aprosio, Alessio Palmero
    SEMANTIC WEB: LATEST ADVANCES AND NEW DOMAINS, 2016, 9678 : 317 - 333
  • [23] Knowledge Extraction from Structured Engineering Drawings
    Lu, Tong
    Yang, Yubin
    Yang, Ruoyu
    Cai, Shijie
    FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2008, : 415 - 419
  • [24] Empowering LLMs for Long-Text Information Extraction in Chinese Legal Documents
    Shen, Chenchen
    Ji, Chengwei
    Yue, Shengbin
    Shen, Xiaoyu
    Song, Yun
    Huang, Xuanjing
    Wei, Zhongyu
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT I, NLPCC 2024, 2025, 15359 : 457 - 469
  • [25] Study on Product Information Coding in the Context of Universal Design
    Shan, Hongxiang
    Wang, Xingsong
    Tian, Mengqian
    Mao, Yuliang
    HUMAN SYSTEMS ENGINEERING AND DESIGN II, 2020, 1026 : 501 - 507
  • [26] Universal Source Coding for Multiple Decoders with Side Information
    Kuzuoka, Shigeaki
    Kimura, Akisato
    Uyematsu, Tomohiko
    2010 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY, 2010, : 1 - 5
  • [27] Coding FRBR-Structured Bibliographic Information in MARC
    Aalberg, Trond
    Mercun, Tanja
    Zumer, Maja
    DIGITAL LIBRARIES: FOR CULTURAL HERITAGE, KNOWLEDGE DISSEMINATION, AND FUTURE CREATION: ICADL 2011, 2011, 7008 : 128 - +
  • [28] Information extraction using the structured language model
    Chelba, C
    Mahajan, M
    PROCEEDINGS OF THE 2001 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, 2001, : 74 - 81
  • [29] Unified Structure Generation for Universal Information Extraction
    Lu, Yaojie
    Liu, Qing
    Dai, Dai
    Xiao, Xinyan
    Lin, Hongyu
    Han, Xianpei
    Sun, Le
    Wu, Hua
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 5755 - 5772
  • [30] Universal Information Extraction as Unified Semantic Matching
    Lou, Jie
    Lu, Yaojie
    Dai, Dai
    Jia, Wei
    Lin, Hongyu
    Han, Xianpei
    Sun, Le
    Wu, Hua
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 13318 - 13326