IEPILE: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Cited by: 0

Authors:

Gui, Honghao [1,2]

Yuan, Lin [2,3]

Ye, Hongbin [1]

Zhang, Ningyu [1,3]

Sun, Mengshu [2,3]

Liang, Lei [2,3]

Chen, Huajun [1,3]

Affiliations:

[1] Zhejiang University, Hangzhou, People's Republic of China

[2] Ant Group, Hangzhou, People's Republic of China

[3] Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph, Hangzhou, People's Republic of China

Source:

PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2: SHORT PAPERS | 2024

Funding:

National Natural Science Foundation of China;

Keywords:

RECOGNITION;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPILE, a comprehensive bilingual (English and Chinese) IE instruction corpus containing approximately 0.32B tokens. We construct IEPILE by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPILE enhances the performance of LLMs on IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.


Pages: 127-146

Page count: 20
