IEPILE: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Citations: 0
Authors
Gui, Honghao [1, 2]
Yuan, Lin [2, 3]
Ye, Hongbin [1]
Zhang, Ningyu [1, 3]
Sun, Mengshu [2, 3]
Liang, Lei [2, 3]
Chen, Huajun [1, 3]
Affiliations
[1] Zhejiang University, Hangzhou, People's Republic of China
[2] Ant Group, Hangzhou, People's Republic of China
[3] Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph, Hangzhou, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
RECOGNITION;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPILE, a comprehensive bilingual (English and Chinese) IE instruction corpus containing approximately 0.32B tokens. We construct IEPILE by collecting and cleaning 33 existing IE datasets, and we introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPILE enhances the performance of LLMs on IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
Pages: 127-146
Page count: 20