WebKE: Knowledge Extraction from Semi-structured Web with Pre-trained Markup Language Model

Cited by: 3
Authors
Xie, Chenhao [1 ,2 ]
Huang, Wenhao [1 ]
Liang, Jiaqing [1 ,2 ]
Huang, Chengsong [1 ]
Xiao, Yanghua [1 ,3 ]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Data Sci, Sch Comp Sci, Shanghai, Peoples R China
[2] Shuyan Technol Inc, Ningbo, Peoples R China
[3] Fudan Aishu Cognit Intelligence Joint Res Ctr, Shanghai, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Semi-structured web extraction; Knowledge graph construction; Knowledge extraction; Relation extraction; WebKE; HTMLBERT; Pre-trained markup language model; Scale information extraction; Wrapper;
DOI
10.1145/3459637.3482491
Chinese Library Classification (CLC)
TP [Automation technology; computer technology];
Discipline classification code
0812;
Abstract
The World Wide Web contains rich, up-to-date information for knowledge graph construction. However, most current relation extraction techniques are designed for free text and therefore do not handle semi-structured web content well. In this paper, we propose a novel multi-phase machine reading framework called WebKE. It processes web content at different granularities, first detecting areas of interest at the DOM-tree node level and then extracting relational triples from each area. We also propose HTMLBERT as an encoder for web content: a pre-trained markup language model that fully leverages visual layout information and the DOM-tree structure without the need for hand-engineered features. Experimental results show that the proposed approach outperforms state-of-the-art methods by a considerable margin. The source code is available at https://github.com/redreamality/webke.
Pages: 2211-2220 (10 pages)
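The abstract describes a two-phase pipeline: detect regions of interest over the DOM tree, then extract relational triples from each region. The sketch below is a minimal illustration of that control flow only. All class and function names are hypothetical, and the region scoring and triple decoding steps are stubbed with trivial heuristics rather than the HTMLBERT encoder; refer to the repository linked above for the authors' actual implementation.

```python
# Minimal sketch of the two-phase "detect regions, then extract triples" flow
# described in the abstract. Names and heuristics here are hypothetical
# placeholders, not the WebKE API (see https://github.com/redreamality/webke).
from dataclasses import dataclass
from typing import List

from lxml import html  # standard HTML/DOM parsing library


@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str


def detect_regions(dom_root) -> List[object]:
    """Phase 1 (placeholder): pick DOM nodes likely to contain relational facts.
    In the paper this scoring is done with a pre-trained markup LM (HTMLBERT)."""
    return dom_root.xpath("//table | //dl | //ul[@class]")


def extract_triples(region) -> List[Triple]:
    """Phase 2 (placeholder): decode triples from one detected region.
    Here we simply pair <th>/<td> cells as (page entity, header, value)."""
    triples: List[Triple] = []
    for row in region.xpath(".//tr"):
        cells = [c.text_content().strip() for c in row.xpath("./th | ./td")]
        if len(cells) == 2 and all(cells):
            triples.append(Triple("<page entity>", cells[0], cells[1]))
    return triples


def run_pipeline(page_html: str) -> List[Triple]:
    """Run region detection followed by per-region triple extraction."""
    root = html.fromstring(page_html)
    results: List[Triple] = []
    for region in detect_regions(root):
        results.extend(extract_triples(region))
    return results


if __name__ == "__main__":
    demo = ("<html><body><table>"
            "<tr><th>Genre</th><td>Science fiction</td></tr>"
            "</table></body></html>")
    for t in run_pipeline(demo):
        print(t)
```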