WebKE: Knowledge Extraction from Semi-structured Web with Pre-trained Markup Language Model

Cited by: 3
Authors
Xie, Chenhao [1 ,2 ]
Huang, Wenhao [1 ]
Liang, Jiaqing [1 ,2 ]
Huang, Chengsong [1 ]
Xiao, Yanghua [1 ,3 ]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Data Sci, Sch Comp Sci, Shanghai, Peoples R China
[2] Shuyan Technol Inc, Ningbo, Peoples R China
[3] Fudan Aishu Cognit Intelligence Joint Res Ctr, Shanghai, Peoples R China
Funding
China Postdoctoral Science Foundation;
Keywords
Semi-structured web extraction; Knowledge graph construction; Knowledge extraction; Relation extraction; WebKE; HTMLBERT; Pre-trained markup language model; Scale information extraction; Wrapper;
DOI
10.1145/3459637.3482491
Chinese Library Classification (CLC)
TP [Automation technology; computer technology];
Discipline classification code
0812;
Abstract
The World Wide Web contains rich, up-to-date information for knowledge graph construction. However, most current relation extraction techniques are designed for free text and therefore do not handle semi-structured web content well. In this paper, we propose a novel multi-phase machine reading framework called WebKE. It processes web content at different granularities, first detecting areas of interest at the DOM-tree node level and then extracting relational triples from each area. We also propose HTMLBERT as an encoder for web content: a pre-trained markup language model that fully leverages visual layout information and the DOM-tree structure without the need for hand-engineered features. Experimental results show that the proposed approach outperforms state-of-the-art methods by a considerable margin. The source code is available at https://github.com/redreamality/webke.
Pages: 2211-2220 (10 pages)
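The abstract describes a two-phase pipeline: detect regions of interest over the DOM tree, then extract relational triples from each region. The sketch below is a minimal illustration of that control flow only. All class and function names are hypothetical, and the region scoring and triple decoding steps are stubbed with trivial heuristics rather than the HTMLBERT encoder; refer to the repository linked above for the authors' actual implementation.

```python
# Minimal sketch of the two-phase "detect regions, then extract triples" flow
# described in the abstract. Names and heuristics here are hypothetical
# placeholders, not the WebKE API (see https://github.com/redreamality/webke).
from dataclasses import dataclass
from typing import List

from lxml import html  # standard HTML/DOM parsing library


@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str


def detect_regions(dom_root) -> List[object]:
    """Phase 1 (placeholder): pick DOM nodes likely to contain relational facts.
    In the paper this scoring is done with a pre-trained markup LM (HTMLBERT)."""
    return dom_root.xpath("//table | //dl | //ul[@class]")


def extract_triples(region) -> List[Triple]:
    """Phase 2 (placeholder): decode triples from one detected region.
    Here we simply pair <th>/<td> cells as (page entity, header, value)."""
    triples: List[Triple] = []
    for row in region.xpath(".//tr"):
        cells = [c.text_content().strip() for c in row.xpath("./th | ./td")]
        if len(cells) == 2 and all(cells):
            triples.append(Triple("<page entity>", cells[0], cells[1]))
    return triples


def run_pipeline(page_html: str) -> List[Triple]:
    """Run region detection followed by per-region triple extraction."""
    root = html.fromstring(page_html)
    results: List[Triple] = []
    for region in detect_regions(root):
        results.extend(extract_triples(region))
    return results


if __name__ == "__main__":
    demo = ("<html><body><table>"
            "<tr><th>Genre</th><td>Science fiction</td></tr>"
            "</table></body></html>")
    for t in run_pipeline(demo):
        print(t)
```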