Learning Web Content Extraction with DOM Features

Cited by: 0
Authors
Utiu, Nichita [1]
Ionescu, Vlad-Sebastian [1]
Affiliations
[1] Babes Bolyai Univ, Dept Comp Sci, 1 M Kogalniceanu St, Cluj Napoca 400084, Romania
Keywords
SELECTION;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Content extraction is the process that aims to separate the main content of web pages from the bulk of template and decorative components. We present a method of doing this which achieves competitive performance on the Cleaneval dataset and sets a new state-of-the-art with an F1 score of 0.96 on the Dragnet dataset. We accomplish this by modeling the task as a classification problem over HTML tags using features based on information from the DOM tree. Not only do we obtain a performance increase over current methods, but we do so with minimal feature engineering and without the extensive preprocessing steps of other methods.
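The abstract describes classifying HTML tags using features derived from the DOM tree. The following is a minimal sketch of what per-tag feature extraction could look like, using only the Python standard library. The feature set (depth, child count, text length, link density) and all names such as `extract_features` are illustrative assumptions, not the features or code used by the authors; the resulting rows are the kind of input a tag-level classifier would be trained on.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simplified tree from HTML using the stdlib parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def depth(node):
    d = 0
    while node.parent is not None:
        node = node.parent
        d += 1
    return d


def text_len(node):
    """Length of all text in the subtree rooted at node."""
    return len(node.text.strip()) + sum(text_len(c) for c in node.children)


def link_text_len(node, inside_link=False):
    """Length of subtree text that sits inside an <a> element."""
    inside = inside_link or node.tag == "a"
    own = len(node.text.strip()) if inside else 0
    return own + sum(link_text_len(c, inside) for c in node.children)


def tag_features(node):
    """Illustrative features only; the paper's actual features may differ."""
    total = text_len(node)
    return {
        "tag": node.tag,
        "depth": depth(node),
        "num_children": len(node.children),
        "text_len": total,
        "link_density": link_text_len(node) / total if total else 0.0,
    }


def extract_features(html):
    """Return one feature row per element, in document (pre-order) order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    rows = []

    def visit(node):
        rows.append(tag_features(node))
        for child in node.children:
            visit(child)

    for child in builder.root.children:
        visit(child)
    return rows


if __name__ == "__main__":
    page = '<div><p>Hello world</p><ul><li><a href="#">Home</a></li></ul></div>'
    for row in extract_features(page):
        print(row)
```

Each row could then be paired with a content/boilerplate label and passed to any standard classifier. Navigation blocks tend to show high link density, which is one reason a feature like this is a plausible choice here.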
Pages: 5-11
Page count: 7