Learning Web Content Extraction with DOM Features

Cited by: 0
Authors
Utiu, Nichita [1]
Ionescu, Vlad-Sebastian [1]
Affiliations
[1] Babes Bolyai Univ, Dept Comp Sci, 1 M Kogalniceanu St, Cluj Napoca 400084, Romania
Keywords
SELECTION;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Content extraction is the process that aims to separate the main content of web pages from the bulk of template and decorative components. We present a method of doing this which achieves competitive performance on the Cleaneval dataset and sets a new state-of-the-art with an F1 score of 0.96 on the Dragnet dataset. We accomplish this by modeling the task as a classification problem over HTML tags using features based on information from the DOM tree. Not only do we obtain a performance increase over current methods, but we do so with minimal feature engineering and without the extensive preprocessing steps of other methods.
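The abstract frames extraction as classification over HTML tags with DOM-derived features, but this record does not list the features or the classifier used. The following is only a minimal sketch of the feature-extraction half, under assumptions: the feature set (tag name, depth, child count, text length, link density) and the names `DomFeatureParser` and `extract_features` are hypothetical and not taken from the paper. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomFeatureParser(HTMLParser):
    """Collects one feature row per element (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, innermost last
        self.rows = []   # finished feature rows, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack[-1]["num_children"] += 1
        self.stack.append({"tag": tag, "depth": len(self.stack),
                           "num_children": 0, "text_len": 0,
                           "link_text_len": 0})

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        # Text counts as link text whenever any open element is an anchor.
        inside_link = any(e["tag"] == "a" for e in self.stack)
        for element in self.stack:  # credit the text to every open ancestor
            element["text_len"] += n
            if inside_link:
                element["link_text_len"] += n

    def handle_endtag(self, tag):
        if not any(e["tag"] == tag for e in self.stack):
            return  # stray closing tag, ignore it
        while self.stack:
            node = self.stack.pop()
            self.rows.append(node)
            if node["tag"] == tag:
                break


def extract_features(html):
    """Return a list of per-element feature dicts for an HTML string."""
    parser = DomFeatureParser()
    parser.feed(html)
    parser.close()
    rows = parser.rows + list(reversed(parser.stack))  # flush unclosed tags
    for row in rows:
        total = row["text_len"]
        row["link_density"] = row["link_text_len"] / total if total else 0.0
    return rows


rows = extract_features("<div><p>Hello world</p><a href='#'>nav</a></div>")
for row in rows:
    print(row["tag"], row["depth"], row["text_len"], row["link_density"])
```

Each resulting row could then be paired with a content or boilerplate label and passed to any standard classifier; which model and labels the authors used is not stated in this record.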
Pages: 5-11
Number of pages: 7
Related papers
50 in total
  • [1] DOM Tree Based Approach for Web Content Extraction
    Mehta, Bhavdeep
    Narvekar, Meera
    [J]. 2015 International Conference on Communication, Information & Computing Technology (ICCICT), 2015,
  • [2] Web Content Information Extraction Based on DOM Tree and Statistical Information
    Yu, Xin
    Jin, Zhengping
    [J]. 2017 17TH IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT 2017), 2017, : 1308 - 1311
  • [3] Using the DOM Tree for Content Extraction
    Lopez, Sergio
    Silva, Josep
    Insa, David
    [J]. ELECTRONIC PROCEEDINGS IN THEORETICAL COMPUTER SCIENCE, 2012, (98): : 46 - 59
  • [4] Automatic Web Content Extraction by Combination of Learning and Grouping
    Wu, Shanchan
    Liu, Jerry
    Fan, Jian
    [J]. PROCEEDINGS OF THE 24TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW 2015), 2015, : 1264 - 1274
  • [5] Extracting Content for News Web Pages based on DOM
    Geng, Hua
    Gao, Qiang
    Pan, Jingui
    [J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2007, 7 (02): : 124 - 129
  • [6] DOM Based Content Extraction via Text Density
    Sun, Fei
    Song, Dandan
    Liao, Lejian
    [J]. PROCEEDINGS OF THE 34TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR'11), 2011, : 245 - 254
  • [7] Web Article Extraction for Web Printing: a DOM plus Visual based Approach
    Luo, Ping
    Fan, Jian
    Liu, Sam
    Lin, Fen
    Xiong, Yuhong
    Liu, Jerry
    [J]. DOCENG'09: PROCEEDINGS OF THE 2009 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, 2009, : 66 - 69
  • [8] Machine-Learning directed Article Detection on the Web using DOM and text-based features
    Mathur, Shobhit
    Nikam, Pritam
    Patidar, Harshita
    Gaikwad, Rohan Bapusaheb
    Nayak, Preeti Narayan
    [J]. 2021 IEEE 18TH ANNUAL CONSUMER COMMUNICATIONS & NETWORKING CONFERENCE (CCNC), 2021,
  • [9] An improved DOM-based algorithm for Web information extraction
    Zhang, Li
    Li, Meng
    Dong, Nannan
    Wang, Yuanlong
    [J]. Journal of Information and Computational Science, 2011, 8 (07): : 1113 - 1121
  • [10] Joint Learning of Structural and Textual Features for Web Scale Event Extraction
    Wiedmann, Julia
    [J]. THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 5056 - 5057