Intelligent Web Robot for Content Extraction

Cited by: 0
Authors
Wenxing HONG [1]
Jie LI [1]
Weiwei WANG [1]
Yang WENG [2]
Affiliations
[1] Automation Department, Xiamen University
[2] College of Mathematics, Sichuan University
Funding
National Key Research and Development Program of China
DOI
10.15878/j.cnki.instrumentation.2019.03.007
Chinese Library Classification (CLC) number
TP242.6 [Intelligent robots]; TP391.1 [Text information processing]
Discipline classification codes
081104; 081203; 0835
Abstract
The main content of a news web page is a source of data for Natural Language Processing (NLP) and Information Retrieval (IR) and contains large quantities of valuable information. This paper formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent the HTML document by its DOM tree nodes and derive multiple features from node properties such as text length, tag path, and tag properties. Because the core of the problem is the classification model, we use XGBoost to select nodes. Experimental results show that the proposed approach is effective and efficient at extracting the main content of news web pages: it achieves about 98% accuracy on 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
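The abstract does not give the exact feature set or the XGBoost configuration, so the following is only a minimal sketch of the feature extraction step, assuming three illustrative per-node features (text length, tag path, and the `class` attribute). The names `NodeFeatureExtractor` and `extract_features` are hypothetical, and Python's standard `html.parser` stands in for whatever DOM parser the authors used. The resulting records would still need numeric encoding and labels before being passed to a classifier such as XGBoost.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling rather than page content.
NON_CONTENT_TAGS = {"script", "style"}


class NodeFeatureExtractor(HTMLParser):
    """Collect one feature record per element that directly contains text.

    Hypothetical helper: the features below are assumptions for illustration,
    not the feature set used in the paper.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # records for currently open elements
        self.nodes = []   # finished records for elements with direct text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        path = "/".join([node["tag"] for node in self.stack] + [tag])
        self.stack.append({
            "tag": tag,
            "tag_path": path,
            "depth": len(self.stack) + 1,
            "css_class": attr_map.get("class") or "",
            "text_length": 0,
        })

    def handle_data(self, data):
        if not self.stack or self.stack[-1]["tag"] in NON_CONTENT_TAGS:
            return
        # Count only text that sits directly inside the innermost open element.
        self.stack[-1]["text_length"] += len(data.strip())

    def handle_endtag(self, tag):
        if tag not in [node["tag"] for node in self.stack]:
            return  # stray closing tag: ignore it
        # Pop up to the matching element, closing any unclosed children too.
        while self.stack:
            node = self.stack.pop()
            if node["text_length"] > 0:
                self.nodes.append(node)
            if node["tag"] == tag:
                break


def extract_features(html):
    """Return a list of feature dicts, one per text-bearing element."""
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


if __name__ == "__main__":
    sample = (
        "<html><body>"
        "<div class='nav'><a href='/'>Home</a></div>"
        "<div class='article'><p class='lead'>First paragraph of the story.</p></div>"
        "</body></html>"
    )
    for record in extract_features(sample):
        print(record["tag_path"], record["text_length"], record["css_class"])
```

In this sketch a short navigation link and a long article paragraph share a similar tag path but differ sharply in text length, which is the kind of signal a node classifier can exploit.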
Pages: 52-58
Number of pages: 7