Intelligent Web Robot for Content Extraction

Authors
Wenxing HONG [1]
Jie LI [1]
Weiwei WANG [1]
Yang WENG [2]
Affiliations
[1] Automation Department, Xiamen University
[2] College of Mathematics, Sichuan University
Funding
National Key Research and Development Program of China
Keywords
DOI
10.15878/j.cnki.instrumentation.2019.03.007
Chinese Library Classification (CLC) number
TP242.6 [Intelligent robots]; TP391.1 [Text information processing]
Discipline classification codes
081104; 081203; 0835
Abstract
The main content of a news web page is a source of data for Natural Language Processing (NLP) and Information Retrieval (IR) and contains large quantities of valuable information. This paper proposes a method that formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent the HTML document by its DOM tree nodes and derive multiple features from node properties such as text length, tag path, and tag properties. Because the problem is essentially one of classification, we use XGBoost to select nodes. Experimental results show that the proposed approach is effective and efficient at extracting the main content of news web pages: it achieves about 98% accuracy over 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
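The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only Python's standard `html.parser`, and every name in it (`NodeFeatureParser`, `extract_node_features`, `to_vector`, the sample `PAGE`) is hypothetical. It records, for each DOM element, the text length of its subtree, its tag path, its depth, and its attributes. The classifier itself (XGBoost in the paper) is not shown; the numeric vectors are only an assumed input format for such a model.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is code, not readable content.
NON_CONTENT_TAGS = {"script", "style"}


class NodeFeatureParser(HTMLParser):
    """Collects one feature record per element while walking the document."""

    def __init__(self):
        super().__init__()
        self.stack = []  # records of the currently open elements
        self.nodes = []  # finished records, in the order elements close

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        path = "/".join([n["tag"] for n in self.stack] + [tag])
        self.stack.append({
            "tag": tag,
            "tag_path": path,
            "depth": len(self.stack),
            "attrs": dict(attrs),
            "text_length": 0,
        })

    def handle_data(self, data):
        if self.stack and self.stack[-1]["tag"] in NON_CONTENT_TAGS:
            return
        # Credit the text to every open ancestor, so a container node
        # carries the text length of its whole subtree.
        length = len(data.strip())
        for node in self.stack:
            node["text_length"] += length

    def handle_endtag(self, tag):
        # Close the nearest matching open element; anything opened after it
        # is treated as implicitly closed. Stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                self.nodes.extend(reversed(self.stack[i:]))
                del self.stack[i:]
                break


def extract_node_features(html):
    """Return one feature record per element of the given HTML string."""
    parser = NodeFeatureParser()
    parser.feed(html)
    parser.close()
    # Flush elements that were never closed in the source.
    parser.nodes.extend(reversed(parser.stack))
    return parser.nodes


def to_vector(node):
    """Assumed numeric encoding: text length, depth, attribute count, has class."""
    attrs = node["attrs"]
    return [node["text_length"], node["depth"], len(attrs), int("class" in attrs)]


PAGE = """<html><body>
<div class="nav"><a href="/">Home</a></div>
<div id="article"><p>First paragraph of the story.</p><p>Second one.</p></div>
</body></html>"""

features = extract_node_features(PAGE)
for node in features:
    print(node["tag_path"], to_vector(node))
```

In this toy page the `div` holding the article text ends up with a much larger subtree text length than the navigation `div`, which is the kind of signal a node classifier could learn from. The tag path is kept as a string here; a real pipeline would need to encode it numerically before training.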
Pages: 52-58
Page count: 7