Intelligent Web Robot for Content Extraction

Authors
Wenxing HONG [1]
Jie LI [1]
Weiwei WANG [1]
Yang WENG [2]
Affiliations
[1] Automation Department, Xiamen University
[2] College of Mathematics, Sichuan University
Funding
National Key Research and Development Program of China
DOI
10.15878/j.cnki.instrumentation.2019.03.007
Chinese Library Classification (CLC) number
TP242.6 [Intelligent robots]; TP391.1 [Text information processing]
Discipline classification code
081104; 081203; 0835
Abstract
The main content of a news web page is a valuable source of data for Natural Language Processing (NLP) and Information Retrieval (IR). This paper formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent the HTML document by its DOM tree nodes and derive multiple features from node properties such as text length, tag path, and tag properties. Because the core of the problem is the classification model, we use XGBoost to select the nodes. Experimental results show that the proposed approach is effective and efficient at extracting the main content of news web pages: it achieves about 98% accuracy on 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
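The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `NodeFeatureParser`, the function `extract_node_features`, and the exact feature set (tag path, depth, presence of a `class` attribute, text length, link density) are assumptions chosen for illustration, and only the Python standard library is used. The resulting per-node feature rows are the kind of input a classifier such as XGBoost could be trained on; the classifier itself is not shown.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not readable content.
NON_CONTENT_TAGS = {"script", "style"}


class NodeFeatureParser(HTMLParser):
    """Collects one feature dict per element node (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # element nodes that are currently open
        self.nodes = []  # all element nodes, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = {
            "tag": tag,
            "tag_path": "/".join([n["tag"] for n in self.stack] + [tag]),
            "depth": len(self.stack) + 1,
            "has_class": any(name == "class" for name, _ in attrs),
            "text_length": 0,
            "link_text_length": 0,
        }
        self.stack.append(node)
        self.nodes.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        length = len(data.strip())
        if length == 0:
            return
        open_tags = {n["tag"] for n in self.stack}
        if open_tags & NON_CONTENT_TAGS:
            return
        in_link = "a" in open_tags
        # Text counts toward every open ancestor, not only the direct parent.
        for node in self.stack:
            node["text_length"] += length
            if in_link:
                node["link_text_length"] += length


def extract_node_features(html):
    """Return a list of feature dicts, one per element node of the document."""
    parser = NodeFeatureParser()
    parser.feed(html)
    parser.close()
    for node in parser.nodes:
        total = node["text_length"]
        node["link_density"] = node["link_text_length"] / total if total else 0.0
    return parser.nodes


if __name__ == "__main__":
    page = (
        '<html><body>'
        '<div class="nav"><a href="/">Home</a><a href="/world">World</a></div>'
        '<div class="article"><p>The council approved the new budget.</p></div>'
        '</body></html>'
    )
    for row in extract_node_features(page):
        print(row["tag_path"], row["text_length"], row["link_density"])
```

In this sketch a navigation block ends up with a link density of 1.0 and an article block with 0.0, which is the sort of signal that lets a node classifier separate main content from page chrome.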
Pages: 52-58
Number of pages: 7