Intelligent Web Robot for Content Extraction

Cited by: 0

Authors:

Wenxing HONG [1]

Jie LI [1]

Weiwei WANG [1]

Yang WENG [2]

Affiliations:

[1] Automation Department, Xiamen University

[2] College of Mathematics, Sichuan University

Source:

Instrumentation | 2019, Vol. 6, No. 03

Funding:

National Key Research and Development Program

Keywords:

DOI:

10.15878/j.cnki.instrumentation.2019.03.007

Chinese Library Classification (CLC) number:

TP242.6 [Intelligent robots]; TP391.1 [Text information processing]

Discipline classification codes:

081104; 081203; 0835

Abstract:

The main content of a news web page contains large quantities of valuable information and is a source of data for Natural Language Processing (NLP) and Information Retrieval (IR). This paper proposes a method that formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent the HTML document by its DOM tree nodes and derive multiple features from node properties, such as text length, tag path, and tag properties. Because the core of the problem is the classification model, we use XGBoost to select nodes. Experimental results show that the proposed approach is effective and efficient in extracting the main content of news web pages: it achieves about 98% accuracy on 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
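The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature set (`tag_path`, `depth`, `attr_count`, `has_class`, `text_length`), the class `NodeFeatureParser`, and the function `extract_node_features` are hypothetical names chosen for illustration, and only Python's standard `html.parser` is used. The classification step is omitted; in the paper's setting, records like these would be converted to numeric vectors and passed to an XGBoost classifier.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are recorded but not kept open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NodeFeatureParser(HTMLParser):
    """Collects one feature record per element node (illustrative feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # records of the currently open elements
        self.nodes = []  # all element records, in document order

    def handle_starttag(self, tag, attrs):
        node = {
            "tag": tag,
            "tag_path": "/".join([n["tag"] for n in self.stack] + [tag]),
            "depth": len(self.stack) + 1,
            "attr_count": len(attrs),
            "has_class": any(name == "class" for name, _ in attrs),
            "text_length": 0,
        }
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Skip script and style bodies, which are not visible text.
        if self.stack and self.stack[-1]["tag"] in ("script", "style"):
            return
        # Credit text to every open ancestor so containers sum their subtree.
        length = len(data.strip())
        for node in self.stack:
            node["text_length"] += length


def extract_node_features(html):
    """Return a list of per-node feature dictionaries for an HTML string."""
    parser = NodeFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


# A made-up page: a short navigation block and a longer article block.
sample = (
    "<html><body>"
    "<div class='nav'><a href='/'>Home</a></div>"
    "<div class='article'><p>First paragraph of the story.</p>"
    "<p>Second paragraph.</p></div>"
    "</body></html>"
)
features = extract_node_features(sample)
for node in features:
    print(node["tag_path"], node["text_length"])
```

In this toy page the article `div` accumulates far more text than the navigation `div`, which is the kind of signal a node classifier could learn from, alongside the tag path and attribute features.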

Pages: 52-58

Page count: 7
