Intelligent Web Robot for Content Extraction

Cited by: 0

Authors:

Wenxing HONG [1]

Jie LI [1]

Weiwei WANG [1]

Yang WENG [2]

Affiliations:

[1] Automation Department, Xiamen University

[2] College of Mathematics, Sichuan University

Source:

Instrumentation | 2019 / Vol. 6 / No. 03

Funding:

National Key Research and Development Program;

Keywords:

DOI:

10.15878/j.cnki.instrumentation.2019.03.007

Chinese Library Classification (CLC) number:

TP242.6 [Intelligent robots]; TP391.1 [Text information processing];

Discipline classification code:

081104; 081203; 0835;

Abstract:

The main content of a news web page is a source of data for Natural Language Processing (NLP) and Information Retrieval (IR) and contains large quantities of valuable information. This paper formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent an HTML document by its DOM tree nodes and derive multiple features from node properties such as text length, tag path, and tag properties. Because the core of the problem is the classification model, we use XGBoost to select nodes. Experimental results show that the proposed approach is effective and efficient at extracting the main content of news web pages: it achieves about 98% accuracy on 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
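The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `NodeFeatureExtractor`, the function `extract_features`, the feature names, and the sample page are all hypothetical, and the input is assumed to be well-formed HTML. It uses only the Python standard library and computes, for each element node, its tag path, depth, `class` attribute, and the length of the text it contains.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped in this sketch (assumption).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class NodeFeatureExtractor(HTMLParser):
    """Collect one feature dict per element node (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # nodes that are currently open
        self.nodes = []  # finished feature dicts, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        path = "/".join([n["tag"] for n in self.stack] + [tag])
        self.stack.append({
            "tag": tag,
            "tag_path": path,
            "depth": len(self.stack) + 1,
            "css_class": dict(attrs).get("class") or "",
            "text_length": 0,
        })

    def handle_data(self, data):
        # Text counts toward every open ancestor, so a container node
        # accumulates the text of all its descendants.
        length = len(data.strip())
        for node in self.stack:
            node["text_length"] += length

    def handle_endtag(self, tag):
        # Assumes well-formed input: only close the node on top of the stack.
        if self.stack and self.stack[-1]["tag"] == tag:
            self.nodes.append(self.stack.pop())


def extract_features(html):
    """Return a list of per-node feature dicts for an HTML string."""
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Hypothetical sample page: a navigation block and an article block.
SAMPLE_HTML = (
    '<html><body><div class="nav"><a>Home</a></div>'
    '<div class="article"><p>First paragraph.</p><p>Second one.</p></div>'
    '</body></html>'
)

if __name__ == "__main__":
    for node in extract_features(SAMPLE_HTML):
        print(node["tag_path"], node["css_class"], node["text_length"])
```

In a full pipeline, feature rows like these, paired with labels marking which node holds the main content, would be passed to a classifier such as XGBoost, as the abstract describes. The classifier is omitted here because the record does not give its configuration.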


Pages: 52-58

Page count: 7
