Intelligent Web Robot for Content Extraction

Cited by: 0

Authors:

Wenxing HONG [1]

Jie LI [1]

Weiwei WANG [1]

Yang WENG [2]

Affiliations:

[1] Automation Department, Xiamen University

[2] College of Mathematics, Sichuan University

Source:

Instrumentation | 2019, Vol. 6, No. 03

Funding:

National Key Research and Development Program of China;

Keywords:

DOI:

10.15878/j.cnki.instrumentation.2019.03.007

Chinese Library Classification (CLC) number:

TP242.6 [Intelligent robots]; TP391.1 [Text information processing];

Discipline classification codes:

081104 ; 081203 ; 0835 ;

Abstract:

The main content of a news web page is a source of data for Natural Language Processing (NLP) and Information Retrieval (IR), and it contains large quantities of valuable information. This paper proposes a method that formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent an HTML document by its DOM tree nodes and derive multiple features from node properties such as text length, tag path, and tag attributes. Because the problem is essentially one of classification, we use XGBoost to select the nodes. Experimental results show that the proposed approach is effective and efficient at extracting the main content of news web pages: it achieves about 98% accuracy on 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
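The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` module and a hypothetical feature set (`text_len`, `tag_path`, `depth`, `has_id`, `has_class`, `n_attrs`); the abstract names only text length, tag path and tag properties, so the exact features, the helper names `NodeFeatureParser` and `extract_node_features`, and the parsing strategy are assumptions rather than the authors' implementation. The resulting per-node records would then be encoded numerically and passed to a classifier such as XGBoost, which is not shown here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NodeFeatureParser(HTMLParser):
    """Collects one feature record per element (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.nodes = []   # finished feature records, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        path = "/".join([n["tag"] for n in self.stack] + [tag])
        self.stack.append({
            "tag": tag,
            "tag_path": path,               # e.g. "html/body/div/p"
            "depth": len(self.stack) + 1,   # root element has depth 1
            "text_len": 0,                  # filled in by handle_data
            "has_id": int("id" in attr_map),
            "has_class": int("class" in attr_map),
            "n_attrs": len(attr_map),
        })

    def handle_data(self, data):
        # Text counts toward every open ancestor, so a container's
        # text_len includes the text of all its descendants.
        n = len(data.strip())
        for node in self.stack:
            node["text_len"] += n

    def handle_endtag(self, tag):
        # Ignore stray closing tags that match no open element.
        if not any(n["tag"] == tag for n in self.stack):
            return
        # Close the matching element and any unclosed elements inside it.
        while self.stack:
            node = self.stack.pop()
            self.nodes.append(node)
            if node["tag"] == tag:
                break


def extract_node_features(html):
    """Return a list of feature dicts, one per DOM element in `html`."""
    parser = NodeFeatureParser()
    parser.feed(html)
    parser.close()
    # Flush elements that were never explicitly closed.
    while parser.stack:
        parser.nodes.append(parser.stack.pop())
    return parser.nodes


if __name__ == "__main__":
    page = ("<html><body><div id='nav'><a href='#'>Home</a></div>"
            "<div class='article'><p>Some long news text here.</p></div>"
            "</body></html>")
    for feat in extract_node_features(page):
        print(feat["tag_path"], feat["text_len"], feat["has_class"])
```

In this sketch a node such as the article `div` stands out from the navigation `div` by its larger `text_len`, which is the kind of signal a node classifier could learn from; a real pipeline would also need labelled nodes and a numeric encoding of `tag_path`.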


Pages: 52-58

Page count: 7