TPS: an Unsupervised Web Page Segmentation Algorithm Based on Dom Tree Structure Mining

Cited by: 0
Authors
Li, Chunshan [1]
Ye, Yunming [1]
Zhang, Xiaofeng [1]
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen Grad Sch, Shenzhen 518055, Peoples R China
Keywords
Page Segmentation; DOM Tree; Web Structure Mining
DOI
Not available
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
Segmenting web pages into small modules that match users' intuitive sense of page structure is an important preprocessing step in mobile device browsing, information retrieval, and data extraction applications. Traditional page segmentation algorithms usually exploit heuristic information from the page content and the DOM tree, such as visual cues or tag attributes in the DOM tree, but ignore useful features contained in both the sub-tree structures of the DOM tree and the semantic content of pages, which in turn leads to poor performance when segmenting complex web pages. In this paper, we present a novel unsupervised page segmentation algorithm, TPS, that exploits richer features in DOM trees. The algorithm bridges the gap between the DOM structure and the semantic modules, and identifies modules by mining the sub-tree structures of DOM trees. Experimental results on various web pages demonstrate that TPS outperforms the state-of-the-art algorithm VIPS.
Pages: 387-394
Number of pages: 8