A Web Page Segmentation Method based on Page Layouts and Title Blocks

Cited by: 0
Authors
Sano, Hiroyuki [1]
Shiramatsu, Shun [1]
Ozono, Tadachika [1]
Shintani, Toramatsu [1]
Affiliation
[1] Nagoya Inst Technol, Grad Sch Engn, Dept Comp Sci & Engn, Nagoya, Aichi 4668555, Japan
Keywords
Web page segmentation; Page layout; Title block; Machine learning
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this work, we describe a new Web page segmentation method that extracts the semantic structure of a Web page. A typical Web page consists of multiple elements with different functions, such as main content, navigation panels, copyright and privacy notices, and advertisements; Web page segmentation divides the page into visually and semantically cohesive pieces. The proposed method comprises three steps. First, it determines the layout template of a Web page by template matching. Second, it divides the page into minimum blocks. Third, it assembles groups of these blocks into Web content blocks. While the minimum blocks can play many roles, in this study we have focused on those that serve as the titles of the various pieces of Web content. We used decision tree learning with nine parameters for each minimum block to extract the title blocks from Web pages. Experimental results showed that the decision tree generated by the J48 algorithm is the most suitable for this type of extraction.
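The title-block classification step can be illustrated with a small sketch. J48 is Weka's implementation of the C4.5 decision tree learner, which picks each split by gain ratio. The Python below computes that criterion only; it is not the authors' implementation and does not build a full tree. The abstract does not list the nine parameters, so the features `is_bold`, `short_text`, and `has_link`, as well as the six sample blocks and their labels, are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """C4.5-style gain ratio for splitting on one discrete feature."""
    total = len(rows)
    # Group the class labels by the value the feature takes in each row.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Expected entropy remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes features with many distinct values.
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical minimum blocks. These features are assumptions for
# illustration, not the nine parameters used in the paper.
blocks = [
    {"is_bold": True,  "short_text": True,  "has_link": False},
    {"is_bold": True,  "short_text": True,  "has_link": True},
    {"is_bold": False, "short_text": False, "has_link": False},
    {"is_bold": False, "short_text": True,  "has_link": True},
    {"is_bold": True,  "short_text": True,  "has_link": False},
    {"is_bold": False, "short_text": False, "has_link": True},
]
# Hypothetical labels: True means the block is a title block.
is_title = [True, True, False, False, True, False]

# The feature with the highest gain ratio would become the root test.
best_feature = max(blocks[0], key=lambda f: gain_ratio(blocks, is_title, f))
print(best_feature)
```

In a full learner the same selection would be repeated recursively on each subset of blocks, followed by pruning. For real experiments, an existing implementation such as Weka's J48 would be the natural choice.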
Pages: 84-90
Number of pages: 7
Related papers
50 in total
  • [1] A Novel Method for the Web page Segmentation And Identification
    Wang, Jing
    Liu, Zhijing
    [J]. 2009 INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND TECHNOLOGY, VOL I, PROCEEDINGS, 2009, : 229 - 231
  • [2] A method for supporting web page design based on impression of web page
    Watanabe, M
    Yoshida, T
    Saiwaki, N
    Nishida, S
    [J]. IEEE RO-MAN 2000: 9TH IEEE INTERNATIONAL WORKSHOP ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, PROCEEDINGS, 2000, : 13 - 17
  • [3] Toward semantic annotation of Web page's segmentation blocks
    Cosulschi, Mirel
    [J]. ANNALS OF THE UNIVERSITY OF CRAIOVA-MATHEMATICS AND COMPUTER SCIENCE SERIES, 2010, 37 (03): : 92 - 100
  • [4] Automated Repair of Responsive Web Page Layouts
    Althomali, Ibrahim
    Kapfhammer, Gregory M.
    McMinn, Phil
    [J]. 2022 IEEE 15TH INTERNATIONAL CONFERENCE ON SOFTWARE TESTING, VERIFICATION AND VALIDATION (ICST 2022), 2022, : 140 - 150
  • [5] Web page segmentation based on Gestalt theory
    Xiang, Peifeng
    Yang, Xin
    Shi, Yuanchun
    [J]. 2007 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-5, 2007, : 2253 - 2256
  • [6] Web Page Segmentation Evaluation
    Sanoja, Andres
    Gancarski, Stephane
    [J]. 30TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, VOLS I AND II, 2015, : 753 - 760
  • [7] Content-based Title Extraction from Web Page
    Gali, Najlah
    Franti, Pasi
    [J]. PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON WEB INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 2 (WEBIST), 2016, : 204 - 210
  • [8] Web page dependent vision based segmentation for web sites
    Ko, Pyungkwan
    Kang, Sanggil
    Kumar, Harshit
    [J]. 7TH IEEE/ACIS INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE IN CONJUNCTION WITH 2ND IEEE/ACIS INTERNATIONAL WORKSHOP ON E-ACTIVITY, PROCEEDINGS, 2008, : 690 - +
  • [9] Web Page Segmentation with Structured Prediction and its Application in Web Page Classification
    Bing, Lidong
    Guo, Rui
    Lam, Wai
    Niu, Zheng-Yu
    Wang, Haifeng
    [J]. SIGIR'14: PROCEEDINGS OF THE 37TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2014, : 767 - 776
  • [10] Web page title extraction and its application
    Xue, Yewei
    Hu, Yunhua
    Xin, Guomao
    Song, Ruihua
    Shi, Shuming
    Cao, Yunbo
    Lin, Chin-Yew
    Li, Hang
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2007, 43 (05) : 1332 - 1347