Extracting content structure for web pages based on visual representation

Cited by: 0
Authors
Cai, D
Yu, SP
Wen, JR
Ma, WY
Affiliations
[1] Tsinghua Univ, Beijing 100084, Peoples R China
[2] Peking Univ, Beijing 100871, Peoples R China
Source
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper proposes a new web content structure based on visual representation. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. The paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. The approach simulates how a user understands the layout structure of a web page based on visual perception. Compared with existing techniques, it is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure. Experiments show satisfactory results.
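The abstract does not spell out the algorithm, so the following is only an illustrative sketch of what a top-down, tag-tree-independent segmentation could look like, not the authors' method. It swaps in a plain recursive XY-cut: given the rendered bounding boxes of page elements, it repeatedly splits a region at its widest band of whitespace. The names `Box`, `Block`, `widest_gap`, `segment`, and the `min_gap` threshold are all hypothetical, as is the sample page.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box of a rendered element, in page pixels (hypothetical input)."""
    name: str
    x: float
    y: float
    w: float
    h: float


@dataclass
class Block:
    """A node of the detected structure: the boxes it covers and its sub-blocks."""
    boxes: List[Box]
    children: List["Block"] = field(default_factory=list)


def widest_gap(boxes: List[Box], axis: str) -> Optional[Tuple[float, float]]:
    """Return (gap_size, cut_position) of the widest empty band along an axis.

    axis "y" finds horizontal whitespace bands, axis "x" vertical ones.
    Returns None when the boxes leave no empty band.
    """
    if axis == "y":
        spans = sorted((b.y, b.y + b.h) for b in boxes)
    else:
        spans = sorted((b.x, b.x + b.w) for b in boxes)
    best = None
    reach = spans[0][1]  # farthest edge covered by the spans seen so far
    for start, end in spans[1:]:
        gap = start - reach
        if gap > 0 and (best is None or gap > best[0]):
            best = (gap, reach + gap / 2)  # cut through the middle of the band
        reach = max(reach, end)
    return best


def segment(boxes: List[Box], min_gap: float = 10.0) -> Block:
    """Recursively split boxes at the widest whitespace band (recursive XY-cut)."""
    block = Block(boxes=list(boxes))
    if len(boxes) < 2:
        return block
    candidates = []
    for axis in ("y", "x"):
        found = widest_gap(boxes, axis)
        if found is not None and found[0] >= min_gap:
            candidates.append((found[0], axis, found[1]))
    if not candidates:
        return block  # no separator wide enough: treat as one visual block
    _, axis, cut = max(candidates, key=lambda c: c[0])
    start_of = (lambda b: b.y) if axis == "y" else (lambda b: b.x)
    first = [b for b in boxes if start_of(b) < cut]
    second = [b for b in boxes if start_of(b) >= cut]
    block.children = [segment(first, min_gap), segment(second, min_gap)]
    return block


# Hypothetical page layout: header, left navigation, article, footer.
page = [
    Box("header", 0, 0, 800, 80),
    Box("nav", 0, 100, 180, 500),
    Box("article", 200, 100, 600, 500),
    Box("footer", 0, 620, 800, 60),
]
tree = segment(page)
```

The sketch reads only geometry, never tag names or nesting, which is the sense in which such an approach is independent of the HTML tree. A real system would also need visual cues such as background color, font, and separator lines, which this example ignores.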
Pages: 406 - 417
Page count: 12