Extracting Visually Presented Element Relationships from Web Documents

Authors
Burget, Radek [1]
Smrz, Pavel [1]
Affiliations
[1] Brno Univ Technol, Fac Informat Technol, IT4Innovat Ctr Excellence, Brno, Czech Republic
Funding
European Union Seventh Framework Programme;
Keywords
Document Analysis; Element Relationships; Logical Document Structure; Page Segmentation; Web Documents;
DOI
10.4018/ijcini.2013040102
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Many documents on the World Wide Web present structured information that consists of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. Instead, they are expressed by the visual presentation of the content, which a human reader is expected to interpret. In this paper, the authors propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns in the documents. The model describes the visually expressed relationships between individual parts of the content independently of the document format and the particular way of presentation. Therefore, it can be used as a document model in many information retrieval or extraction applications. The authors formally define the model, introduce a method of extracting the relationships between the content parts based on visual presentation analysis, and discuss the expected applications. They also present a new dataset consisting of programmes of conferences and other scientific events and discuss its suitability for the task at hand. Finally, they use the dataset to evaluate the results of the implemented system.
Pages: 13-29
Number of pages: 17