Extracting Visually Presented Element Relationships from Web Documents

Cited by: 0
Authors
Burget, Radek [1]
Smrz, Pavel [1]
Affiliations
[1] Brno Univ Technol, Fac Informat Technol, IT4Innovat Ctr Excellence, Brno, Czech Republic
Funding
European Union Seventh Framework Programme (FP7)
Keywords
Document Analysis; Element Relationships; Logical Document Structure; Page Segmentation; Web Documents
DOI
10.4018/ijcini.2013040102
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many documents on the World Wide Web present structured information that consists of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. Instead, they are expressed through the visual presentation of the content, which a human reader is expected to interpret. In this paper, the authors propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns in the documents. The model describes the visually expressed relationships between individual parts of the content independently of the document format and the particular way of presentation. Therefore, it can be used as a document model in many information retrieval or extraction applications. The authors formally define the model, introduce a method of extracting the relationships between the content parts based on visual presentation analysis, and discuss the expected applications. They also present a new dataset consisting of programmes of conferences and other scientific events and discuss its suitability for the task at hand. Finally, they use the dataset to evaluate the results of the implemented system.
Pages: 13-29
Page count: 17