Extracting Visually Presented Element Relationships from Web Documents

Authors
Burget, Radek [1]
Smrz, Pavel [1]
Affiliations
[1] Brno Univ Technol, Fac Informat Technol, IT4Innovat Ctr Excellence, Brno, Czech Republic
Funding
European Union Seventh Framework Programme (FP7)
Keywords
Document Analysis; Element Relationships; Logical Document Structure; Page Segmentation; Web Documents;
DOI
10.4018/ijcini.2013040102
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many documents on the World Wide Web present structured information that consists of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. Instead, they are expressed by the visual presentation of the content, which a human reader is expected to interpret. In this paper, the authors propose a formal generic model of logical relationships in a document, based on an interpretation of visual presentation patterns in documents. The model describes the visually expressed relationships between individual parts of the content independently of the document format and the particular way of presentation. Therefore, it can be used as a document model in many information retrieval or extraction applications. The authors formally define the model, introduce a method of extracting the relationships between the content parts based on visual presentation analysis, and discuss the expected applications. They also present a new dataset consisting of programmes of conferences and other scientific events and discuss its suitability for the task at hand. Finally, they use the dataset to evaluate the results of the implemented system.
Pages: 13-29
Number of pages: 17