Extracting Visually Presented Element Relationships from Web Documents

Authors
Burget, Radek [1]
Smrz, Pavel [1]
Affiliations
[1] Brno Univ Technol, Fac Informat Technol, IT4Innovat Ctr Excellence, Brno, Czech Republic
Funding
European Union Seventh Framework Programme (FP7)
Keywords
Document Analysis; Element Relationships; Logical Document Structure; Page Segmentation; Web Documents
DOI
10.4018/ijcini.2013040102
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many documents on the World Wide Web present structured information that consists of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. Instead, they are expressed by the visual presentation of the content, which a human reader is expected to interpret. In this paper, the authors propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns in the documents. The model describes the visually expressed relationships between individual parts of the content independently of the document format and the particular way of presentation. Therefore, it can be used as a document model in many information retrieval or extraction applications. The authors formally define the model, introduce a method for extracting the relationships between content parts based on visual presentation analysis, and discuss the expected applications. They also present a new dataset consisting of programmes of conferences and other scientific events and discuss its suitability for the task at hand. Finally, they use the dataset to evaluate the results of the implemented system.
Pages: 13-29
Page count: 17