SEMI-STRUCTURED DOCUMENT EXTRACTION BASED ON DOCUMENT ELEMENT BLOCK MODEL

Cited by: 0
Authors
Lv, Tao [1, 4]
Liu, Jiang [1, 4]
Lu, Fan [2]
Zhang, Peng [2]
Wang, Xinyan [3]
Wang, Cong [1, 4]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Software Engn, Beijing 100876, Peoples R China
[2] Minist Sci & Technol, Beijing 100862, Peoples R China
[3] Air Force Gen Hosp, Beijing 100142, Peoples R China
[4] Beijing Univ Posts & Telecommun, Key Lab Trustworthy Distributed Comp & Serv, Beijing 100876, Peoples R China
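Keywords
Semi-structured document; Document extraction; Regular expression
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract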
Enterprises and institutions continually produce large numbers of business-specific documents in their daily work. To obtain useful information from these semi-structured documents, we propose the document element block model (DEBM) and apply it to semi-structured document extraction. The model makes full use of the information contained in a document: not only its structural information but also its content. DEBM extracts document element blocks from template documents and target documents, then generates corresponding regular expression rules based on the characteristics of the template document's element blocks. It then processes each type of document element by regular expression matching, extracting the elements from a set of blocks according to the position of the corresponding element block. Experiments show that extraction based on DEBM achieves good results and performs better than traditional regular expressions and template matching. The results indicate that DEBM is a simple, efficient model for extracting information from semi-structured documents.
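The abstract gives no implementation details, so the following is only an illustrative sketch of the general idea (template blocks turned into regular expression rules, then matched against target blocks at the same position), not the authors' DEBM. It assumes that a block is one non-empty line and that template fields are marked as `{field}`; the function names `split_blocks`, `rule_from_template_block`, and `extract` are hypothetical.

```python
import re


def split_blocks(text):
    """Split a document into blocks. Assumption: one non-empty line = one block."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rule_from_template_block(block):
    """Build a regex from a template block such as 'Customer: {customer}'.

    Literal text is escaped; each {field} marker (a hypothetical convention)
    becomes a named capture group.
    """
    parts = re.split(r"\{(\w+)\}", block)  # literal, field, literal, field, ...
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)
        else:
            pattern += f"(?P<{part}>.+?)"
    return re.compile("^" + pattern + "$")


def extract(template_text, target_text):
    """Match each target block against the rule from the template block at the same position."""
    template_blocks = split_blocks(template_text)
    target_blocks = split_blocks(target_text)
    record = {}
    for position, template_block in enumerate(template_blocks):
        if position >= len(target_blocks):
            break
        match = rule_from_template_block(template_block).match(target_blocks[position])
        if match:
            record.update(match.groupdict())
    return record


TEMPLATE = """
Invoice No: {invoice_no}
Customer: {customer}
Total: {total} USD
"""

TARGET = """
Invoice No: 2024-001
Customer: Acme Ltd
Total: 125.50 USD
"""

result = extract(TEMPLATE, TARGET)
print(result)
```

Pairing blocks by position keeps each rule short and local, which is one plausible reading of "according to the corresponding element block position"; a real system would need a more robust way to align blocks when the target document deviates from the template.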
Pages: 461-465
Page count: 5