SEMI-STRUCTURED DOCUMENT EXTRACTION BASED ON DOCUMENT ELEMENT BLOCK MODEL

Cited by: 0
Authors
Lv, Tao [1, 4]
Liu, Jiang [1, 4]
Lu, Fan [2]
Zhang, Peng [2]
Wang, Xinyan [3]
Wang, Cong [1, 4]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Software Engn, Beijing 100876, Peoples R China
[2] Minist Sci & Technol, Beijing 100862, Peoples R China
[3] Air Force Gen Hosp, Beijing 100142, Peoples R China
[4] Beijing Univ Posts & Telecommun, Key Lab Trustworthy Distributed Comp & Serv, Beijing 100876, Peoples R China
Keywords
Semi-structured document; Document extraction; Regular expression
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Enterprises and institutions continually produce large numbers of documents related to their specific business in the course of daily work. To extract useful information from these semi-structured documents, we propose the document element block model (DEBM) and apply it to semi-structured document extraction. The model makes full use of the information contained in a document: not only its structure but also its content. DEBM extracts document element blocks from both template documents and target documents, then generates regular expression rules from the characteristics of the template document's element blocks. For each type of element block, it then extracts the document elements from the target document by regular expression matching at the corresponding block position. Experiments show that DEBM-based extraction achieves good results and outperforms both traditional regular expressions and template matching. The results indicate that DEBM is a simple and efficient model for extracting information from semi-structured documents.
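The pipeline outlined in the abstract (split documents into element blocks, derive regular expression rules from the template's blocks, then match each rule at the corresponding block position in the target) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes, purely for illustration, that a block is a run of non-blank lines and that template fields are written as `{field}` placeholders; the names `split_blocks`, `block_to_regex`, and `extract` are hypothetical.

```python
import re

# Assumption: template fields are marked as {field_name}; the abstract does not
# specify how the template encodes the elements to extract.
PLACEHOLDER = re.compile(r"\{([A-Za-z_]\w*)\}")


def split_blocks(text):
    """Split a document into element blocks (assumed here: runs of non-blank lines)."""
    return [b.strip() for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]


def block_to_regex(template_block):
    """Build a regex rule from one template block.

    Literal text is escaped; each {field} placeholder becomes a named group.
    """
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template_block):
        parts.append(re.escape(template_block[pos:m.start()]))
        parts.append(f"(?P<{m.group(1)}>.+?)")
        pos = m.end()
    parts.append(re.escape(template_block[pos:]))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)


def extract(template, target):
    """Match each template rule against the target block at the same position."""
    rules = [block_to_regex(b) for b in split_blocks(template)]
    fields = {}
    for rule, block in zip(rules, split_blocks(target)):
        m = rule.match(block)
        if m:  # blocks that do not fit the rule are skipped
            fields.update(m.groupdict())
    return fields


# Hypothetical sample documents, invented for this sketch.
TEMPLATE = "Title: {title}\n\nApplicant: {applicant}\nDate: {date}"
TARGET = "Title: Equipment request\n\nApplicant: Li Wei\nDate: 2015-03-02"

result = extract(TEMPLATE, TARGET)
```

Pairing rules with blocks by position keeps each rule short and local to one block, which is one plausible reading of why block-level rules might be more robust than a single regular expression written for the whole document.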
Pages: 461-465
Number of pages: 5
Related papers
50 in total
  • [22] Incremental Discovery of Sequential Pattern from Semi-structured Document Using Grammatical Inference
    Thakur, Ramesh
    Jain, Suresh
    Chaudhari, Narendra S.
    DISTRIBUTED COMPUTING AND INTERNET TECHNOLOGY, 2012, 7154 : 269 - 269
  • [23] Self-paced Compensatory Deep Boltzmann Machine for Semi-Structured Document Embedding
    Li, Shuangyin
    Pan, Rong
    Yan, Jun
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 2187 - 2193
  • [24] Chinese resume information extraction based on semi-structured text
    Wentan, Yan
    Yupeng, Qiao
    Chinese Control Conference, CCC, 2017, : 11177 - 11182
  • [25] Chinese resume information extraction based on semi-structured text
    Yan Wentan
    Qiao Yupeng
    PROCEEDINGS OF THE 36TH CHINESE CONTROL CONFERENCE (CCC 2017), 2017, : 11177 - 11182
  • [27] Information Extraction of Strategic Activities based on Semi-structured Text
    Ma, Xubu
    Guo, Ju-E
    Ma, Xubu
    2014 SEVENTH INTERNATIONAL JOINT CONFERENCE ON COMPUTATIONAL SCIENCES AND OPTIMIZATION (CSO), 2014, : 579 - 583
  • [28] Automatic Content Extraction on Semi-Structured Documents
    dos Santos, Jose Eduardo Bastos
    11TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR 2011), 2011, : 1235 - 1239
  • [29] Dynamic element retrieval in a semi-structured collection
    Crouch, Carolyn J.
    Crouch, Donald B.
    Ganapathibhotla, Murthy
    Bakshi, Vishal
    COMPARATIVE EVALUATION OF XML INFORMATION RETRIEVAL SYSTEMS, 2007, 4518 : 82 - 88
  • [30] Low-Dimensionality Information Extraction Model for Semi-structured Documents
    Belhadj, Djedjiga
    Belaid, Abdel
    Belaid, Yolande
    COMPUTER ANALYSIS OF IMAGES AND PATTERNS, CAIP 2023, PT I, 2023, 14184 : 76 - 85