Header metadata extraction from semi-structured documents using template matching

Authors
Huang, Zewu [1]
Jin, Hai [1]
Yuan, Pingpeng [1]
Han, Zongfen [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for header metadata extraction from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is treated as a sequence of formatted strings. The templates guide a finite state automaton (FSA) to extract the header metadata of papers. The testing results indicate that our approach can extract metadata effectively, without any training cost, and is applicable to some special situations. This approach can effectively assist automatic index creation in many fields, such as digital libraries, information retrieval, and data mining.
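The abstract gives no implementation details, so the following is only a minimal sketch of how a template might guide a finite state automaton over formatted strings. Every name in it (`Line`, `Field`, `TEMPLATE`, `extract_header`), the font-size thresholds, and the "Abstract" stop marker are assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Line:
    """One string with format, e.g. as recovered from a PDF page (hypothetical)."""
    text: str
    font_size: float
    bold: bool = False


@dataclass
class Field:
    """One template state: a field name plus a predicate on line format."""
    name: str
    accepts: Callable[[Line], bool]


# Hypothetical template: the thresholds are illustrative assumptions.
TEMPLATE: List[Field] = [
    Field("title", lambda ln: ln.font_size >= 16),
    Field("authors", lambda ln: 11 <= ln.font_size < 16),
    Field("affiliation", lambda ln: ln.font_size < 11),
]


def is_header_end(ln: Line) -> bool:
    """Assumed stop condition: the header ends where the abstract begins."""
    return ln.text.strip().lower().startswith("abstract")


def extract_header(lines: List[Line], template: List[Field] = TEMPLATE) -> Dict[str, str]:
    """Walk the lines with a forward-only automaton whose states are template fields."""
    collected: Dict[str, List[str]] = {f.name: [] for f in template}
    state = 0
    for ln in lines:
        if is_header_end(ln):
            break
        # Stay in the current state if it accepts the line; otherwise
        # advance to the first later state that does.
        nxt = state
        while nxt < len(template) and not template[nxt].accepts(ln):
            nxt += 1
        if nxt == len(template):
            continue  # no remaining state accepts this line; skip it
        state = nxt
        collected[template[state].name].append(ln.text)
    return {name: " ".join(parts) for name, parts in collected.items()}
```

Because the automaton only moves forward through the template, a field that has been passed is never revisited. That fits a fixed header layout; a different layout would need its own template, and no training step is involved.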
Pages: 1776 / +
Page count: 3