Header metadata extraction from semi-structured documents using template matching

Authors
Huang, Zewu [1]
Jin, Hai [1]
Yuan, Pingpeng [1]
Han, Zongfen [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for header metadata extraction from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is treated as a sequence of strings with formatting. Templates are used to guide a finite state automaton (FSA) to extract the header metadata of papers. The testing results indicate that our approach can extract metadata effectively, without any training cost, and is applicable to some special situations. This approach can effectively assist automatic index creation in many fields, such as digital libraries, information retrieval, and data mining.
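The abstract does not give the template format or the automaton's transition rules, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes a hypothetical `Line` record (text plus font size and boldness), a hypothetical `TEMPLATE` listing one formatting predicate per header field in reading order, and invented font sizes for the sample page. Each template entry acts as one FSA state; the automaton may only stay in the current state or move forward, and it stops at the first line that no remaining state accepts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Line:
    """A string with format: hypothetical stand-in for one PDF text line."""
    text: str
    font_size: float
    bold: bool = False


# Hypothetical template: (field name, predicate on the line's formatting),
# listed in the order the fields are expected to appear in the header.
Template = List[Tuple[str, Callable[[Line], bool]]]

TEMPLATE: Template = [
    ("title", lambda ln: ln.font_size >= 14),
    ("authors", lambda ln: 10 <= ln.font_size < 14 and not ln.bold),
    ("affiliation", lambda ln: ln.font_size < 10),
]


def extract_header(lines: List[Line], template: Template) -> Dict[str, str]:
    """Run a forward-only FSA with one state per template field."""
    collected: Dict[str, List[str]] = {name: [] for name, _ in template}
    state = 0
    for ln in lines:
        # Find the first state, at or after the current one, accepting this line.
        nxt = state
        while nxt < len(template) and not template[nxt][1](ln):
            nxt += 1
        if nxt == len(template):
            break  # no remaining state accepts the line: the header has ended
        state = nxt
        collected[template[state][0]].append(ln.text)
    return {name: " ".join(parts) for name, parts in collected.items()}


# Sample page with invented formatting values (assumptions for illustration).
SAMPLE = [
    Line("Header metadata extraction from semi-structured documents", 16, bold=True),
    Line("using template matching", 16, bold=True),
    Line("Zewu Huang, Hai Jin, Pingpeng Yuan, Zongfen Han", 11),
    Line("Cluster and Grid Computing Lab, Wuhan 430074", 9),
    Line("Abstract", 10, bold=True),
    Line("With the recent proliferation of documents ...", 10),
]

header = extract_header(SAMPLE, TEMPLATE)
for field, value in header.items():
    print(f"{field}: {value}")
```

Because the rules live in the template rather than in learned parameters, supporting a different layout in this sketch means writing another template, which is consistent with the abstract's claim of no training cost.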
Pages: 1776 / +
Page count: 3