Header metadata extraction from semi-structured documents using template matching

Authors
Huang, Zewu [1]
Jin, Hai [1]
Yuan, Pingpeng [1]
Han, Zongfen [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
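Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract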
With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for header metadata extraction from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is treated as a sequence of strings with format information. The templates guide a finite state automaton (FSA) to extract the header metadata of papers. The testing results indicate that our approach can effectively extract metadata without any training cost and is applicable to some special situations. This approach can effectively assist automatic index creation in many fields, such as digital libraries, information retrieval, and data mining.
Pages: 1776 / +
Number of pages: 3
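
The abstract describes templates that guide a finite state automaton over "strings with format". The record gives no implementation details, so the following is only a minimal sketch of that general idea, not the authors' method. The `Line` and `FieldRule` types, the font-size thresholds, the regular expressions, the `Abstract` stop marker, and the sample font sizes are all hypothetical assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """One text line with format information (font size is an assumed feature)."""
    text: str
    font_size: float


@dataclass(frozen=True)
class FieldRule:
    """One template entry; each entry acts as one state of the automaton."""
    name: str
    pattern: str
    min_font: float = 0.0
    max_font: float = float("inf")
    repeat: bool = False  # the state may consume several consecutive lines

    def accepts(self, line: Line) -> bool:
        in_range = self.min_font <= line.font_size <= self.max_font
        return in_range and re.search(self.pattern, line.text) is not None


# Hypothetical template: states are visited strictly in this order.
TEMPLATE = [
    FieldRule("title", r"\S", min_font=14.0, repeat=True),
    FieldRule("authors", r"^[A-Z][^0-9@]+$", min_font=11.5, max_font=13.5),
    FieldRule("affiliation", r"\S", max_font=10.5, repeat=True),
]


def extract_header(lines, template, stop_pattern=r"^Abstract\b"):
    """Run the template-guided automaton and return one string per field."""
    collected = {rule.name: [] for rule in template}
    state = -1  # start state: no field matched yet
    for line in lines:
        if re.search(stop_pattern, line.text):
            break  # assumed end-of-header marker
        if state >= 0 and template[state].repeat and template[state].accepts(line):
            collected[template[state].name].append(line.text)  # stay in state
        elif state + 1 < len(template) and template[state + 1].accepts(line):
            state += 1  # transition to the next field of the template
            collected[template[state].name].append(line.text)
        # lines that fit neither the current nor the next state are skipped
    return {name: " ".join(parts) for name, parts in collected.items()}


# Sample input built from this record's own fields; font sizes are invented.
SAMPLE_LINES = [
    Line("Header metadata extraction from semi-structured", 16.0),
    Line("documents using template matching", 16.0),
    Line("Zewu Huang, Hai Jin, Pingpeng Yuan, Zongfen Han", 12.0),
    Line("Huazhong Univ Sci & Technol, Cluster & Grid Comp Lab", 10.0),
    Line("Wuhan 430074, Peoples R China", 10.0),
    Line("Abstract", 11.0),
    Line("With the recent proliferation of documents, ...", 10.0),
]

header = extract_header(SAMPLE_LINES, TEMPLATE)
```

In this sketch the template needs no training data, which matches the abstract's claim of zero training cost; supporting a different layout would mean writing another template rather than retraining a model.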