Header metadata extraction from semi-structured documents using template matching

Authors
Huang, Zewu [1]
Jin, Hai [1]
Yuan, Pingpeng [1]
Han, Zongfen [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for header metadata extraction from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is treated as a sequence of strings with format information. The templates guide a finite state automaton (FSA) that extracts the header metadata of papers. The testing results indicate that our approach can extract metadata effectively, without any training cost, and is applicable to some special situations. This approach can effectively assist automatic index creation in many fields, such as digital libraries, information retrieval, and data mining.
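The abstract gives no implementation details, so the following is only a minimal sketch of how a template could guide a finite state automaton over formatted strings. Everything in it is an assumption made for illustration: the `Line` type, the `TEMPLATE` predicates, the font-size thresholds, the `extract_header` function, and the sample font sizes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Line:
    """One text line of the first page plus its format (hypothetical model)."""
    text: str
    font_size: float


# Hypothetical template: an ordered list of (field name, format predicate).
# Each entry is one FSA state; the thresholds are invented for this sketch.
Template = List[Tuple[str, Callable[[Line], bool]]]

TEMPLATE: Template = [
    ("title", lambda ln: ln.font_size >= 14),
    ("authors", lambda ln: 11 <= ln.font_size < 14),
    ("affiliation", lambda ln: ln.font_size < 11),
]


def extract_header(lines: List[Line], template: Template,
                   stop_word: str = "abstract") -> Dict[str, str]:
    """Walk the lines once, moving only forward through the template states."""
    fields: Dict[str, List[str]] = {name: [] for name, _ in template}
    state = 0
    for ln in lines:
        if ln.text.strip().lower().startswith(stop_word):
            break  # the header ends where the abstract begins
        # Stay in the current state, or advance to the first later state
        # whose predicate accepts this line. Unmatched lines are skipped.
        for nxt in range(state, len(template)):
            name, accepts = template[nxt]
            if accepts(ln):
                fields[name].append(ln.text.strip())
                state = nxt
                break
    return {name: " ".join(parts) for name, parts in fields.items()}


# Sample input: the text comes from this record, the font sizes are made up.
SAMPLE = [
    Line("Header metadata extraction from semi-structured", 16),
    Line("documents using template matching", 16),
    Line("Zewu Huang, Hai Jin, Pingpeng Yuan, Zongfen Han", 12),
    Line("Huazhong Univ Sci & Technol, Cluster & Grid Comp Lab", 9),
    Line("Abstract", 10),
    Line("With the recent proliferation of documents, ...", 10),
]

result = extract_header(SAMPLE, TEMPLATE)
for field_name, value in result.items():
    print(f"{field_name}: {value}")
```

Because the states are ordered and the automaton never moves backward, a line that looks like a title but appears after the author block is not mistaken for part of the title. Supporting a different layout would mean writing a new template rather than retraining a model, which is one plausible reading of the abstract's "without any training cost" claim.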
Pages: 1776 / +
Page count: 3