Header metadata extraction from semi-structured documents using template matching

Cited by: 0

Authors:

Huang, Zewu [1]

Jin, Hai [1]

Yuan, Pingpeng [1]

Han, Zongfen [1]

Affiliations:

[1] Huazhong Univ Sci & Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China

Source:

ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS 2006: OTM 2006 WORKSHOPS, PT 2, PROCEEDINGS | 2006 / Vol. 4278

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for header metadata extraction from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is treated as a sequence of strings with format information. The templates guide a finite state automaton (FSA) to extract the header metadata of papers. The testing results indicate that our approach can extract metadata effectively, without any training cost, and that it is applicable to some special situations. This approach can effectively assist automatic index creation in many fields, such as digital libraries, information retrieval, and data mining.
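The abstract does not give the template format or the automaton's transition rules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each line of the PDF header has already been extracted as text plus a font size, and that a template is an ordered list of states, each with a format test. The names `Line`, `State`, `TEMPLATE`, `SAMPLE`, and `extract_header`, as well as all font-size thresholds and sample values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Line:
    """One 'string with format': text plus a font size (assumed input)."""
    text: str
    font_size: float


@dataclass
class State:
    """One automaton state: the field it collects and its acceptance test."""
    field: str
    accepts: Callable[[Line], bool]


# Hypothetical template for one paper layout; thresholds are made up.
TEMPLATE: List[State] = [
    State("title", lambda ln: ln.font_size >= 14),
    State("authors", lambda ln: 11 <= ln.font_size < 14),
    State(
        "affiliation",
        lambda ln: ln.font_size < 11
        and not ln.text.lower().startswith("abstract"),
    ),
    State("abstract", lambda ln: ln.font_size < 11),
]


def extract_header(lines: List[Line], template: List[State]) -> Dict[str, str]:
    """Walk the lines with a forward-only automaton guided by the template.

    The automaton stays in the current state while it accepts the line,
    moves to the next state when it does not, and stops once every state
    has rejected a line (the header is assumed to be over).
    """
    collected: Dict[str, List[str]] = {s.field: [] for s in template}
    idx = 0
    for ln in lines:
        while idx < len(template) and not template[idx].accepts(ln):
            idx += 1
        if idx == len(template):
            break
        collected[template[idx].field].append(ln.text)
    return {name: " ".join(parts) for name, parts in collected.items()}


# Made-up sample input; real font sizes would come from a PDF parser.
SAMPLE: List[Line] = [
    Line("Header metadata extraction from semi-structured", 16),
    Line("documents using template matching", 16),
    Line("Zewu Huang, Hai Jin, Pingpeng Yuan, Zongfen Han", 12),
    Line("Cluster and Grid Computing Lab, Wuhan 430074", 9),
    Line("Abstract. With the recent proliferation of documents", 9),
    Line("1 Introduction", 12),
    Line("Body text that must not be extracted", 10),
]

if __name__ == "__main__":
    for name, value in extract_header(SAMPLE, TEMPLATE).items():
        print(f"{name}: {value}")
```

Because the rules live in the template rather than in learned parameters, supporting a new layout in this sketch means writing a new list of states, which is consistent with the abstract's claim of no training cost.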

Citation

Pages: 1776 / +

Page count: 3

50 records in total

[21] Towards the automated verification of semi-structured documents
Weitl, Franz
Jaksic, Mirjana
Freitag, Burkhard
DATA & KNOWLEDGE ENGINEERING, 2009, 68 (03) : 292 - 317
[22] Business information extraction from semi-structured webpages
Sung, NH
Chang, YS
EXPERT SYSTEMS WITH APPLICATIONS, 2004, 26 (04) : 575 - 582
[23] Interactive Data Extraction from Semi-Structured Text
Broman, Per
Thalheim, Bernhard
INFORMATION MODELLING AND KNOWLEDGE BASES XXIII, 2012, 237 : 1 - 19
[24] Interactive tuples extraction from semi-structured data
Gilleron, Remi
Marty, Patrick
Tommasi, Marc
Torre, Fabien
2006 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, (WI 2006 MAIN CONFERENCE PROCEEDINGS), 2006, : 997 - 1004
[25] Information extraction from Web pages using semi-structured data alignment
Kuboyama, Tetsuji
Miyahara, Tetsuhiro
Hirokawa, Sachio
Itou, Eisuke
WMSCI 2005: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Vol 1, 2005, : 42 - 47
[26] Using ILP to construct features for information extraction from semi-structured text
Ramakrishnan, Ganesh
Joshi, Sachindra
Balakrishnan, Sreeram
Srinivasan, Ashwin
INDUCTIVE LOGIC PROGRAMMING, 2008, 4894 : 211 - 224
[27] Transformation rules from semi-structured XML documents to database model
Badr, Y
Sayah, M
Laforest, F
Flory, A
ACS/IEEE INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS, PROCEEDINGS, 2001, : 181 - 184
[28] Schema Matching for Semi-structured and Linked Data
Kettouch, Mohamed
Luca, Cristina
Hobbs, Mike
2017 11TH IEEE INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC), 2017, : 270 - 271
[29] Semi-structured document image matching and recognition
Augereau, Olivier
Journet, Nicholas
Domenger, Jean-Philippe
DOCUMENT RECOGNITION AND RETRIEVAL XX, 2013, 8658
[30] Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents
Lipinski, Mario
Yao, Kevin
Breitinger, Corinna
Beel, Joeran
Gipp, Bela
JCDL'13: PROCEEDINGS OF THE 13TH ACM/IEEE-CS JOINT CONFERENCE ON DIGITAL LIBRARIES, 2013, : 385 - 386
