Header metadata extraction from semi-structured documents using template matching

Cited by: 0

Authors:

Huang, Zewu [1]

Jin, Hai [1]

Yuan, Pingpeng [1]

Han, Zongfen [1]

Affiliation:

[1] Huazhong Univ Sci & Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China

Source:

ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS 2006: OTM 2006 WORKSHOPS, PT 2, PROCEEDINGS | 2006 / Vol. 4278

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for header metadata extraction from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is treated as a sequence of formatted strings. The templates guide a finite state automaton (FSA) that extracts the header metadata of papers. The testing results indicate that our approach can extract metadata effectively, without any training cost, and that it is applicable to some special situations. This approach can effectively assist automatic index creation in many fields, such as digital libraries, information retrieval, and data mining.
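The idea of a template guiding a finite state automaton over formatted strings can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Line` type, the field names, the font-size thresholds in `HYPOTHETICAL_TEMPLATE`, and the sample layout values are all assumptions made for illustration, since the abstract does not specify how templates or format attributes are encoded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Line:
    """One 'string with format': text plus assumed layout attributes."""
    text: str
    font_size: float
    bold: bool = False


# A template is an ordered list of (field name, predicate) pairs.
# Each pair plays the role of one automaton state.
Template = List[Tuple[str, Callable[[Line], bool]]]

# Hypothetical template; the thresholds are illustrative, not from the paper.
HYPOTHETICAL_TEMPLATE: Template = [
    ("title", lambda ln: ln.font_size >= 14),
    ("authors", lambda ln: 10 <= ln.font_size < 14 and "," in ln.text),
    ("affiliation", lambda ln: ln.font_size < 10
        and not ln.text.lower().startswith("abstract")),
    ("abstract", lambda ln: ln.text.lower().startswith("abstract")),
]


def extract_header(lines: List[Line], template: Template) -> Dict[str, str]:
    """Walk the lines once, moving forward through the template states."""
    collected: Dict[str, List[str]] = {}
    state = 0  # index of the current template state
    for ln in lines:
        # Try the current state first, then any later state (forward only).
        for nxt in range(state, len(template)):
            name, accepts = template[nxt]
            if accepts(ln):
                state = nxt
                collected.setdefault(name, []).append(ln.text)
                break
        # A line accepted by no reachable state is skipped.
    return {name: " ".join(parts) for name, parts in collected.items()}


# Sample input with made-up font sizes for a paper's first page.
sample = [
    Line("Header metadata extraction from semi-structured documents", 16, True),
    Line("using template matching", 16, True),
    Line("Huang, Zewu; Jin, Hai; Yuan, Pingpeng; Han, Zongfen", 11),
    Line("Cluster & Grid Comp Lab, Wuhan 430074", 9),
    Line("Abstract. With the recent proliferation of documents", 9, True),
]
result = extract_header(sample, HYPOTHETICAL_TEMPLATE)
```

Because the states are only ever visited in template order, a title-sized line that appears after the authors would be ignored rather than misfiled; supporting a different layout in this sketch means writing a different template, with no training step involved.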


Pages: 1776 / +

Page count: 3

