Digitization of Text Documents Using PDF/A

Cited by: 2

Authors:

Han, Yan [1]

Wan, Xueheng [2]

Affiliations:

[1] Univ Arizona Lib, Tucson, AZ 85721 USA

[2] Univ Arizona, Dept Comp Sci, Tucson, AZ 85721 USA

Source:

INFORMATION TECHNOLOGY AND LIBRARIES | 2018 / Vol. 37 / No. 01

DOI:

10.6017/ITAL.V37I1.9878

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

The purpose of this article is to demonstrate a practical use case of PDF/A for digitization of text documents, following FADGI's recommendation of PDF/A as a preferred digitization file format. The authors demonstrate how to convert and combine TIFFs with associated metadata into a single PDF/A-2b file for a document. Using real-life examples and open source software, the authors show readers how to convert TIFF images, extract associated metadata and International Color Consortium (ICC) profiles, and validate the resulting file with the newly released PDF/A validator. The generated PDF/A file is a self-contained and self-described container that accommodates all the data from digitization of textual materials, including page-level metadata and ICC profiles. Providing theoretical analysis and empirical examples, the authors show that PDF/A has many advantages over the traditionally preferred file formats, TIFF and JPEG2000, for digitization of text documents.
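
The workflow summarized in the abstract (combine page TIFFs, convert to PDF/A-2b, validate) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' actual procedure: the abstract does not name the tools, so the choice of ImageMagick (`magick`), Ghostscript (`gs`), and veraPDF (`verapdf`) is an assumption, as are the helper names `page_order` and `build_commands` and the file names. The sketch only builds the command lines; it does not embed page-level metadata, and `PDFA_def.ps` (Ghostscript's PDF/A definition file) is assumed to have been edited to point at a suitable ICC profile.

```python
import re
from pathlib import Path


def page_order(tiff_paths):
    """Sort page images by the last number in the file stem (scan_2 before scan_10)."""
    def key(p):
        nums = re.findall(r"\d+", Path(p).stem)
        return (int(nums[-1]) if nums else 0, str(p))
    return sorted(tiff_paths, key=key)


def build_commands(tiff_paths, out_pdf, pdfa_def="PDFA_def.ps"):
    """Return three command lines: combine TIFFs, convert to PDF/A-2b, validate.

    Hypothetical helper: the tools are assumed choices, not taken from the article.
    """
    pages = [str(p) for p in page_order(tiff_paths)]
    out = Path(out_pdf)
    interim = str(out.with_name(out.stem + "_interim.pdf"))

    # Step 1 (assumed tool: ImageMagick 7): merge page images into one PDF.
    combine = ["magick", *pages, interim]

    # Step 2 (assumed tool: Ghostscript): rewrite the PDF as PDF/A-2.
    to_pdfa = [
        "gs",
        "-dPDFA=2",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=RGB",
        "-dPDFACompatibilityPolicy=1",
        f"-sOutputFile={out}",
        pdfa_def,
        interim,
    ]

    # Step 3 (assumed tool: veraPDF): check conformance to PDF/A-2b.
    validate = ["verapdf", "--flavour", "2b", str(out)]

    return [combine, to_pdfa, validate]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order, provided the three tools are installed; depending on the Ghostscript version, additional file-access options may be needed so that it can read the ICC profile.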

Pages: 52-64

Page count: 13
