Digitization of Text Documents Using PDF/A

Cited by: 2
Authors
Han, Yan [1]
Wan, Xueheng [2]
Affiliations
[1] Univ Arizona Lib, Tucson, AZ 85721 USA
[2] Univ Arizona, Dept Comp Sci, Tucson, AZ 85721 USA
DOI
10.6017/ITAL.V37I1.9878
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The purpose of this article is to demonstrate a practical use case of PDF/A for digitization of text documents following FADGI's recommendation of using PDF/A as a preferred digitization file format. The authors demonstrate how to convert and combine TIFFs with associated metadata into a single PDF/A-2b file for a document. Using real-life examples and open source software, the authors show readers how to convert TIFF images, extract associated metadata and International Color Consortium (ICC) profiles, and validate against the newly released PDF/A validator. The generated PDF/A file is a self-contained and self-described container that accommodates all the data from digitization of textual materials, including page-level metadata and ICC profiles. Providing theoretical analysis and empirical examples, the authors show that PDF/A has many advantages over the traditionally preferred file format, TIFF/JPEG2000, for digitization of text documents.
Pages: 52-64
Page count: 13