Digitizing and parsing semi-structured historical administrative documents from the GI Bill mortgage guarantee program

Authors
Lafia, Sara [1]
Bleckley, David A. [1]
Alexander, J. Trent [1]
Affiliations
[1] Univ Michigan, ICPSR, Ann Arbor, MI 48106 USA
Keywords
Archives; Digital libraries; Document image processing; Records management; Named entity recognition; Social sciences; Optical character recognition
DOI
10.1108/JD-03-2023-0055
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose: Many libraries and archives maintain collections of research documents, such as administrative records, with paper-based formats that limit the documents' access to in-person use. Digitization transforms paper-based collections into more accessible and analyzable formats. As collections are digitized, there is an opportunity to incorporate deep learning techniques, such as Document Image Analysis (DIA), into workflows to increase the usability of information extracted from archival documents. This paper describes the authors' approach using digital scanning, optical character recognition (OCR) and deep learning to create a digital archive of administrative records related to the mortgage guarantee program of the Servicemen's Readjustment Act of 1944, also known as the G.I. Bill.

Design/methodology/approach: The authors used a collection of 25,744 semi-structured paper-based records from the administration of G.I. Bill Mortgages from 1946 to 1954 to develop a digitization and processing workflow. These records include the name and city of the mortgagor, the amount of the mortgage, the location of the Reconstruction Finance Corporation agent, one or more identification numbers and the name and location of the bank handling the loan. The authors extracted structured information from these scanned historical records in order to create a tabular data file and link them to other authoritative individual-level data sources.

Findings: The authors compared the flexible character accuracy of five OCR methods. The authors then compared the character error rate (CER) of three text extraction approaches (regular expressions, DIA and named entity recognition (NER)). The authors were able to obtain the highest quality structured text output using DIA with the Layout Parser toolkit by post-processing with regular expressions. Through this project, the authors demonstrate how DIA can improve the digitization of administrative records to automatically produce a structured data resource for researchers and the public.

Originality/value: The authors' workflow is readily transferable to other archival digitization projects. Through the use of digital scanning, OCR and DIA processes, the authors created the first digital microdata file of administrative records related to the G.I. Bill mortgage guarantee program available to researchers and the general public. These records offer research insights into the lives of veterans who benefited from loans, the impacts on the communities built by the loans and the institutions that implemented them.
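
The abstract names two techniques that a short example may make more concrete: the character error rate (CER) used to compare extraction approaches, and regular-expression post-processing of extracted text. The Python sketch below is illustrative only and is not the authors' code. It uses the standard definition of CER (Levenshtein edit distance divided by the number of characters in the reference transcription); the sample record text, the `AMOUNT` pattern and the `extract_amount` helper are hypothetical and do not reflect the actual layout of the G.I. Bill records.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical pattern for a dollar amount such as "$ 7,500.00";
# the real records may format amounts differently.
AMOUNT = re.compile(r"\$\s*([\d,]+(?:\.\d{2})?)")


def extract_amount(text: str):
    """Return the first dollar amount in text with commas removed, or None."""
    match = AMOUNT.search(text)
    return match.group(1).replace(",", "") if match else None


if __name__ == "__main__":
    truth = "MORTGAGE"   # hand-checked transcription (made-up example)
    ocr = "M0RTGAGE"     # OCR output with one substituted character
    print(cer(truth, ocr))
    print(extract_amount("Amount: $ 7,500.00"))
```

A lower CER means the extracted text is closer to the hand-checked transcription, so the metric can rank extraction approaches on the same sample of records.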
Pages: 225-239
Page count: 15