Fiscal data in text: Information extraction from audit reports using Natural Language Processing

Authors
Beltran, Alejandro [1]
Affiliation
[1] Alan Turing Inst, London, England
Source
DATA & POLICY | 2023 / Vol. 5
Keywords
auditing; corruption; natural language processing; subnational governments; text-as-data; CORRUPTION; MALFEASANCE;
DOI
10.1017/dap.2023.4
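Chinese Library Classification
C93 [Management science]; D035 [National administration]; D523 [Administrative management]; D63 [National administration];
Discipline classification codes
12 ; 1201 ; 1202 ; 120202 ; 1204 ; 120401 ;
Abstract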
Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies and missing resources, and may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.
Policy Significance Statement: Annual audits by supreme audit institutions produce important information on the health and accuracy of governmental budgets. These reports include the monetary value of discrepancies, missing funds, and corrupt actions. This paper offers a strategy for collecting that information from historical audit reports and creating a database on budgetary discrepancies. It uses machine learning and natural language processing to accelerate and scale the collection of data to thousands of paragraphs. The granularity of the budgetary data obtained through this approach is useful to reformers and policymakers who require detailed data on municipal finances. This approach can also be applied to other countries that publish audit reports in PDF documents across different languages and contexts.
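The final extraction step described in the abstract can be illustrated with a minimal sketch. The paper trains a named entity recognizer for this step; the code below substitutes a plain regular expression as a simplified stand-in, so it is not the author's method. The pattern, the function name `extract_amounts`, and the sample paragraph are all assumptions made for illustration, and the sketch assumes amounts are written with a `$` sign, comma thousands separators, and an optional two-digit decimal part.

```python
import re
from decimal import Decimal

# Hypothetical pattern (an assumption, not taken from the paper):
# a "$" sign, an optional space, digits with optional comma thousands
# separators, and an optional two-digit decimal part.
AMOUNT_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{2}))?")


def extract_amounts(paragraph: str) -> list[Decimal]:
    """Return every monetary amount found in an OCR'd paragraph."""
    amounts = []
    for match in AMOUNT_RE.finditer(paragraph):
        whole = match.group(1).replace(",", "")  # drop thousands separators
        cents = match.group(2) or "00"           # default to zero cents
        amounts.append(Decimal(f"{whole}.{cents}"))
    return amounts


# Invented sample paragraph, standing in for text produced by OCR.
sample = (
    "Se detectó una diferencia de $1,234,567.89 y un faltante "
    "de $ 45,000.00 en el ejercicio."
)
amounts = extract_amounts(sample)
print(amounts)
```

A regular expression like this only handles amounts that follow the assumed format; a trained recognizer, as the abstract describes, is presumably better suited to noisy OCR output and to distinguishing discrepancy amounts from other figures in the same paragraph.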
Pages: 13
Related papers
50 in total
  • [31] Extraction of Adverse Event Severity Information from Clinical Narratives Using Natural Language Processing
    Jacobsson, Rebecka
    Bergvall, Tomas
    Sandberg, Lovisa
    Ellenius, Johan
    PHARMACOEPIDEMIOLOGY AND DRUG SAFETY, 2017, 26 : 37 - 37
  • [32] A Toolkit for Text Extraction and Analysis for Natural Language Processing Tasks
    Sefara, Tshephisho Joseph
    Mbooi, Mahlatse
    Mashile, Katlego
    Rambuda, Thompho
    Rangata, Mapitsi
    5TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, BIG DATA, COMPUTING AND DATA COMMUNICATION SYSTEMS (ICABCD2022), 2022,
  • [33] Web Document Text and Images Extraction using DOM Analysis and Natural Language Processing
    Joshi, Parag Mulendra
    Liu, Sam
    DOCENG'09: PROCEEDINGS OF THE 2009 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, 2009, : 218 - 221
  • [34] Using Natural Language Processing for Extracting Information from Portable Chest X-Ray Reports
    Wang, D. Y.
    Hwang, T. S.
    Rubin, D.
    Chambers, J.
    South, B. R.
    Goldstein, M. K.
    JOURNAL OF THE AMERICAN GERIATRICS SOCIETY, 2013, 61 : S103 - S103
  • [35] Negation and uncertainty information extraction oriented to natural language text
    Zou B.-W.
    Qian Z.
    Chen Z.-C.
    Zhu Q.-M.
    Zhou G.-D.
    Ruan Jian Xue Bao/Journal of Software, 2016, 27 (02): : 309 - 328
  • [36] Natural language processing in urology: Automated extraction of clinical information from histopathology reports of uro-oncology procedures
    Huang, Honghong
    Lim, Fiona Xin Yi
    Gu, Gary Tianyu
    Han, Matthew Jiangchou
    Fang, Andrew Hao Sen
    Chia, Elian Hui San
    Bei, Eileen Yen Tze
    Tham, Sarah Zhuling
    Ho, Henry Sun Sien
    Yuen, John Shyi Peng
    Sun, Aixin
    Lim, Jay Kheng Sit
    HELIYON, 2023, 9 (04)
  • [37] Data Extraction from Natural Language using Universal Networking Language
    Saha, Aloke Kumar
    Mridha, M. F.
    Rafiq, Jahir Ibna
    Das, Jugal Krishna
    2017 INTERNATIONAL CONFERENCE ON CURRENT TRENDS IN COMPUTER, ELECTRICAL, ELECTRONICS AND COMMUNICATION (CTCEEC), 2017, : 1228 - 1232
  • [38] Automatic Extraction of Engineering Rules From Unstructured Text: A Natural Language Processing Approach
    Ye, Xinfeng
    Lu, Yuqian
    JOURNAL OF COMPUTING AND INFORMATION SCIENCE IN ENGINEERING, 2020, 20 (03)
  • [39] Automatic Diagnosis Labeling of Cardiovascular MRI by Using Semisupervised Natural Language Processing of Text Reports
    Zaman, Sameer
    Petri, Camille
    Vimalesvaran, Kavitha
    Howard, James
    Bharath, Anil
    Francis, Darrel
    Peters, Nicholas
    Cole, Graham D.
    Linton, Nick
    RADIOLOGY-ARTIFICIAL INTELLIGENCE, 2022, 4 (01)
  • [40] Clinical Text Reports to Stratify Patients Affected with Myeloid Neoplasms Using Natural Language Processing
    Asti, Gianluca
    Sauta, Elisabetta
    Curti, Nico
    Carlini, Gianluca
    Dall'Olio, Lorenzo
    Lanino, Luca
    Maggioni, Giulia
    Campagna, Alessia
    Ubezio, Marta
    Russo, Antonio
    Todisco, Gabriele
    Tentori, Cristina Astrid
    Morandini, Pierandrea
    Bicchieri, Marilena
    Grondelli, Maria Chiara
    Zampini, Matteo
    Travaglino, Erica
    Savevski, Victor
    Derus, Nicolas Riccardo
    Dall'Olio, Daniele
    Sala, Claudia
    Zhao, Lin-Pierre
    Santoro, Armando
    Kordasti, Shahram
    Santini, Valeria
    Kubasch, Anne Sophie
    Platzbecker, Uwe
    Diez-Campelo, Maria
    Fenaux, Pierre
    Zeidan, Amer M.
    Haferlach, Torsten
    Castellani, Gastone
    Della Porta, Matteo Giovanni
    D'Amico, Saverio
    BLOOD, 2023, 142