Configurable Customized Information Extraction and Processing Pipeline

Cited by: 0
Authors
Kim, Seok [1]
Lai, Pierce [1]
Khan, Dariyan [1]
Zhao, Kevin [1]
Le, Brian [1]
Luchianov, Alex [1]
Yu, Margaret [1]
Wang, Patrick [1]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab CSAIL, 32 Vassar St, Cambridge, MA 02139 USA
Keywords
AI; OCR; document processing
DOI
10.1142/S0218001424590122
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Extracting information from scanned business documents, while a necessary commercial task, is still done mostly by hand and requires significant human effort. Current solutions for automated document information extraction remain limited with regard to user-required customizability and the extraction of dataset-specific information, which leaves the area an active field of research. In this paper, we propose modifications and improvements to our previously developed custom pipeline for extracting and tabulating key-value pairs from commercial invoice documents. Our design changes and additions adapt the pipeline to a wider variety of document types and use cases, primarily through dataset-specific configuration files that promote customizability, along with new technical modules that address both general and dataset-specific complexities. We compare our pipeline's performance against current machine learning and commercial solutions on a real-world dataset, and demonstrate that it extracts a wider variety of fields while maintaining accuracy competitive with or greater than that of the alternative solutions.
Pages: 25