Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

Cited by: 5
Authors
Baviskar, Dipali [1]
Ahirrao, Swati [1]
Kotecha, Ketan [2]
Affiliations
[1] Symbiosis Int, Symbiosis Inst Technol, Pune 412115, Maharashtra, India
[2] Symbiosis Int, Symbiosis Ctr Appl Artificial Intelligence, Pune 412115, Maharashtra, India
Keywords
Artificial Intelligence (AI); information extraction; Named Entity Recognition (NER); unstructured data
DOI
10.3390/data6070078
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The day-to-day operations of an organization produce a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can use the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from so many complex unstructured documents is a tedious task. Research in this area therefore encourages the development of novel frameworks and tools that can automate key information extraction from unstructured documents. The limited availability of standard, high-quality, annotated unstructured document datasets is, however, a serious obstacle to that goal. This work eases the researcher's task by providing a high-quality, highly diverse, multi-layout, annotated invoice document dataset for extracting key information from unstructured documents. Researchers can use the proposed dataset for layout-independent processing of unstructured invoice documents and to develop artificial intelligence (AI)-based tools that identify and extract named entities in invoice documents. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, and annotated invoice documents.
Dataset: http://doi.org/10.5281/zenodo.5113009
Dataset License: CC-BY-4.0
Pages: 10