Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

Cited by: 5
Authors
Baviskar, Dipali [1]
Ahirrao, Swati [1]
Kotecha, Ketan [2]
Affiliations
[1] Symbiosis Int, Symbiosis Inst Technol, Pune 412115, Maharashtra, India
[2] Symbiosis Int, Symbiosis Ctr Appl Artificial Intelligence, Pune 412115, Maharashtra, India
Keywords
Artificial Intelligence (AI); information extraction; Named Entity Recognition (NER); unstructured data
DOI
10.3390/data6070078
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The day-to-day operations of an organization produce a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and more. Organizations can use the insights concealed in such documents for their operational benefit. However, analyzing and extracting insights from such numerous and complex unstructured documents is tedious. Research in this area is therefore encouraging the development of novel frameworks and tools that automate key information extraction from unstructured documents. A serious obstacle to that goal is the limited availability of standard, high-quality, annotated unstructured document datasets. This work eases the researcher's task by providing a high-quality, highly diverse, multi-layout, annotated invoice document dataset for key information extraction. Researchers can use the dataset for layout-independent processing of unstructured invoice documents and to develop artificial intelligence (AI)-based tools that identify and extract named entities in invoices. The dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, it is the only openly available dataset comprising high-quality, highly diverse, multi-layout, annotated invoice documents.
Dataset: http://doi.org/10.5281/zenodo.5113009
Dataset License: CC-BY-4.0
Pages: 10