Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

Cited by: 0
Authors
Pires, Ramon [1 ,2 ]
de Souza, Fabio C. [1 ,3 ]
Rosa, Guilherme [1 ,3 ]
Lotufo, Roberto A. [1 ,3 ]
Nogueira, Rodrigo [1 ,3 ]
Affiliations
[1] NeuralMind Inteligencia Artificial, Sao Paulo, SP, Brazil
[2] Univ Estadual Campinas, Inst Comp, Campinas, SP, Brazil
[3] Univ Estadual Campinas, Sch Elect & Comp Engn, Campinas, SP, Brazil
Keywords
Information extraction; Sequence-to-sequence; Legal texts
DOI
10.1007/978-3-031-06555-2_6
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and removed, which leads to nontrivial modifications to the source code and the possible introduction of bugs. In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction of legal and registration documents. We finetune models that jointly extract the information and generate the output already in a structured format. Post-processing steps are learned during training, thus eliminating the need for rule-based methods and simplifying the pipeline. Furthermore, we propose a novel method to align the output with the input text, thus facilitating system inspection and auditing. Our experiments on four real-world datasets show that the proposed method is an alternative to classical pipelines. The source code is available at https://github.com/neuralmind-ai/information-extraction-t5.
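The abstract describes models that generate extracted information already in a structured format, plus a method to align that output back to the input text for auditing. The paper's exact serialization and alignment algorithm are not reproduced here; the following is a minimal sketch under assumed conventions (bracketed field tags for the target string, exact-then-fuzzy substring matching for alignment; all function names are illustrative):

```python
import re
import difflib


def build_target(fields: dict) -> str:
    """Serialize extracted fields into a flat, structured target string
    that a seq2seq model (e.g. T5) could be finetuned to generate."""
    return " ".join(f"[{k}] {v}" for k, v in fields.items())


def parse_target(output: str) -> dict:
    """Recover field/value pairs from the generated string."""
    pairs = re.findall(r"\[(\w+)\]\s*([^\[]+)", output)
    return {k: v.strip() for k, v in pairs}


def align(value: str, source: str):
    """Locate an extracted value in the source text for inspection:
    try an exact substring match first, then fall back to the longest
    fuzzy match. Returns a (start, end) span or None."""
    idx = source.find(value)
    if idx >= 0:
        return idx, idx + len(value)
    m = difflib.SequenceMatcher(None, source, value).find_longest_match(
        0, len(source), 0, len(value)
    )
    return (m.a, m.a + m.size) if m.size else None
```

Because the post-processing is expressed as a learnable string format rather than hand-written rules, adding or removing a class amounts to changing the training targets instead of modifying pipeline code.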
Pages: 83-95 (13 pages)
Related Papers (50 total)
  • [21] SUPERVISED ATTENTION IN SEQUENCE-TO-SEQUENCE MODELS FOR SPEECH RECOGNITION
    Yang, Gene-Ping
    Tang, Hao
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7222 - 7226
  • [22] Exploring Sequence-to-Sequence Models for SPARQL Pattern Composition
    Panchbhai, Anand
    Soru, Tommaso
    Marx, Edgard
    KNOWLEDGE GRAPHS AND SEMANTIC WEB, KGSWC 2020, 2020, 1232 : 158 - 165
  • [23] Neural Abstractive Text Summarization with Sequence-to-Sequence Models
    Shi, Tian
    Keneshloo, Yaser
    Ramakrishnan, Naren
    Reddy, Chandan K.
    ACM/IMS Transactions on Data Science, 2021, 2 (01):
  • [24] Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models
    Parry, Andrew
    Froebe, Maik
    MacAvaney, Sean
    Potthast, Martin
    Hagen, Matthias
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT II, 2024, 14609 : 286 - 302
  • [25] Predicting the Mumble of Wireless Channel with Sequence-to-Sequence Models
    Huangfu, Yourui
    Wang, Jian
    Li, Rong
    Xu, Chen
    Wang, Xianbin
    Zhang, Huazi
    Wang, Jun
    2019 IEEE 30TH ANNUAL INTERNATIONAL SYMPOSIUM ON PERSONAL, INDOOR AND MOBILE RADIO COMMUNICATIONS (PIMRC), 2019, : 1043 - 1049
  • [26] ACOUSTIC-TO-WORD RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS
    Palaskar, Shruti
    Metze, Florian
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 397 - 404
  • [27] Persian Keyphrase Generation Using Sequence-to-sequence Models
    Doostmohammadi, Ehsan
    Bokaei, Mohammad Hadi
    Sameti, Hossein
    2019 27TH IRANIAN CONFERENCE ON ELECTRICAL ENGINEERING (ICEE 2019), 2019, : 2010 - 2015
  • [28] Sequence-to-Sequence Models Can Directly Translate Foreign Speech
    Weiss, Ron J.
    Chorowski, Jan
    Jaitly, Navdeep
    Wu, Yonghui
    Chen, Zhifeng
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2625 - 2629
  • [29] Sequence-to-Sequence Models and Their Evaluation for Spoken Language Normalization of Slovenian
    Sepesy Maučec, Mirjam
    Verdonik, Darinka
    Donaj, Gregor
    Applied Sciences (Switzerland), 2024, 14 (20):
  • [30] Unleashing the True Potential of Sequence-to-Sequence Models for Sequence Tagging and Structure Parsing
    He, Han
    Choi, Jinho D.
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2023, 11 : 582 - 599