Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

Cited by: 0
Authors
Pires, Ramon [1 ,2 ]
de Souza, Fabio C. [1 ,3 ]
Rosa, Guilherme [1 ,3 ]
Lotufo, Roberto A. [1 ,3 ]
Nogueira, Rodrigo [1 ,3 ]
Affiliations
[1] NeuralMind Inteligencia Artificial, Sao Paulo, SP, Brazil
[2] Univ Estadual Campinas, Inst Comp, Campinas, SP, Brazil
[3] Univ Estadual Campinas, Sch Elect & Comp Engn, Campinas, SP, Brazil
Source
Keywords
Information extraction; Sequence-to-sequence; Legal texts
DOI
10.1007/978-3-031-06555-2_6
Chinese Library Classification
TP18 (Theory of artificial intelligence);
Discipline codes
081104; 0812; 0835; 1405
Abstract
A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and removed, which leads to nontrivial modifications to the source code and the possible introduction of bugs. In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction of legal and registration documents. We finetune models that jointly extract the information and generate the output already in a structured format. Post-processing steps are learned during training, thus eliminating the need for rule-based methods and simplifying the pipeline. Furthermore, we propose a novel method to align the output with the input text, thus facilitating system inspection and auditing. Our experiments on four real-world datasets show that the proposed method is an alternative to classical pipelines. The source code is available at https://github.com/neuralmind-ai/information-extraction-t5.
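The abstract's core idea, generating structured output directly and then aligning each extracted value back to the source text for auditing, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: the `field: value` output format, the function names, and the `difflib`-based approximate alignment are all assumptions.

```python
import difflib


def parse_structured_output(generated: str) -> dict:
    """Parse a 'field: value; field: value' string, as a seq2seq model
    trained to emit structured output might produce."""
    fields = {}
    for part in generated.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def align_to_input(value: str, document: str):
    """Locate an extracted value in the original document (case-insensitive,
    approximate match). Returns (start, end) character offsets, or None if
    less than 80% of the value is found contiguously."""
    matcher = difflib.SequenceMatcher(
        None, document.lower(), value.lower(), autojunk=False
    )
    match = matcher.find_longest_match(0, len(document), 0, len(value))
    if match.size / max(len(value), 1) < 0.8:
        return None
    return (match.a, match.a + match.size)


doc = "Registered owner: Maria Silva, address Rua A 123, Campinas."
out = "owner: Maria Silva; city: Campinas"  # hypothetical model output
fields = parse_structured_output(out)
spans = {k: align_to_input(v, doc) for k, v in fields.items()}
```

Here `spans` maps each extracted field to its character span in the input, so an auditor can highlight exactly where each value came from, which is the inspection property the abstract emphasizes.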
Pages: 83-95
Page count: 13