Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

Cited by: 0
Authors
Pires, Ramon [1 ,2 ]
de Souza, Fabio C. [1 ,3 ]
Rosa, Guilherme [1 ,3 ]
Lotufo, Roberto A. [1 ,3 ]
Nogueira, Rodrigo [1 ,3 ]
Affiliations
[1] NeuralMind Inteligencia Artificial, Sao Paulo, SP, Brazil
[2] Univ Estadual Campinas, Inst Comp, Campinas, SP, Brazil
[3] Univ Estadual Campinas, Sch Elect & Comp Engn, Campinas, SP, Brazil
Source
Keywords
Information extraction; Sequence-to-sequence; Legal texts
DOI
10.1007/978-3-031-06555-2_6
Chinese Library Classification
TP18 (Theory of artificial intelligence);
Discipline codes
081104; 0812; 0835; 1405
Abstract
A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and removed, which leads to nontrivial modifications to the source code and the possible introduction of bugs. In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction of legal and registration documents. We finetune models that jointly extract the information and generate the output already in a structured format. Post-processing steps are learned during training, thus eliminating the need for rule-based methods and simplifying the pipeline. Furthermore, we propose a novel method to align the output with the input text, thus facilitating system inspection and auditing. Our experiments on four real-world datasets show that the proposed method is an alternative to classical pipelines. The source code is available at https://github.com/neuralmind-ai/information-extraction-t5.
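The abstract's core idea, generating structured output directly and then aligning each extracted value back to the source text for auditing, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: the `field: value` output format, the function names, and the `difflib`-based approximate alignment are all assumptions.

```python
import difflib


def parse_structured_output(generated: str) -> dict:
    """Parse a 'field: value; field: value' string, as a seq2seq model
    trained to emit structured output might produce."""
    fields = {}
    for part in generated.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def align_to_input(value: str, document: str):
    """Locate an extracted value in the original document (case-insensitive,
    approximate match). Returns (start, end) character offsets, or None if
    less than 80% of the value is found contiguously."""
    matcher = difflib.SequenceMatcher(
        None, document.lower(), value.lower(), autojunk=False
    )
    match = matcher.find_longest_match(0, len(document), 0, len(value))
    if match.size / max(len(value), 1) < 0.8:
        return None
    return (match.a, match.a + match.size)


doc = "Registered owner: Maria Silva, address Rua A 123, Campinas."
out = "owner: Maria Silva; city: Campinas"  # hypothetical model output
fields = parse_structured_output(out)
spans = {k: align_to_input(v, doc) for k, v in fields.items()}
```

Here `spans` maps each extracted field to its character span in the input, so an auditor can highlight exactly where each value came from, which is the inspection property the abstract emphasizes.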
Pages: 83-95
Page count: 13