Automatic Corpus Extension for Data-driven Natural Language Generation

Cited by: 0
Authors
Manishina, Elena [1]
Jabaian, Bassam [1]
Huet, Stephane [1]
Lefevre, Fabrice [1]
Affiliations
[1] Univ Avignon, LIA CERI, Avignon, France
Keywords
corpus building; natural language generation; automatic paraphrasing
DOI
Not available
Chinese Library Classification (CLC) number
H [Language, Script]
Discipline classification code
05
Abstract
As data-driven approaches started to make their way into the Natural Language Generation (NLG) domain, the need to automate corpus building and extension became apparent. Corpus creation and extension in the data-driven NLG domain traditionally involved manual paraphrasing, performed either by a group of experts or through crowdsourcing. Building training corpora manually is a costly enterprise that requires a lot of time and human resources. We propose to automate the process of corpus extension by integrating automatically obtained synonyms and paraphrases. Our methodology allowed us to significantly increase the size of the training corpus and its level of variability (the number of distinct tokens and specific syntactic structures). Our extension solutions are fully automatic and require only some initial validation. The human evaluation results confirm that in many cases native users favor the outputs of the model built on the extended corpus.
Pages: 3624-3631
Page count: 8