Data Augmentation and Preparation Process of PerInfEx: A Persian Chatbot With the Ability of Information Extraction

Cited by: 0
Authors
Safari, Pegah [1]
Shamsfard, Mehrnoush [1]
Affiliations
[1] Shahid Beheshti Univ, Fac Comp Sci & Engn, Tehran 1983969411, Iran
Keywords
Chatbots; Oral communication; Training; Data augmentation; Semantics; Data mining; Natural languages; Data collection; dialogue generation; direct question; Persian open-domain chatbot; paraphrasing; personal information extraction; AGREEMENT
DOI
10.1109/ACCESS.2024.3360863
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we describe the data preparation for our proposed chatbot PerInfEx (Persian Information Extraction chatbot). It aims to chit-chat interactively with users in Persian and, by asking as few direct questions as possible, to extract as much personal information as possible, such as the user's age or occupation. Collecting data of considerable size that is aligned with our system's specifics is a crucial step in training the data-hungry Natural Language Understanding (NLU) and Natural Language Generation (NLG) modules. Initially, for the NLU module, we collect 99 free-discussion dialogues and crawl 74 English training conversations as more general datasets, and we also manually translate 72 dialogues of the ConvAI2 corpus. Moreover, we gamify the collection by implementing a chatting website, which results in 94 dialogues. The website detects direct questions and assigns random profiles to participants, who should guess their opponent's profile. We also propose two augmentation methods, a semi-automatic method and a novel fully automatic one, which are comprehensively evaluated on NLU benchmarks and applied to our datasets. In addition, by prompting OpenAI's GPT-3.5 model, we automatically generate 304 dialogues. The first part of these datasets is manually annotated, while we use an active learning method to annotate the rest. Next, to evaluate data quality, we assess the datasets extrinsically using an NLU baseline, which results in intent accuracy = 88.64, slot F1 = 83.68, and exact match = 78.22. For the NLG module, we automatically translate almost all of the rest of the ConvAI2 corpus (16,217 dialogues) and paraphrase the previous sets for its fine-tuning using the GPT-3.5 model. Their assessment using our NLG baseline results in a perplexity of 15.74 on the training set and 52.17 on the test set.
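For readers unfamiliar with the three NLU scores quoted in the abstract, the following is a minimal sketch of how intent accuracy, slot F1, and exact match are commonly computed for joint intent and slot models. It is not the authors' evaluation code: it assumes BIO-tagged slots, span-level micro-averaged F1, and an exact match that requires both the intent and the full tag sequence to be correct. The function names and the labels `inform_age`, `inform_job`, `age`, and `job` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end, label) slot spans from a BIO tag sequence."""
    spans: Set[Tuple[int, int, str]] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((start, i, label))
            # Only a B- tag opens a new span; a stray I- tag is ignored.
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans


def evaluate(
    gold_intents: List[str],
    pred_intents: List[str],
    gold_tags: List[List[str]],
    pred_tags: List[List[str]],
) -> Dict[str, float]:
    """Return intent accuracy, span-level slot F1, and exact match as fractions."""
    n = len(gold_intents)
    intent_hits = exact = tp = n_gold = n_pred = 0
    for gi, pi, gt, pt in zip(gold_intents, pred_intents, gold_tags, pred_tags):
        g, p = extract_spans(gt), extract_spans(pt)
        tp += len(g & p)       # spans predicted with correct boundaries and label
        n_gold += len(g)
        n_pred += len(p)
        intent_ok = gi == pi
        intent_hits += intent_ok
        exact += intent_ok and gt == pt  # assumption: whole utterance must be right
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"intent_accuracy": intent_hits / n, "slot_f1": f1, "exact_match": exact / n}


# Hypothetical two-utterance example; the labels are illustrative only.
scores = evaluate(
    gold_intents=["inform_age", "inform_job"],
    pred_intents=["inform_age", "greeting"],
    gold_tags=[["O", "B-age", "I-age"], ["B-job", "O"]],
    pred_tags=[["O", "B-age", "I-age"], ["B-job", "O"]],
)
print(scores)
```

The sketch returns fractions in [0, 1], whereas the abstract reports its scores on a 0 to 100 scale. The abstract does not state which exact definitions the authors used, so the averaging and matching rules above should be read as one common convention.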
Pages: 19158-19180
Number of pages: 23