Data Augmentation and Preparation Process of PerInfEx: A Persian Chatbot With the Ability of Information Extraction

Cited by: 0

Authors:

Safari, Pegah [1]

Shamsfard, Mehrnoush [1]

Affiliations:

[1] Shahid Beheshti Univ, Fac Comp Sci & Engn, Tehran 1983969411, Iran

Source:

IEEE ACCESS | 2024 / Vol. 12

Keywords:

Chatbots; Oral communication; Training; Data augmentation; Semantics; Data mining; Natural languages; Data collection; dialogue generation; direct question; Persian open-domain chatbot; paraphrasing; personal information extraction; AGREEMENT

DOI:

10.1109/ACCESS.2024.3360863

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

In this paper, we describe the data preparation for our proposed chatbot PerInfEx (Persian Information Extraction chatbot). It aims to chit-chat interactively with users in Persian and, while asking as few direct questions as possible, to extract as much personal information as possible, such as the user's age or occupation. Collecting data of considerable size that is aligned with our system's specifics is a crucial step in training the data-hungry Natural Language Understanding (NLU) and Natural Language Generation (NLG) modules. Initially, for the NLU module, we collect 99 free-discussion dialogues and crawl 74 English training conversations as more general datasets, and we also manually translate 72 dialogues of the ConvAI2 corpus. Moreover, we gamify the collection by implementing a chatting website, which results in 94 dialogues. The website detects direct questions and assigns random profiles to participants, who should guess their opponent's profile. We also propose two augmentation methods, a semi-automatic one and a novel fully automatic one, which are comprehensively evaluated on NLU benchmarks and applied to our datasets. In addition, by prompting OpenAI's GPT-3.5 model, we automatically generate 304 dialogues. The first part of these datasets is manually annotated, while we use an active learning method to annotate the rest. Next, to evaluate data quality, we assess the datasets extrinsically using an NLU baseline, which results in intent-accuracy = 88.64, slot-F1 = 83.68 and exact-match = 78.22. For the NLG module, we automatically translate almost all of the rest of the ConvAI2 corpus (16,217 dialogues) and paraphrase the previous sets using the GPT-3.5 model for its fine-tuning. Their assessment using our NLG baseline results in a perplexity of 15.74 on the train set and 52.17 on the test set.
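
The abstract reports intent accuracy, slot F1, exact match and perplexity without defining them. The sketch below shows how these metrics are commonly computed for joint intent and slot models; it is not the authors' evaluation script. It assumes BIO-tagged slot labels, span-level F1, and exact match defined as "intent and all slot tags correct". The intent and slot names (`inform_age`, `age`, `occupation`) are hypothetical, and scores are returned as fractions rather than on the 0 to 100 scale used in the abstract.

```python
import math


def extract_spans(tags):
    """Return the set of (label, start, end) slot spans in a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def nlu_scores(gold, pred):
    """gold/pred: lists of (intent, bio_tags) pairs, one pair per utterance."""
    intent_hits = exact_hits = tp = n_gold = n_pred = 0
    for (g_intent, g_tags), (p_intent, p_tags) in zip(gold, pred):
        g_spans, p_spans = extract_spans(g_tags), extract_spans(p_tags)
        intent_ok = g_intent == p_intent
        intent_hits += intent_ok
        # Exact match (assumed definition): intent and every slot tag correct.
        exact_hits += intent_ok and list(g_tags) == list(p_tags)
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    n = len(gold)
    return {
        "intent_accuracy": intent_hits / n,
        "slot_f1": f1,
        "exact_match": exact_hits / n,
    }


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical toy data: two utterances, one with a missed slot.
gold = [("inform_age", ["O", "B-age", "I-age"]),
        ("inform_job", ["O", "B-occupation"])]
pred = [("inform_age", ["O", "B-age", "I-age"]),
        ("inform_job", ["O", "O"])]
scores = nlu_scores(gold, pred)
```

In this toy case both intents are right but one slot is missed, so intent accuracy stays at 1.0 while slot F1 and exact match drop. That is the same ordering as the reported figures (88.64, 83.68, 78.22), since exact match is the strictest of the three.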


Pages: 19158-19180

Number of pages: 23
