Paraphrase Identification in Vietnamese Documents

Authors
Bach, Ngo Xuan [1 ,2 ]
Oanh, Tran Thi [3 ]
Hai, Nguyen Trung [1 ]
Phuong, Tu Minh [1 ,2 ]
Affiliations
[1] Posts & Telecommun Inst Technol, Dept Comp Sci, Ho Chi Minh City, Vietnam
[2] Posts & Telecommun Inst Technol, Machine Learning & Applicat Lab, Ho Chi Minh City, Vietnam
[3] Vietnam Natl Univ, Int Sch, Hanoi, Vietnam
Keywords
Paraphrase Identification; Semantic Similarity; Support Vector Machines; Maximum Entropy Model; Naive Bayes Classification; K-Nearest Neighbor;
DOI
10.1109/KSE.2015.37
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we investigate the task of paraphrase identification in Vietnamese documents, which determines whether two sentences have the same meaning. This task has been shown to be an important research direction with practical applications in natural language processing and data mining. We model the task as a classification problem and explore different types of features to represent sentences. We also introduce a paraphrase corpus for Vietnamese, vnPara, which consists of 3000 Vietnamese sentence pairs. We describe a series of experiments using various linguistic features and different machine learning algorithms, including Support Vector Machines, Maximum Entropy Model, Naive Bayes, and k-Nearest Neighbors. The results are promising, with the best model achieving up to 90% accuracy. To the best of our knowledge, this is the first attempt to solve the task of paraphrase identification for Vietnamese.
Pages: 174 - 179
Number of pages: 6