An Unsupervised Learning and Statistical Approach for Vietnamese Word Recognition and Segmentation

Cited by: 0
Authors
Trung, Hieu Le [1]
Vu Le Anh [2]
Trung, Kien Le [3]
Affiliations
[1] St Petersburg State Univ, St Petersburg, Russia
[2] Hoa Sen Univ, Ho Chi Minh City, Vietnam
[3] Ernst Moritz Arndt Univ Greifswald, Inst Math, Greifswald, Germany
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper addresses two main topics: (i) recognizing Vietnamese words and segmenting sentences into words using probabilistic models; and (ii) constructing the optimum probabilistic model through an unsupervised learning process. For each probabilistic model, new words are recognized and their syllables are linked together. The syllable-linking process improves the accuracy of the statistical functions, which in turn improves the recognition of new words. Hence, the probabilistic model will converge to the optimum one. Our experimental corpus is generated from about 250,000 online news articles, which consist of about 19,000,000 sentences. The accuracy of the segmentation algorithm is over 90%. Our Vietnamese word and phrase dictionary contains more than 150,000 elements.
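The loop described in the abstract (estimate statistics, recognize new words, link their syllables, re-estimate) can be illustrated with a minimal sketch. The abstract does not specify the paper's statistical functions, so pointwise mutual information (PMI) over adjacent tokens stands in for them here as an assumption; the function names, the threshold, the minimum count, and the toy corpus (diacritics omitted) are all hypothetical.

```python
import math
from collections import Counter


def link_once(sentences, threshold, min_count=2):
    """One round: score adjacent token pairs by PMI (an assumed stand-in for
    the paper's statistical functions) and link the pairs above the threshold."""
    unigrams = Counter(tok for s in sentences for tok in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if n_bi == 0:
        return sentences, set()

    def pmi(a, b):
        p_ab = bigrams[(a, b)] / n_bi
        return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

    # Candidate "new words": frequent adjacent pairs with high association.
    new_words = {
        pair for pair, count in bigrams.items()
        if count >= min_count and pmi(*pair) > threshold
    }

    # Link the syllables of each recognized word, scanning left to right.
    linked = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) in new_words:
                out.append(s[i] + "_" + s[i + 1])
                i += 2
            else:
                out.append(s[i])
                i += 1
        linked.append(out)
    return linked, new_words


def segment(sentences, threshold=1.0, max_rounds=5):
    """Repeat linking until no new word is found or max_rounds is reached."""
    for _ in range(max_rounds):
        sentences, new_words = link_once(sentences, threshold)
        if not new_words:
            break
    return sentences


# Hypothetical toy corpus: each sentence is a list of syllables.
toy_corpus = [
    ["sinh", "vien", "di", "hoc"],
    ["sinh", "vien", "an", "com"],
    ["toi", "di", "an"],
    ["toi", "hoc", "bai"],
]
result = segment(toy_corpus)
print(result)
```

Because linked syllables become single tokens, each round recomputes the counts over the updated corpus, which is the feedback between linking and statistics that the abstract describes. A real system would need a far larger corpus and a better-motivated stopping criterion than this fixed threshold.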
Pages: 195 / +
Page count: 2
Related papers
  • [1] HMMs for Unsupervised Vietnamese Word Segmentation
    Ba-Long Bui
    Thi-Trang Nguyen
    Huu-Hoang Nguyen
    Kiem-Hieu Nguyen
    2019 IEEE - RIVF INTERNATIONAL CONFERENCE ON COMPUTING AND COMMUNICATION TECHNOLOGIES (RIVF), 2019, : 284 - 289
  • [2] A Hybrid Approach to Vietnamese Word Segmentation
    Tuan-Phong Nguyen
    Anh-Cuong Le
    2016 IEEE RIVF INTERNATIONAL CONFERENCE ON COMPUTING & COMMUNICATION TECHNOLOGIES, RESEARCH, INNOVATION, AND VISION FOR THE FUTURE (RIVF), 2016, : 114 - 119
  • [3] A New Unsupervised Approach to Word Segmentation
    Wang, Hanshi
    Zhu, Jian
    Tang, Shiping
    Fan, Xiaozhong
    COMPUTATIONAL LINGUISTICS, 2011, 37 (03) : 421 - 454
  • [4] An Improved Unsupervised Approach to Word Segmentation
    WANG Hanshi
    HAN Xuhong
    LIU Lizhen
    SONG Wei
    YUAN Mudan
    中国通信, 2015, 12 (07) : 82 - 95
  • [5] An Improved Unsupervised Approach to Word Segmentation
    Wang Hanshi
    Han Xuhong
    Liu Lizhen
    Song Wei
    Yuan Mudan
    CHINA COMMUNICATIONS, 2015, 12 (07) : 82 - 95
  • [6] Unsupervised Ensemble Learning for Vietnamese Multisyllabic Word Extraction
    Liu, Wuying
    Wang, Lin
    PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2016, : 353 - 357
  • [7] Probabilistic Ensemble Learning for Vietnamese Word Segmentation
    Liu, Wuying
    Lin, Li
    SIGIR'14: PROCEEDINGS OF THE 37TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2014, : 931 - 934
  • [8] A Hybrid Approach to Word Segmentation of Vietnamese Texts
    Phuong, Le Hong
    Nguyen Thi Minh Huyen
    Roussanaly, Azim
    Ho Tuong Vinh
    LANGUAGE AND AUTOMATA THEORY AND APPLICATIONS, 2008, 5196 : 240 - +
  • [9] Isarn Dharma Word Segmentation Using a Statistical Approach with Named Entity Recognition
    Somsap, Sittichai
    Seresangtakul, Pusadee
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (02)
  • [10] Mongolian word segmentation system based on unsupervised statistical model
    Wang, Siriguleng
    Bao, Meirong
    Arong
    INFORMATION SCIENCE AND MANAGEMENT ENGINEERING, VOLS 1-3, 2014, 46 : 707 - 714