A Hybrid Method for Word Segmentation with English-Vietnamese Bilingual Text

Citations: 0
Authors
Quoc Hung Ngo [1]
Dinh Dien [2]
Winiwarter, Werner [3]
Affiliations
[1] Univ Informat Technol, Fac Comp Sci, Ho Chi Minh City, Vietnam
[2] Univ Sci, Fac Informat Technol, Ho Chi Minh City, Vietnam
[3] Univ Vienna, Res Grp Data Analyt & Comp, A-1090 Vienna, Austria
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper proposes a hybrid approach to Vietnamese word segmentation. The approach combines a dictionary-based method with a machine learning method to detect word boundaries in Vietnamese text by comparing English-Vietnamese pairs. We also point out several characteristics of Vietnamese that affect both Vietnamese word segmentation and word alignment of English-Vietnamese text. Moreover, we built an English-Vietnamese bilingual corpus of nearly 10 million words, named EVBCorpus, and part of EVBNews has been manually segmented at the word level. We evaluate our approach by comparing its word segmentation results on this corpus. Our hybrid approach achieves 97% accuracy on the EVBNews corpus.
Pages: 5
Related papers
50 in total
  • [41] Identifying Reduplicative Words for Vietnamese Word Segmentation
    Ngoc Anh Tran
    Phuong Thai Nguyen
    Thanh Tinh Dao
    Hong Quan Nguyen
    2015 IEEE RIVF INTERNATIONAL CONFERENCE ON COMPUTING & COMMUNICATION TECHNOLOGIES - RESEARCH, INNOVATION, AND VISION FOR THE FUTURE (RIVF), 2015, : 77 - 82
  • [42] The role of word familiarity in Spanish/English bilingual word recognition
    Shi, Lu-Feng
    Sanchez, Diana
    INTERNATIONAL JOURNAL OF AUDIOLOGY, 2011, 50 (02) : 66 - 76
  • [43] A Hybrid Method for Bilingual Text Sentiment Classification Based on Deep Learning
    Liu, Guolong
    Xu, Xiaofei
    Deng, Bailong
    Chen, Siding
    Li, Li
    2016 17TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD), 2016, : 93 - 98
  • [44] Comprehensive printed Tibetan/English mixed text segmentation method
    Wang, H
    Ding, XQ
    DOCUMENT RECOGNITION AND RETRIEVAL XI, 2004, 5296 : 136 - 146
  • [45] SELECT BILINGUAL - THE SPANISH ENGLISH WORD PROCESSOR
    KLEMME, WH
    MODERN LANGUAGE JOURNAL, 1985, 69 (02) : 204 - 205
  • [46] One Novel Word Segmentation Method Based on N-Shortest Path in Vietnamese
    Ke, Xiaohua
    Luo, Haijiao
    Chen, Jihua
    Huang, Ruibin
    Lai, Jinwen
    ADVANCES IN COMPUTER COMMUNICATION AND COMPUTATIONAL SCIENCES, IC4S 2018, 2019, 924 : 549 - 557
  • [47] Chinese And English Bilingual Scene Text Detection
    Sha, Yuan
    Shi, Ping
    You, Jian
    Bao, Xiaojie
    Fu, Sizhe
    Zeng, Guoxiang
    2017 IEEE 3RD INFORMATION TECHNOLOGY AND MECHATRONICS ENGINEERING CONFERENCE (ITOEC), 2017, : 499 - 503
  • [48] Deep Neural Networks Algorithm for Vietnamese Word Segmentation
    Zheng, Kexiao
    Zheng, Wenkui
    SCIENTIFIC PROGRAMMING, 2022, 2022
  • [49] Identifying Coordinated Compound Words for Vietnamese Word Segmentation
    Ngoc Anh Tran
    Thanh Tinh Dao
    Phuong Thai Nguyen
    2013 INTERNATIONAL CONFERENCE OF SOFT COMPUTING AND PATTERN RECOGNITION (SOCPAR), 2013, : 31 - 36
  • [50] Vietnamese word segmentation with CRFs and SVMs: An investigation
    Nguyen, Cam-Tu
    Nguyen, Trung-Kien
    Phan, Xuan-Hieu
    Nguyen, Le-Minh
    Ha, Quang-Thuy
    PACLIC 20: PROCEEDINGS OF THE 20TH PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION, 2006, : 215 - 222