A Hybrid Method for Word Segmentation with English-Vietnamese Bilingual Text

Cited by: 0
Authors
Quoc Hung Ngo [1]
Dinh Dien [2]
Winiwarter, Werner [3]
Affiliations
[1] Univ Informat Technol, Fac Comp Sci, Ho Chi Minh City, Vietnam
[2] Univ Sci, Fac Informat Technol, Ho Chi Minh City, Vietnam
[3] Univ Vienna, Res Grp Data Analyt & Comp, A-1090 Vienna, Austria
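Keywords
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract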
This paper proposes a hybrid approach to Vietnamese word segmentation. The approach combines a dictionary-based method with a machine learning method and detects word boundaries in Vietnamese text by comparing English-Vietnamese pairs. We also point out several characteristics of Vietnamese that affect both the Vietnamese word segmentation task and the word alignment of English-Vietnamese text. In addition, we built an English-Vietnamese bilingual corpus of nearly 10 million words, named EVBCorpus, and part of EVBNews has been manually segmented at the word level. We evaluate the approach by comparing its word segmentation results against this corpus. Our hybrid approach achieves 97% accuracy on the EVBNews corpus.
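To make the dictionary-based half of such a hybrid concrete, the following is a minimal sketch of greedy longest-match segmentation over Vietnamese syllables, followed by one possible boundary-level accuracy measure. It is not the authors' method: the abstract does not describe their dictionary lookup, their machine learning component, or how the 97% figure is computed. The function names, the toy lexicon, the `max_len` limit, and the accuracy definition are all assumptions made for illustration.

```python
from typing import List, Set


def max_match_segment(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy longest-match segmentation (hypothetical helper).

    Syllables that form one word are joined with "_".
    """
    words: List[str] = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until the lexicon matches;
        # a single syllable is always accepted as the fallback.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                words.append("_".join(syllables[i:i + n]))
                i += n
                break
    return words


def boundary_labels(words: List[str]) -> List[int]:
    """Label each gap between adjacent syllables: 1 = word boundary, 0 = inside a word."""
    labels: List[int] = []
    for word in words:
        labels.extend([0] * (len(word.split("_")) - 1))  # gaps inside the word
        labels.append(1)                                 # gap after the word
    return labels[:-1]  # the end of the sentence is not a gap


def boundary_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Share of syllable gaps labeled the same way (an assumed metric)."""
    p, g = boundary_labels(predicted), boundary_labels(gold)
    if len(p) != len(g):
        raise ValueError("segmentations cover different syllable counts")
    return sum(a == b for a, b in zip(p, g)) / len(g) if g else 1.0


# Toy lexicon of multi-syllable words (an assumption, not from EVBCorpus).
LEXICON = {"học sinh", "sinh học"}
sentence = "học sinh học sinh học".split()
predicted = max_match_segment(sentence, LEXICON)
gold = ["học_sinh", "học", "sinh_học"]  # "students study biology"
print(predicted, boundary_accuracy(predicted, gold))
```

On this overlapping-ambiguity sentence the greedy lookup commits to "học sinh" twice and misses "sinh học", which illustrates why a dictionary pass is commonly paired with a statistical or learned disambiguation step, or with evidence from an aligned English sentence.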
Pages: 5