A Hybrid Method for Word Segmentation with English-Vietnamese Bilingual Text
Cited by: 0
Authors:
Quoc Hung Ngo [1]
Dinh Dien [2]
Winiwarter, Werner [3]
Affiliations:
[1] Univ Informat Technol, Fac Comp Sci, Ho Chi Minh City, Vietnam
[2] Univ Sci, Fac Informat Technol, Ho Chi Minh City, Vietnam
[3] Univ Vienna, Res Grp Data Analyt & Comp, A-1090 Vienna, Austria
Source:
2013 INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND INFORMATION SCIENCES (ICCAIS)
2013
DOI: not available
Chinese Library Classification:
TP [Automation Technology, Computer Technology];
Discipline classification code:
0812;
Abstract:
This paper proposes a hybrid approach for Vietnamese word segmentation. The approach combines a dictionary-based method and a machine learning method to detect word boundaries in Vietnamese text by comparing English-Vietnamese pairs. We also point out several characteristics of Vietnamese that affect both the Vietnamese word segmentation task and the word alignment of English-Vietnamese text. Moreover, we built an English-Vietnamese bilingual corpus of nearly 10 million words, named EVBCorpus, and a part of EVBNews has been manually segmented at the word level. We evaluate the performance of our approach by comparing its word segmentation results on this corpus. Our hybrid approach achieves 97% accuracy on the EVBNews corpus.
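
To make the dictionary-based part of such a pipeline concrete, the following is a minimal sketch of greedy longest matching over syllables. It is not the authors' implementation: the abstract does not specify their dictionary method, their machine learning model, or how the English side of each pair is used. The function name `max_match_segment`, the `max_len` limit, and the three-entry toy lexicon are all hypothetical. The sketch relies only on the fact that written Vietnamese separates syllables with spaces, so a word may span several space-delimited tokens.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


def max_match_segment(syllables, lexicon, max_len=4):
    """Greedy longest-match segmentation over a list of syllables.

    At each position, try the longest span (up to max_len syllables) and
    shrink it until the span is found in the lexicon. A single syllable
    is always accepted as a fallback, so the loop always advances.
    """
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or nfc(candidate).lower() in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy lexicon of multi-syllable words (assumption, not from the paper).
LEXICON = {nfc(w) for w in ("sinh viên", "đại học", "học sinh")}

sentence = "Tôi là sinh viên đại học"
print(max_match_segment(sentence.split(), LEXICON))
# → ['Tôi', 'là', 'sinh viên', 'đại học']
```

Greedy matching alone cannot resolve overlapping candidates (here "đại học" and "học sinh" share the syllable "học"), which is one reason a hybrid design would pair dictionary lookup with a learned model or with evidence from the aligned English text.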