A Hybrid Method for Word Segmentation with English-Vietnamese Bilingual Text

Authors
Quoc Hung Ngo [1]
Dinh Dien [2]
Winiwarter, Werner [3]
Affiliations
[1] Univ Informat Technol, Fac Comp Sci, Ho Chi Minh City, Vietnam
[2] Univ Sci, Fac Informat Technol, Ho Chi Minh City, Vietnam
[3] Univ Vienna, Res Grp Data Analyt & Comp, A-1090 Vienna, Austria
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper proposes a hybrid approach to Vietnamese word segmentation. The approach combines a dictionary-based method with a machine learning method to detect word boundaries in Vietnamese text by comparing English-Vietnamese sentence pairs. We also point out several characteristics of Vietnamese that affect both Vietnamese word segmentation and the word alignment of English-Vietnamese text. In addition, we built an English-Vietnamese bilingual corpus of nearly 10 million words, named EVBCorpus; a part of EVBNews has been manually segmented at the word level. We evaluate our approach by comparing its word segmentation results on this corpus. The hybrid approach achieves 97% accuracy on the EVBNews corpus.
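The abstract does not spell out the dictionary-based component, so the following is only a minimal sketch of one common dictionary technique, greedy longest matching over syllables, and should not be read as the authors' method. The function name `segment_longest_match`, the `max_len` limit, and the tiny lexicon are all hypothetical; the machine learning component and the use of English-Vietnamese pairs are not shown.

```python
def segment_longest_match(syllables, lexicon, max_len=4):
    """Greedy longest-match segmentation over whitespace-separated syllables.

    Illustrative only: `lexicon` is a hypothetical set of known words, each
    written as space-separated syllables. Multi-syllable words are joined
    with "_" in the output.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                words.append("_".join(syllables[i:i + n]))
                i += n
                break
    return words


# Hypothetical toy lexicon, not taken from the paper or from EVBCorpus.
lexicon = {"sinh viên", "đại học", "học"}
print(segment_longest_match("sinh viên học đại học".split(), lexicon))
# prints ['sinh_viên', 'học', 'đại_học']
```

Greedy matching cannot resolve overlapping ambiguities on its own, which is one reason a dictionary method is often paired with a statistical model or, as in this paper, with evidence from aligned bilingual text.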
Pages: 5