Experiments on the use of corpus-based word BI-gram in Chinese word segmentation

Authors
Xu, RF [1]
Yeung, D [1]
Affiliation
[1] Hong Kong Polytech Univ, Dept Comp, Kowloon, Hong Kong
Keywords
Chinese word segmentation; word bi-gram; corpus-based
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The first step of Chinese language processing is to segment a Chinese sentence into a sequence of words, because written Chinese has no delimiters between adjacent words. An efficient corpus-based statistical method is adopted here to address this problem. In this paper, word bi-gram statistical measures derived from a corpus are employed to resolve segmentation ambiguities. To segment a Chinese sentence, a bidirectional maximum matching method is first used for pre-matching, in order to obtain segmentation candidates and locate possible ambiguities. Statistical measures based on word bi-gram information and word frequency are then used to construct a discriminant function, which is applied to the ambiguous strings to select the most likely correct segmentation. Experimental results are analyzed to describe the features and limitations of this approach, and preliminary results indicate that our approach compares favorably with other existing techniques.
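The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the discriminant function, so the `score` function below (add-one smoothed log probabilities from word and bi-gram counts) is an assumed stand-in, and the lexicon and counts are invented toy values. For brevity the sketch compares the two whole-sentence candidates rather than isolating each ambiguous string.

```python
import math

def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right longest match; unknown characters become single-character words."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, lexicon, max_len):
    """Greedy right-to-left longest match, returned in reading order."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def score(words, unigram, bigram):
    """Assumed discriminant: log P(w1) + sum of log P(w_i | w_{i-1}), add-one smoothed."""
    total, vocab = sum(unigram.values()), len(unigram)
    value = math.log((unigram.get(words[0], 0) + 1) / (total + vocab))
    for prev, cur in zip(words, words[1:]):
        pair_count = bigram.get((prev, cur), 0)
        value += math.log((pair_count + 1) / (unigram.get(prev, 0) + vocab))
    return value

def segment(text, unigram, bigram):
    """Run both matchers; if they disagree, keep the higher-scoring candidate."""
    max_len = max(len(w) for w in unigram)
    fwd = forward_max_match(text, unigram, max_len)
    bwd = backward_max_match(text, unigram, max_len)
    if fwd == bwd:
        return fwd
    return max((fwd, bwd), key=lambda ws: score(ws, unigram, bigram))

# Toy counts, invented for illustration only (not from any real corpus).
UNIGRAM = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 30}
BIGRAM = {("研究", "生命"): 10, ("生命", "起源"): 15}

if __name__ == "__main__":
    print(segment("研究生命起源", UNIGRAM, BIGRAM))
```

On this toy input the forward pass yields 研究生 / 命 / 起源 and the backward pass yields 研究 / 生命 / 起源; the disagreement marks the ambiguity, and the bi-gram counts favor the backward candidate.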
Pages: 4222-4227
Page count: 6