Experiments on the use of corpus-based word BI-gram in Chinese word segmentation

Cited by: 0

Authors:

Xu, RF [1]

Yeung, D [1]

Affiliation:

[1] Hong Kong Polytech Univ, Dept Comp, Kowloon, Hong Kong

Source:

1998 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5 | 1998

Keywords:

Chinese word segmentation; word BI-gram; corpus-based;

DOI:

Not available

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Because written Chinese has no explicit delimiters between adjacent words, the first step of Chinese language processing is to segment a sentence into a sequence of words. This paper adopts an efficient corpus-based statistical method to address this problem, using word bi-gram statistical measures derived from a corpus to resolve segmentation ambiguities. To segment a Chinese sentence, a bidirectional maximum matching method is first used for pre-matching, which yields segmentation candidates and locates possible ambiguities. Statistical measures based on word bi-gram information and word frequency are then used to construct a discriminant function, which is applied to the ambiguous strings to select the most likely correct segmentation. Experimental results are analyzed to describe the features and limitations of this approach, and preliminary results indicate that the approach compares favorably with other existing techniques.
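The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon, the unigram and bi-gram counts, the maximum word length, and the add-one smoothed scoring function are all hypothetical stand-ins, since the abstract does not give the actual discriminant function. The sketch runs forward and backward maximum matching, treats a disagreement between the two as an ambiguity, and picks the candidate with the higher bi-gram score.

```python
import math

# Hypothetical toy lexicon and corpus counts, for illustration only.
LEXICON = {"研究", "研究生", "生命", "命", "起源"}
UNIGRAM = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 30}
BIGRAM = {("研究", "生命"): 8, ("生命", "起源"): 12}
MAX_LEN = 3  # assumed maximum word length in characters


def forward_mm(s):
    """Forward maximum matching: greedily take the longest lexicon word from the left."""
    words, i = [], 0
    while i < len(s):
        for n in range(min(MAX_LEN, len(s) - i), 0, -1):
            # Fall back to a single character when nothing in the lexicon matches.
            if s[i:i + n] in LEXICON or n == 1:
                words.append(s[i:i + n])
                i += n
                break
    return words


def backward_mm(s):
    """Backward maximum matching: greedily take the longest lexicon word from the right."""
    words, j = [], len(s)
    while j > 0:
        for n in range(min(MAX_LEN, j), 0, -1):
            if s[j - n:j] in LEXICON or n == 1:
                words.append(s[j - n:j])
                j -= n
                break
    return words[::-1]


def score(words):
    """Assumed scoring function: log word frequency of the first word plus
    add-one smoothed log bi-gram probabilities of each following word."""
    vocab = len(UNIGRAM)
    total = math.log(UNIGRAM.get(words[0], 0) + 1)
    for prev, cur in zip(words, words[1:]):
        num = BIGRAM.get((prev, cur), 0) + 1
        den = UNIGRAM.get(prev, 0) + vocab
        total += math.log(num / den)
    return total


def segment(s):
    """Return the agreed segmentation, or the higher-scoring candidate on disagreement."""
    fwd, bwd = forward_mm(s), backward_mm(s)
    if fwd == bwd:
        return fwd
    return max((fwd, bwd), key=score)


print(segment("研究生命起源"))
```

For the input "研究生命起源", forward matching yields 研究生 / 命 / 起源 while backward matching yields 研究 / 生命 / 起源; under the toy counts above, the bi-gram score favors the backward candidate.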


Pages: 4222-4227

Page count: 6
