Vari-gram language model based on word clustering

Cited by: 0
Author
Li-chi Yuan
Affiliations
[1] Jiangxi University of Finance and Economics, School of Information Technology
[2] Central South University, School of Information Science and Engineering
Keywords
word similarity; word clustering; statistical language model; vari-gram language model
DOI
Not available
Abstract
Category-based statistical language models are an important way to address data sparsity, but they face two bottlenecks: 1) word clustering, where it is hard to find a clustering method that performs well at low computational cost; and 2) loss of predictive ability, since class-based models adapt poorly to text from different domains. To address these problems, a definition of word similarity based on mutual information is presented, and a definition of word-set similarity is built on it. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance, with perplexity reduced from 283 to 218. In addition, an absolute weighted difference method is presented and used to construct a vari-gram language model with good predictive ability. Compared with the category-based model, the vari-gram model reduces perplexity from 234.65 to 219.14 on Chinese corpora and from 195.56 to 184.25 on English corpora.
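To put the reported perplexity figures on a common scale, the short Python sketch below converts each before/after pair from the abstract into a relative reduction. The dictionary labels and the helper name `relative_reduction` are illustrative assumptions, not taken from the paper; only the six perplexity values come from the abstract.

```python
# Perplexity figures quoted in the abstract, as (before, after) pairs.
# The labels are illustrative; the abstract does not name the experiments.
results = {
    "word clustering": (283.0, 218.0),
    "vari-gram vs. category-based, Chinese": (234.65, 219.14),
    "vari-gram vs. category-based, English": (195.56, 184.25),
}


def relative_reduction(before: float, after: float) -> float:
    """Return the perplexity reduction as a percentage of the starting value."""
    return 100.0 * (before - after) / before


for name, (before, after) in results.items():
    print(f"{name}: {relative_reduction(before, after):.2f}%")
```

Under this reading, the clustering result corresponds to a reduction of roughly 23%, while the vari-gram model improves on the category-based model by roughly 6.6% on the Chinese corpora and 5.8% on the English corpora.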
Pages: 1057-1062
Number of pages: 5