Vari-gram language model based on word clustering

Cited by: 1
Authors
Yuan Li-chi [1, 2]
Affiliations
[1] Jiangxi Univ Finance & Econ, Sch Informat Technol, Nanchang 330013, Peoples R China
[2] Cent S Univ, Sch Informat Sci & Engn, Changsha 410083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
word similarity; word clustering; statistical language model; vari-gram language model
DOI
10.1007/s11771-012-1109-z
Chinese Library Classification (CLC) number
TF [Metallurgical industry]
Discipline classification code
0806
Abstract
The category-based statistical language model is an important approach to the sparse-data problem, but it has two bottlenecks: 1) word clustering, for which it is hard to find a method that performs well at low computational cost; and 2) class-based methods lose predictive ability when adapting to text from different domains. To address these problems, a definition of word similarity based on mutual information is presented, and from it a definition of word-set similarity is given. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance, reducing perplexity from 283 to 218. In addition, an absolute weighted difference method is presented and used to construct a vari-gram language model with good predictive ability. Compared with the category-based model, the vari-gram model reduces perplexity from 234.65 to 219.14 on Chinese corpora and from 195.56 to 184.25 on English corpora.
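To put the reported perplexity figures on a common scale, the short sketch below computes the relative reduction for each before/after pair quoted in the abstract. The function name `relative_reduction` and the dictionary labels are illustrative assumptions, not taken from the paper; only the six perplexity values come from the abstract.

```python
# Illustrative arithmetic only: the perplexity values are those quoted in the
# abstract; the helper name and the labels are assumptions made for this sketch.

def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction (as a fraction) from `before` to `after`."""
    if before <= 0:
        raise ValueError("baseline perplexity must be positive")
    return (before - after) / before


# (baseline perplexity, improved perplexity) pairs as reported in the abstract.
reported = {
    "similarity-based clustering vs. greedy clustering": (283.0, 218.0),
    "vari-gram vs. category-based model (Chinese corpora)": (234.65, 219.14),
    "vari-gram vs. category-based model (English corpora)": (195.56, 184.25),
}

for label, (before, after) in reported.items():
    pct = 100.0 * relative_reduction(before, after)
    print(f"{label}: {before} -> {after} ({pct:.2f}% lower)")
```

Under these figures, the clustering result corresponds to a relative reduction of roughly 23%, while the vari-gram gains over the category-based model are roughly 6.6% on Chinese corpora and 5.8% on English corpora.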
Pages: 1057-1062
Number of pages: 6