Vari-gram language model based on word clustering

Authors
袁里驰
Affiliations
[1] School of Information Science and Engineering, Central South University
[2] School of Information Technology, Jiangxi University of Finance and Economics
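Funding
National Natural Science Foundation of China
Keywords
word similarity; word clustering; statistical language model; vari-gram language model
DOI
Not available
Chinese Library Classification number
TP311.13
Discipline classification code
1201
Abstract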
Category-based statistical language models are an important way to address the data sparseness problem, but they face two bottlenecks: 1) word clustering, since it is hard to find a clustering method that performs well at low computational cost; and 2) class-based methods tend to lose predictive ability when adapting to text from different domains. To address these problems, a definition of word similarity based on mutual information is presented, and a definition of word-set similarity is derived from it. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance, with perplexity reduced from 283 to 218. In addition, an absolute weighted difference method is presented and used to construct a vari-gram language model with good predictive ability. Compared with the category-based model, the vari-gram model reduces perplexity from 234.65 to 219.14 on Chinese corpora and from 195.56 to 184.25 on English corpora.
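For readers comparing the perplexity figures in the abstract, the following is a minimal sketch, not taken from the paper. It shows the standard definition of perplexity and the relative reduction implied by each reported pair of values. The function names and dictionary labels are illustrative assumptions; only the six perplexity values come from the abstract.

```python
import math


def perplexity(word_probs):
    """Standard perplexity: exp of the negative mean log-probability
    that a model assigned to each word of a test text."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))


def relative_reduction(before, after):
    """Fraction by which perplexity dropped, relative to the starting value."""
    return (before - after) / before


# (before, after) perplexity pairs as reported in the abstract.
# The labels are illustrative, not the paper's own terminology.
reported = {
    "word clustering experiment": (283.0, 218.0),
    "vari-gram vs. category-based, Chinese corpora": (234.65, 219.14),
    "vari-gram vs. category-based, English corpora": (195.56, 184.25),
}

reductions = {name: relative_reduction(b, a) for name, (b, a) in reported.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower perplexity")
```

A model that assigns probability 0.25 to every word has perplexity 4, which is a quick sanity check for the `perplexity` helper. Read this way, the clustering result is the largest relative gain of the three reported comparisons.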
Pages: 1057-1062
Number of pages: 6
Related papers
50 in total
  • [21] A clustering-based topic model using word networks and word embeddings
    Wenchuan Mu
    Kwan Hui Lim
    Junhua Liu
    Shanika Karunasekera
    Lucia Falzon
    Aaron Harwood
    [J]. Journal of Big Data, 9
  • [22] Short Text Clustering based on Word Semantic Graph with Word Embedding Model
    Jinarat, Supakpong
    Manaskasemsak, Bundit
    Rungsawang, Arnon
    [J]. 2018 JOINT 10TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND INTELLIGENT SYSTEMS (SCIS) AND 19TH INTERNATIONAL SYMPOSIUM ON ADVANCED INTELLIGENT SYSTEMS (ISIS), 2018, : 1427 - 1432
  • [23] New word clustering method for building n-gram language models in continuous speech recognition systems
    Bahrani, Mohammad
    Sameti, Hossein
    Hafezi, Nazila
    Momtazi, Saeedeh
    [J]. NEW FRONTIERS IN APPLIED ARTIFICIAL INTELLIGENCE, 2008, 5027 : 286 - 293
  • [24] A clustering-based topic model using word networks and word embeddings
    Mu, Wenchuan
    Lim, Kwan Hui
    Liu, Junhua
    Karunasekera, Shanika
    Falzon, Lucia
    Harwood, Aaron
    [J]. JOURNAL OF BIG DATA, 2022, 9 (01)
  • [25] Comparing neural- and N-gram-based language models for word segmentation
    Doval, Yerai
    Gomez-Rodriguez, Carlos
    [J]. JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2019, 70 (02) : 187 - 197
  • [26] Word clustering with parallel spoken language corpora
    Wang, YY
    Lafferty, J
    Waibel, A
    [J]. ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 2364 - 2367
  • [27] Language clustering with word co-occurrence networks based on parallel texts
    LIU HaiTao
    CONG Jin
    [J]. Science Bulletin, 2013, 58 (10) : 1139 - 1144
  • [28] A Bit Progress on Word-Based Language Model
    陈勇
    陈国评
    [J]. Advances in Manufacturing, 2003, (02) : 148 - 155
  • [29] Language clustering with word co-occurrence networks based on parallel texts
    Liu HaiTao
    Cong Jin
    [J]. CHINESE SCIENCE BULLETIN, 2013, 58 (10) : 1139 - 1144
  • [30] Word Clustering Algorithms Based on Word Similarity
    Yuan, Lichi
    [J]. 2015 7TH INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN-MACHINE SYSTEMS AND CYBERNETICS IHMSC 2015, VOL I, 2015, : 21 - 24