Word clustering based on similarity and vari-gram language model

Cited by: 0
Authors
Yuan, LC [1]
Zhong, YX [1]
Affiliations
[1] Beijing Univ Posts & Telecommun, Coll Informat Engn, Beijing 100876, Peoples R China
Keywords
word clustering; statistical language model; vari-gram
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Class-based statistical language models are an important method for addressing the data sparsity problem. However, this kind of model has two bottlenecks: (1) word clustering: it is hard to find a clustering method that performs well without requiring a large amount of computation; (2) class-based methods always lose some predictive ability when adapting to text from different domains. This paper attempts to solve both problems. It presents a novel definition of word similarity and, based on it, defines the similarity of word sets. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance. The paper also presents a new method for creating the vari-gram model.
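The abstract does not give the paper's definitions of word similarity or word set similarity, so the following Python sketch only illustrates the general idea of similarity-based word clustering. It assumes cosine similarity over left/right neighbour counts as the word similarity and the average pairwise word similarity as the word set similarity. Both measures, and all function names, are assumptions chosen for illustration and are not the authors' method.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def context_vectors(tokens):
    """Count the left and right neighbours of every word in the token list."""
    vecs = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        vecs[prev][("R", nxt)] += 1   # nxt appears to the right of prev
        vecs[nxt][("L", prev)] += 1   # prev appears to the left of nxt
    return vecs


def word_similarity(u, v):
    """Cosine similarity of two context count vectors (assumed measure)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def set_similarity(a, b, vecs):
    """Average pairwise word similarity between two word sets (assumed measure)."""
    total = sum(word_similarity(vecs[x], vecs[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster(tokens, num_classes):
    """Merge the two most similar word sets until num_classes sets remain."""
    vecs = context_vectors(tokens)
    classes = [frozenset([w]) for w in sorted(vecs)]
    while len(classes) > num_classes:
        a, b = max(combinations(classes, 2),
                   key=lambda pair: set_similarity(pair[0], pair[1], vecs))
        classes = [c for c in classes if c not in (a, b)] + [a | b]
    return classes


tokens = "the cat sat the dog sat a cat ran a dog ran".split()
classes = cluster(tokens, 5)
```

In this toy corpus "cat" and "dog" occur in identical contexts, so they are the first pair to be merged into one class. A class-based language model would then estimate n-gram statistics over such classes rather than over individual words.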
Pages: 1222-1226
Number of pages: 5
Related papers
50 in total
  • [21] Language model based arabic word segmentation
    Lee, YS
    Papineni, K
    Roukos, S
    Emam, O
    Hassan, H
    [J]. 41ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, 2003, : 399 - 406
  • [22] Statistical language models of Lithuanian based on word clustering and morphological decomposition
    Vaiciunas, A
    Kaminskas, V
    Raskinis, G
    [J]. INFORMATICA, 2004, 15 (04) : 565 - 580
  • [23] AN APPROXIMATION ALGORITHM FOR WORD-REPLACEMENT USING A BI-GRAM LANGUAGE MODEL
    He, Jing
    Liang, Hongyu
    [J]. 2009 IEEE YOUTH CONFERENCE ON INFORMATION, COMPUTING AND TELECOMMUNICATION, PROCEEDINGS, 2009, : 27 - 30
  • [24] N-gram Language Model for Chinese Function-word-centered Patterns
    Song, Jie
    Liu, Yixiao
    Qu, Yunhua
    [J]. Journal of Computing and Information Technology, 2023, 31 (01) : 39 - 55
  • [25] MiNgMatch-A Fast N-gram Model for Word Segmentation of the Ainu Language
    Nowakowski, Karol
    Ptaszynski, Michal
    Masui, Fumito
    [J]. INFORMATION, 2019, 10 (10)
  • [26] Linguistic Summarization using a Weighted N-gram Language Model based on the Similarity of Time-series Data
    Aoki, Kasumi
    Kobayashi, Ichiro
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS (FUZZ-IEEE), 2016, : 595 - 601
  • [27] Similarity Word-Sequence Kernels for Sentence Clustering
    Andres-Ferrer, Jesus
    Sanchis-Trilles, German
    Casacuberta, Francisco
    [J]. STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, 2010, 6218 : 610 - 619
  • [28] A clustering-based topic model using word networks and word embeddings
    Mu, Wenchuan
    Lim, Kwan Hui
    Liu, Junhua
    Karunasekera, Shanika
    Falzon, Lucia
    Harwood, Aaron
    [J]. JOURNAL OF BIG DATA, 2022, 9 (01)
  • [29] Short Text Clustering based on Word Semantic Graph with Word Embedding Model
    Jinarat, Supakpong
    Manaskasemsak, Bundit
    Rungsawang, Arnon
    [J]. 2018 JOINT 10TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND INTELLIGENT SYSTEMS (SCIS) AND 19TH INTERNATIONAL SYMPOSIUM ON ADVANCED INTELLIGENT SYSTEMS (ISIS), 2018, : 1427 - 1432
  • [30] A clustering-based topic model using word networks and word embeddings
    Wenchuan Mu
    Kwan Hui Lim
    Junhua Liu
    Shanika Karunasekera
    Lucia Falzon
    Aaron Harwood
    [J]. Journal of Big Data, 9