Word clustering based on similarity and vari-gram language model

Cited by: 0
Authors
Yuan, LC [1]
Zhong, YX [1]
Affiliation
[1] Beijing Univ Posts & Telecommun, Coll Informat Engn, Beijing 100876, Peoples R China
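Keywords
word clustering; statistical language model; vari-gram
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract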
Class-based statistical language models are an important approach to the data sparseness problem. However, such models face two bottlenecks: (1) word clustering, where it is hard to find a method that performs well without a large amount of computation; and (2) domain adaptation, where class-based methods always lose some predictive ability when adapting to text from different domains. This paper attempts to address both problems. It presents a novel definition of word similarity and, based on it, defines word set similarity. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance. The paper also presents a new method for building the vari-gram model.
Pages: 1222-1226
Page count: 5