Combining text clustering and text retrieval for corpus adaptation

Authors
He F. [1]
Ding X. [1]
Affiliation
[1] Department of Electronic Engineering, Tsinghua University
Keywords
Kullback-Leibler distance; Statistical language model; Text clustering; Text retrieval
DOI
10.3772/j.issn.1002-0470.2010.12.003
Abstract
Natural language processing applications such as automatic speech recognition and intelligent input need application-relevant text for training, but such text is often hard to collect and therefore scarce. To address this, this paper presents a method for corpus adaptation that combines unsupervised text clustering with text retrieval. The method uses only a small set of application-specific text to find relevant text in a large unorganized corpus, thereby adapting the training corpus towards the application area of interest. The relevance of the retrieved text was evaluated by the performance of an n-gram statistical language model trained on the retrieved text and tested on the application-specific text. Preliminary experiments on short message texts and a large unorganized corpus demonstrated good performance of the proposed method.
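The abstract does not describe the algorithm in detail, so the following is only a minimal sketch of one way the idea could work, not the authors' implementation. It assumes the large corpus has already been split into clusters by some unsupervised method, and that clusters are ranked by the Kullback-Leibler distance between a smoothed unigram distribution of the small application-specific seed text and that of each cluster. The function names, the add-alpha smoothing, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def unigram_dist(tokens, vocab, alpha=0.5):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary (assumed smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q are dicts over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def rank_clusters(seed_tokens, clusters):
    """Return (cluster_index, KL) pairs, with the most seed-like cluster first."""
    # Build a shared vocabulary so every distribution has the same support.
    vocab = set(seed_tokens)
    for cluster_tokens in clusters:
        vocab.update(cluster_tokens)
    vocab = sorted(vocab)

    p_seed = unigram_dist(seed_tokens, vocab)
    scored = [
        (i, kl_divergence(p_seed, unigram_dist(cluster_tokens, vocab)))
        for i, cluster_tokens in enumerate(clusters)
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Toy data (hypothetical): a short-message-like seed and two pre-built clusters.
seed = "see you at lunch ok call me later".split()
clusters = [
    "the central bank raised interest rates as markets fell".split(),
    "ok see you later call me when you get to lunch".split(),
]
ranking = rank_clusters(seed, clusters)
best_index = ranking[0][0]
```

In a fuller pipeline, the text of the top-ranked clusters would be used to train an n-gram language model, and its perplexity on held-out application-specific text would serve as the relevance measure described in the abstract.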
Pages: 1224-1228
Number of pages: 4