Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition

被引:0
|
作者
Masumura, Ryo [1 ]
Hahm, Seongjun [1 ]
Ito, Akinori [1 ]
机构
[1] Tohoku Univ, Grad Sch Engn, Sendai, Miyagi 980, Japan
关键词
Spontaneous speech recognition; language model; World Wide Web; large vocabulary continuous speech recognition; Corpus of Spontaneous Japanese;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
This paper describes a language modeling method using large-scale spoken language data retrieved from the Web for spontaneous speech recognition. We downloaded 15 million Web pages on a comprehensive range topics. Next, spoken language-like texts were selected from the downloaded Web data using the naive Bayes classifier, and typical linguistic phenomena such as fillers and pauses were added using simulation models. A language model trained by the generated data gave as high performance as the large-scale spontaneous speech corpus (Corpus of Spontaneous Japanese, CSJ). By combining the generated data and CSJ, we improved word accuracy.
引用
收藏
页码:1476 / 1479
页数:4
相关论文
共 50 条
  • [1] Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition
    Masumura, Ryo
    Hahm, Seongjun
    Ito, Akinori
    [J]. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2011, : 1465 - 1468
  • [2] Spoken language identification using large vocabulary speech recognition
    Bell Lab, Murray Hill, United States
    [J]. Int Conf Spoken Lang Process ICSLP Proc, 1600, (1780-1783):
  • [3] Large-vocabulary spontaneous speech recognition using a corpus of lectures
    Nishimura, M
    Itoh, N
    [J]. ELECTRONICS AND COMMUNICATIONS IN JAPAN PART III-FUNDAMENTAL ELECTRONIC SCIENCE, 2003, 86 (08): : 52 - 60
  • [4] A unified language model for large vocabulary continuous speech recognition of Turkish
    Arisoy, Ebru
    Dutagaci, Helin
    Arslan, Levent M.
    [J]. SIGNAL PROCESSING, 2006, 86 (10) : 2844 - 2862
  • [5] Automatic language identification using large vocabulary continuous speech recognition
    Mendoza, S
    Gillick, L
    Ito, Y
    Lowe, S
    Newmann, M
    [J]. 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 785 - 788
  • [6] Spoken language identification using large vocabulary speech recognition.
    Hieronymus, JL
    Kadambe, S
    [J]. ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 1780 - 1783
  • [7] Large vocabulary speech recognition of Slovenian language using morphological models
    Maucec, M
    Rotovnik, T
    Kacic, Z
    Horvat, B
    [J]. IEEE REGION 8 EUROCON 2003, VOL B, PROCEEDINGS: COMPUTER AS A TOOL, 2003, : 158 - 161
  • [8] Syllable based language model for large vocabulary continuous speech recognition of Uyghur
    [J]. Silamu, W. (wushour@xju.edu.cn), 1600, Tsinghua University (53):
  • [9] Language-model look-ahead for large vocabulary speech recognition
    Ortmanns, S
    Ney, H
    Eiden, A
    [J]. ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 2095 - 2098
  • [10] Syllable Based Language Model for Large Vocabulary Continuous Speech Recognition of Polish
    Majewski, Piotr
    [J]. TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2008, 5246 : 397 - 401