Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition

Cited by: 0

Authors:

Masumura, Ryo [1]

Hahm, Seongjun [1]

Ito, Akinori [1]

Affiliations:

[1] Tohoku Univ, Grad Sch Engn, Sendai, Miyagi 980, Japan

Source:

12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5 | 2011

Keywords:

Spontaneous speech recognition; language model; World Wide Web; large vocabulary continuous speech recognition; Corpus of Spontaneous Japanese;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper describes a language modeling method for spontaneous speech recognition that uses large-scale spoken language data retrieved from the Web. We downloaded 15 million Web pages covering a comprehensive range of topics. Next, spoken language-like texts were selected from the downloaded Web data using a naive Bayes classifier, and typical linguistic phenomena such as fillers and pauses were added using simulation models. A language model trained on the generated data performed as well as one trained on a large-scale spontaneous speech corpus (Corpus of Spontaneous Japanese, CSJ). By combining the generated data and CSJ, we improved word accuracy.
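The selection step mentioned in the abstract can be illustrated with a minimal sketch of a multinomial naive Bayes classifier that keeps only sentences scoring as more "spoken" than "written". This is not the authors' implementation: the abstract does not state the features, smoothing, or decision threshold, so the word-level features, add-one smoothing, the `margin` parameter, the function names, and the toy English sentences below are all assumptions made for illustration. Japanese input would first need word segmentation, which is omitted here.

```python
import math
from collections import Counter


def train_naive_bayes(docs_by_label, alpha=1.0):
    """Fit a multinomial naive Bayes model.

    docs_by_label maps a label to a list of tokenized sentences.
    alpha is the additive smoothing constant (an assumed value).
    """
    vocab = {tok for docs in docs_by_label.values() for doc in docs for tok in doc}
    total_docs = sum(len(docs) for docs in docs_by_label.values())
    model = {}
    for label, docs in docs_by_label.items():
        counts = Counter(tok for doc in docs for tok in doc)
        # Reserve one extra slot in the denominator for unseen words.
        denom = sum(counts.values()) + alpha * (len(vocab) + 1)
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            "log_lik": {t: math.log((counts[t] + alpha) / denom) for t in vocab},
            "log_unk": math.log(alpha / denom),
        }
    return model


def log_score(model, label, tokens):
    """Unnormalized log posterior of a label for one tokenized sentence."""
    params = model[label]
    return params["log_prior"] + sum(
        params["log_lik"].get(t, params["log_unk"]) for t in tokens
    )


def select_spoken_like(model, sentences, margin=0.0):
    """Keep sentences whose 'spoken' score beats 'written' by more than margin."""
    return [
        s for s in sentences
        if log_score(model, "spoken", s) - log_score(model, "written", s) > margin
    ]


# Toy training data (hypothetical, for illustration only).
training = {
    "spoken": [["well", "i", "think", "so"], ["you", "know", "i", "mean", "it"]],
    "written": [["the", "results", "are", "shown"],
                ["this", "paper", "describes", "the", "method"]],
}
model = train_naive_bayes(training)

candidates = [["i", "think", "you", "know"],
              ["the", "paper", "describes", "results"]]
selected = select_spoken_like(model, candidates)
print(selected)
```

Raising `margin` would make the filter stricter, trading the amount of retained Web text against how closely it resembles spoken language. The filler and pause insertion step is not covered by this sketch.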

Pages: 1476-1479

Number of pages: 4
