Comparative study on corpora for speech translation

Cited by: 26
Authors
Kikui, Genichiro [1]
Yamamoto, Seiichi
Takezawa, Toshiyuki
Sumita, Eiichiro
Affiliations
[1] NTT Corp, Cyberspace Labs, Kanagawa 2390847, Japan
[2] ATR, Spoken Language Commun Res Labs, Kyoto 6190288, Japan
Keywords
corpus; machine translation; speech translation; spoken dialog
DOI
10.1109/TASL.2006.878262
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper investigates issues in preparing corpora for developing speech-to-speech translation (S2ST). It is impractical to create a broad-coverage parallel corpus only from dialog speech. An alternative approach is to have bilingual experts write conversational-style texts in the target domain, with translations. There is, however, a risk of losing fidelity to the actual utterances. This paper focuses on balancing a tradeoff between these two kinds of corpora through the analysis of two newly developed corpora in the travel domain: a bilingual parallel corpus with 420 K utterances and a collection of in-domain dialogs using actual S2ST systems. We found that the first corpus is effective for covering utterances in the second corpus if complemented with a small number of utterances taken from monolingual dialogs. We also found that characteristics of in-domain utterances become closer to those of the first corpus when more restrictive conditions and instructions to speakers are given. These results suggest the possibility of a bootstrap style of development of corpora and S2ST systems, where an initial S2ST system is developed with parallel texts, and is then gradually improved with in-domain utterances collected by the system as restrictions are relaxed.
Pages: 1674-1682
Number of pages: 9