IITG-HingCoS corpus: A Hinglish code-switching database for automatic speech recognition

Cited by: 11
Authors
Ganji, Sreeram [1]
Dhawan, Kunal [1]
Sinha, Rohit [1]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Elect & Elect Engn, Gauhati 781039, India
Keywords
Code-switching; Speech and text corpora; Automatic speech recognition; Language modeling
DOI
10.1016/j.specom.2019.04.007
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Code-switching is a linguistic phenomenon in which two or more languages are used, especially within the same discourse. It has been observed in many multilingual communities across the globe. In the recent past, there has been increasing demand for automatic speech recognition (ASR) systems that can handle code-switching. However, very limited code-switching resources are available so far for training such systems, so the development of such resources is highly desirable. In this work, we describe the collection of a Hinglish (Hindi-English) code-switching database at the Indian Institute of Technology Guwahati (IITG), referred to as the IITG-HingCoS corpus. The corpus consists of code-switching text data comprising 25,988 sentences with a total of 0.58 million words. In addition, it contains 25 h of matching speech data corresponding to 9,251 code-switching sentences covering a vocabulary of 6,542 words. This paper describes the sources and the protocol used for collecting the corpus. Baseline experimental results on the collected corpus for language modeling and ASR tasks are also presented.
Pages: 76-89
Number of pages: 14