A multimodal Lombard speech recognition system for the confusable Hindi syllabic units

Cited by: 0

Authors:

Maheswari, S. Uma [1]

Radha, N. [1]

Shahina, A. [1]

Prabha, P. [1]

Sri, B. T. Preethi [1]

Khan, A. Nayeemulla [2]

Affiliations:

[1] Sri Sivasubramaniya Nadar Coll Engn, Chennai 603110, India

[2] Vellore Institute Technol, Chennai, India

Source:

MATERIALS TODAY-PROCEEDINGS | 2022 / Vol. 62

Keywords:

Lombard speech; Multimodal ASR; Throat Microphone; Visual Speech; HMM; COMPENSATION; NOISE;

DOI:

10.1016/j.matpr.2022.04.996

Chinese Library Classification (CLC) number:

T [Industrial Technology];

Discipline classification code:

08;

Abstract:

Research on robust multimodal speech recognition systems that use acoustic and visual cues, extracted with relatively noise-robust alternate speech sensors, has recently been gaining interest in the speech processing research community. The primary objective of this work is to study the exclusive influence of the Lombard effect on Automatic Speech Recognition (ASR) systems, towards building robust multimodal ASR systems for adverse environments in the context of Indian languages, which are syllabic in nature. The dataset comprises the 145 confusable Consonant-Vowel (CV) syllabic units of the Hindi language, recorded simultaneously using three modalities that capture the acoustic and visual speech cues: a Normal acoustic Microphone (NM), a Throat Microphone (TM), and a camera that captures the associated lip movements. The Lombard effect is induced by feeding crowd noise into the speaker's headphones while recording. HMMs are built to categorize the CV units based on their Place of Articulation (POA), Manner of Articulation (MOA) and vowels, under clean and Lombard conditions. Unimodal ASR systems built using each speech cue all show a recognition loss due to the Lombard effect. To overcome this loss, the complementary speech cues taken from normal and throat microphone Lombard speech as well as from visual Lombard speech are used to build three bimodal ASR systems and one trimodal ASR system. Among the ASR systems studied, the trimodal system gives the best recognition accuracy of 98%, 95% and 76% for the vowels, MOA and POA, respectively, with an average improvement of 36% over the unimodal ASR systems and 9% over the bimodal ASR systems. Copyright (c) 2022 Elsevier Ltd. All rights reserved. Selection and peer-review under responsibility of the scientific committee of the International Conference on Innovative Technology for Sustainable Development.


Pages: 5034-5041

Page count: 8

