Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention

Cited by: 0
Authors
Huang, Qiang [1 ]
Hain, Thomas [1 ]
Affiliations
[1] Univ Sheffield, Dept Comp Sci, Sheffield, S Yorkshire, England
Source
INTERSPEECH 2019
Keywords
mismatch detection; deep learning; attention;
DOI
10.21437/Interspeech.2019-2125
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104; 100213;
Abstract
In this paper, we propose to detect mismatches between speech and its transcription using deep neural networks. Although many speech-related applications assume their transcriptions contain no mismatches, such errors are hard to avoid in practice, and training a model on mismatched data is likely to degrade its performance. Instead of detecting errors by computing the distance between manual transcriptions and the text strings produced by a speech recogniser, we view mismatch detection as a classification task and merge speech and transcription features using deep neural networks. To enhance detection, our approach uses a cross-modal attention mechanism that learns the relevance between the features obtained from the two modalities. To evaluate its effectiveness, we test the approach on Factored WSJCAM0, randomly introducing three kinds of mismatch: word deletion, insertion, and substitution. To test robustness, we train our models on a small number of samples and detect mismatches with varying numbers of words removed, inserted, or substituted. In our experiments, detection performance is close to 80% for insertion and deletion and outperforms the baseline.
Pages: 584-588
Page count: 5
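The abstract frames mismatch detection as a classification task over fused speech and transcription features, with cross-modal attention learning the relevance between the two modalities. The following is a minimal sketch of that idea, assuming PyTorch; the feature dimensions, vocabulary size, attention configuration, mean pooling, and two-class output are illustrative assumptions rather than details taken from the paper.

# Minimal sketch of a cross-modal attention mismatch classifier (assumed PyTorch;
# all sizes and the exact fusion strategy are illustrative, not from the paper).
import torch
import torch.nn as nn

class CrossModalMismatchDetector(nn.Module):
    """Classifies whether a (speech, transcription) pair is matched or mismatched."""

    def __init__(self, speech_dim=80, vocab_size=10000, embed_dim=256, num_classes=2):
        super().__init__()
        # Project frame-level speech features into a shared embedding space.
        self.speech_proj = nn.Linear(speech_dim, embed_dim)
        # Embed transcription tokens into the same space.
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        # Cross-modal attention: transcription tokens act as queries over speech frames.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        # Classifier over the pooled fused representation (match vs. mismatch).
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, num_classes)
        )

    def forward(self, speech_feats, token_ids):
        # speech_feats: (batch, frames, speech_dim); token_ids: (batch, words)
        speech = self.speech_proj(speech_feats)
        text = self.text_embed(token_ids)
        # Each token representation is re-weighted by its relevance to the speech frames.
        fused, _ = self.cross_attn(query=text, key=speech, value=speech)
        pooled = fused.mean(dim=1)        # average over tokens
        return self.classifier(pooled)    # logits: matched vs. mismatched

# Usage: a batch of 2 utterances, 120 frames of 80-dim features, 12 tokens each.
model = CrossModalMismatchDetector()
logits = model(torch.randn(2, 120, 80), torch.randint(0, 10000, (2, 12)))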