Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention

Cited: 0
Authors:
Huang, Qiang [1 ]
Hain, Thomas [1 ]
Affiliations:
[1] Univ Sheffield, Dept Comp Sci, Sheffield, S Yorkshire, England
Source:
INTERSPEECH 2019
Keywords:
mismatch detection; deep learning; attention;
DOI:
10.21437/Interspeech.2019-2125
Chinese Library Classification:
R36 (Pathology); R76 (Otorhinolaryngology)
Subject Classification Codes:
100104; 100213
Abstract:
In this paper, we propose to detect mismatches between speech and its transcription using deep neural networks. Although many speech-related applications assume that speech and transcriptions match, such errors are hard to avoid in practice, and training a model on mismatched data is likely to degrade its performance. In our work, instead of detecting errors by computing the distance between manual transcriptions and the text strings produced by a speech recogniser, we cast mismatch detection as a classification task and merge speech and transcription features using deep neural networks. To strengthen detection, we employ a cross-modal attention mechanism that learns the relevance between the features obtained from the two modalities. To evaluate the effectiveness of our approach, we test it on Factored WSJCAM0, into which we randomly inject three kinds of mismatch: word deletion, insertion, and substitution. To test robustness, we train our models on a small number of samples and detect mismatches with varying numbers of words removed, inserted, or substituted. In our experiments, the results show that our approach reaches close to 80% mismatch-detection accuracy on insertions and deletions and outperforms the baseline.
Pages: 584-588 (5 pages)
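The abstract's core idea is to attend from one modality over the other before classifying match vs. mismatch. A minimal sketch of such a cross-modal attention step, using plain NumPy rather than the authors' actual architecture (feature dimensions, the pooling, and the concatenation-based merge are illustrative assumptions, not details from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, speech_feats):
    """Scaled dot-product attention: transcription tokens (queries)
    attend over speech frames (keys/values), yielding one speech
    context vector per text token."""
    d = text_feats.shape[-1]
    scores = text_feats @ speech_feats.T / np.sqrt(d)  # (T_text, T_speech)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    return weights @ speech_feats                      # (T_text, d)

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))     # 5 transcription tokens, dim 16 (assumed)
speech = rng.normal(size=(20, 16))  # 20 speech frames, dim 16 (assumed)

context = cross_modal_attention(text, speech)
# Merge the two modalities, e.g. by concatenation, as input to a
# downstream binary match/mismatch classifier (not shown here).
merged = np.concatenate([text, context], axis=-1)
print(merged.shape)  # (5, 32)
```

Casting the problem as classification over merged features, rather than aligning an ASR hypothesis against the transcript, means no recogniser decoding pass is needed at detection time.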