Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention

Cited by: 0
Authors
Huang, Qiang [1 ]
Hain, Thomas [1 ]
Affiliations
[1] Univ Sheffield, Dept Comp Sci, Sheffield, S Yorkshire, England
Source
INTERSPEECH 2019
Keywords
mismatch detection; deep learning; attention
DOI
10.21437/Interspeech.2019-2125
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Classification Code
100104; 100213
Abstract
In this paper, we propose to detect mismatches between speech and its transcription using deep neural networks. Although many speech-related applications assume their transcriptions are free of mismatches, such errors are hard to avoid in practice, and training a model on mismatched data can degrade its performance. Instead of detecting errors by computing the distance between manual transcriptions and the text strings produced by a speech recogniser, we view mismatch detection as a classification task and merge speech and transcription features using deep neural networks. To enhance detection ability, we employ a cross-modal attention mechanism that learns the relevance between the features obtained from the two modalities. To evaluate the effectiveness of our approach, we test it on Factored WSJCAM0, randomly introducing three kinds of mismatch: word deletion, insertion, and substitution. To test robustness, we train our models on a small number of samples and detect mismatches with different numbers of words removed, inserted, or substituted. Our experimental results show detection performance close to 80% on insertions and deletions, outperforming the baseline.
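
The cross-modal attention idea in the abstract can be sketched compactly: encode each modality, let the transcription tokens attend over the speech frames, and classify the fused representation. Below is a minimal PyTorch sketch; the encoder choice (bidirectional GRUs), the attention head count, the feature dimensions, and the binary match/mismatch label set are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of cross-modal attention for speech/transcription
# mismatch classification. Layer sizes, pooling, and the 2-way label
# set are assumptions for illustration, not the paper's exact model.
import torch
import torch.nn as nn


class CrossModalMismatchDetector(nn.Module):
    def __init__(self, speech_dim=40, text_dim=300, hidden_dim=128,
                 num_classes=2):
        super().__init__()
        # Independent encoders project each modality into a shared space.
        self.speech_enc = nn.GRU(speech_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
        self.text_enc = nn.GRU(text_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Cross-modal attention: transcription states query speech states.
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden_dim,
                                          num_heads=4, batch_first=True)
        self.classifier = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, speech, text):
        s, _ = self.speech_enc(speech)   # (B, T_speech, 2H)
        t, _ = self.text_enc(text)       # (B, T_text, 2H)
        # Each word attends over the whole utterance; uniformly low
        # relevance suggests an insertion, deletion, or substitution.
        aligned, _ = self.attn(query=t, key=s, value=s)  # (B, T_text, 2H)
        fused = torch.cat([t, aligned], dim=-1)          # (B, T_text, 4H)
        return self.classifier(fused.mean(dim=1))        # (B, num_classes)


# Toy usage: 2 utterances, 100 speech frames, 12 transcription tokens.
model = CrossModalMismatchDetector()
speech = torch.randn(2, 100, 40)   # e.g. filterbank frames
text = torch.randn(2, 12, 300)     # e.g. word embeddings
logits = model(speech, text)       # (2, 2): match vs. mismatch
```

Pooling over word positions gives an utterance-level decision; applying the classifier per token of `fused` instead would localise which words were inserted, deleted, or substituted.
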
Pages: 584-588
Page count: 5
Related Papers
50 records in total
  • [1] Mismatch negativity of ERP in cross-modal attention
    Luo, Yuejia
    Wei, Jinghan
    SCIENCE IN CHINA SERIES C-LIFE SCIENCES, 1997, 40 (06): 604-612
  • [2] Lip and speech synchronization using supervised contrastive learning and cross-modal attention
    Varshney, Munender
    Mukherji, Mayurakshi
    Senthil, Raja G.
    Ganesh, Ananth
    Banerjee, Kingshuk
    2024 IEEE 18TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, FG 2024, 2024
  • [3] Cross-Modal Interactions Between Auditory Attention and Oculomotor Control
    Zhao, Sijia
    Contadini-Wright, Claudia
    Chait, Maria
    JOURNAL OF NEUROSCIENCE, 2024, 44 (11)
  • [4] Cross-Modal Attention Network for Detecting Multimodal Misinformation From Multiple Platforms
    Guo, Zhiwei
    Li, Yang
    Yang, Zhenguo
    Li, Xiaoping
    Lee, Lap-Kei
    Li, Qing
    Liu, Wenyin
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2024, 11 (04): 4920-4933
  • [5] Cross-modal decoupling in temporal attention between audition and touch
    Mühlberg, Stefanie
    Soto-Faraco, Salvador
    PSYCHOLOGICAL RESEARCH-PSYCHOLOGISCHE FORSCHUNG, 2019, 83 (08): 1626-1639
  • [6] Cross-Modal Prediction in Speech Perception
    Sanchez-Garcia, Carolina
    Alsius, Agnes
    Enns, James T.
    Soto-Faraco, Salvador
    PLOS ONE, 2011, 6 (10)
  • [7] Cross-Modal Effects in Speech Perception
    Keough, Megan
    Derrick, Donald
    Gick, Bryan
    ANNUAL REVIEW OF LINGUISTICS, VOL 5, 2019, 5: 49-66