MODALITY ATTENTION FOR END-TO-END AUDIO-VISUAL SPEECH RECOGNITION

Cited: 0
Authors
Zhou, Pan [1 ]
Yang, Wenwen [2 ]
Chen, Wei [2 ]
Wang, Yanfeng [2 ]
Jia, Jia [1 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Sogou Inc, Voice Interact Technol Ctr, Beijing, Peoples R China
Keywords
multimodal attention; audio-visual speech recognition; lipreading; sequence-to-sequence;
DOI
Not available
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Audio-visual speech recognition (AVSR) is considered one of the most promising approaches to robust speech recognition, especially in noisy environments. In this paper, we propose a novel multimodal attention based method for audio-visual speech recognition that automatically learns a fused representation from both modalities according to their importance. Our method is realized with state-of-the-art sequence-to-sequence (Seq2seq) architectures. Experimental results show relative improvements of 2% up to 36% over the auditory modality alone, depending on the signal-to-noise ratio (SNR). Compared to traditional feature concatenation methods, our proposed approach achieves better recognition performance under both clean and noisy conditions. We believe the modality attention based end-to-end method can be readily generalized to other multimodal tasks with correlated information.
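The fusion idea described above can be sketched as follows: each modality produces a context vector, a learned score is computed per modality, the scores are normalized with a softmax into fusion weights, and the fused representation is the weighted sum. This is an illustrative NumPy sketch under assumed conventions, not the authors' exact architecture; the parameter names (`W_a`, `W_v`, `b`) are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def modality_attention(audio_ctx, visual_ctx, W_a, W_v, b):
    """Fuse per-modality context vectors by learned importance.

    Illustrative sketch: a scalar score is computed for each modality
    (here a dot product with hypothetical parameters W_a, W_v plus a
    bias), the scores are softmax-normalized into weights that sum to
    1, and the fused vector is the weighted sum of the two contexts.
    """
    scores = np.array([W_a @ audio_ctx + b[0],
                       W_v @ visual_ctx + b[1]])
    alpha = softmax(scores)  # modality weights, sum to 1
    fused = alpha[0] * audio_ctx + alpha[1] * visual_ctx
    return fused, alpha

# Example: with random contexts and parameters, the visual weight
# rises as the audio score drops (e.g. in noisy conditions).
rng = np.random.default_rng(0)
d = 8
audio_ctx = rng.standard_normal(d)
visual_ctx = rng.standard_normal(d)
W_a = rng.standard_normal(d)
W_v = rng.standard_normal(d)
fused, alpha = modality_attention(audio_ctx, visual_ctx,
                                  W_a, W_v, np.zeros(2))
```

In a full Seq2seq system these context vectors would themselves come from per-modality attention over encoder states at each decoder step, so the fusion weights can shift frame by frame as the SNR changes.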
Pages: 6565-6569
Page count: 5
Related Papers
10 of 50 shown
  • [1] End-to-end audio-visual speech recognition for overlapping speech
    Rose, Richard
    Siohan, Olivier
    Tripathi, Anshuman
    Braga, Otavio
    [J]. INTERSPEECH 2021, 2021, : 3016 - 3020
  • [2] END-TO-END AUDIO-VISUAL SPEECH RECOGNITION WITH CONFORMERS
    Ma, Pingchuan
    Petridis, Stavros
    Pantic, Maja
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7613 - 7617
  • [3] FUSING INFORMATION STREAMS IN END-TO-END AUDIO-VISUAL SPEECH RECOGNITION
    Yu, Wentao
    Zeiler, Steffen
    Kolossa, Dorothea
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3430 - 3434
  • [4] Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition
    Ma, Pingchuan
    Petridis, Stavros
    Pantic, Maja
    [J]. INTERSPEECH 2019, 2019, : 4090 - 4094
  • [5] Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
    Hong, Joanna
    Kim, Minsu
    Yoo, Daehun
    Ro, Yong Man
    [J]. INTERSPEECH 2022, 2022, : 2838 - 2842
  • [6] Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition
    Li, Guinan
    Deng, Jiajun
    Geng, Mengzhe
    Jin, Zengrui
    Wang, Tianzi
    Hu, Shujie
    Cui, Mingyu
    Meng, Helen
    Liu, Xunying
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 2707 - 2723
  • [7] End-to-End Bloody Video Recognition by Audio-Visual Feature Fusion
    Hou, Congcong
    Wu, Xiaoyu
    Wang, Ge
    [J]. PATTERN RECOGNITION AND COMPUTER VISION (PRCV 2018), PT I, 2018, 11256 : 501 - 510
  • [8] End-to-End Audio-Visual Neural Speaker Diarization
    He, Mao-kui
    Du, Jun
    Lee, Chin-Hui
    [J]. INTERSPEECH 2022, 2022, : 1461 - 1465
  • [9] END-TO-END MULTI-PERSON AUDIO/VISUAL AUTOMATIC SPEECH RECOGNITION
    Braga, Otavio
    Makino, Takaki
    Siohan, Olivier
    Liao, Hank
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6994 - 6998
  • [10] TRIGGERED ATTENTION FOR END-TO-END SPEECH RECOGNITION
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5666 - 5670