Mutual Alignment between Audiovisual Features for End-to-End Audiovisual Speech Recognition

Cited by: 5
Authors
Liu, Hong [1]
Wang, Yawei [1]
Yang, Bing [1]
Affiliations
[1] Peking University, Key Laboratory of Machine Perception, Shenzhen Graduate School, Shenzhen, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
multimodal alignment; audio visual speech recognition; mutual iterative attention
DOI
10.1109/ICPR48806.2021.9412349
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Asynchrony between modalities is one of the major problems in audio-visual speech recognition (AVSR) research. However, most AVSR systems merely rely on upsampling the video or downsampling the audio to align audio and visual features, assuming that the feature sequences are aligned frame by frame. These pre-processing steps oversimplify the asynchrony between the acoustic signal and lip motion, which reduces flexibility and impairs system performance. Although some systems model the asynchrony between the modalities, they sometimes fail to align speech and video precisely under some, or even all, noisy conditions. In this paper, we propose a mutual feature alignment method for AVSR that makes full use of cross-modality information to address the asynchrony issue by introducing a Mutual Iterative Attention (MIA) mechanism. Our method automatically learns an alignment in a mutual way by iteratively performing mutual attention between the audio and visual features, relying on a modified Transformer encoder structure. Experimental results show that the proposed method obtains absolute improvements of up to 20.42% over the audio modality alone, depending on the signal-to-noise ratio (SNR) level. It also achieves better recognition performance than the traditional feature concatenation method under both clean and noisy conditions. We expect that the proposed mutual feature alignment method can be easily generalized to other multimodal tasks with semantically correlated information.
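The idea of mutual attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, no layer normalization, and no feed-forward sublayer, and it assumes both modalities share the same feature dimension. The function names (`cross_attend`, `mutual_iterative_attention`) and the toy feature values are hypothetical. The point it shows is that each modality can attend to the other at its own frame rate, so neither sequence has to be upsampled or downsampled first.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Scaled dot-product attention with `memory` as both keys and values.

    Each query frame is replaced by a weighted average of the frames in
    `memory`, so the output has as many frames as `queries`.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        attended.append(
            [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]
        )
    return attended


def add_residual(frames, update):
    """Element-wise sum of two equally shaped frame sequences."""
    return [[x + u for x, u in zip(f, g)] for f, g in zip(frames, update)]


def mutual_iterative_attention(audio, video, iterations=2):
    """Hypothetical sketch: audio and video attend to each other repeatedly."""
    for _ in range(iterations):
        # Both updates are computed from the same state before either is applied.
        audio_update = cross_attend(audio, video)
        video_update = cross_attend(video, audio)
        audio = add_residual(audio, audio_update)
        video = add_residual(video, video_update)
    return audio, video


# Toy input: 4 audio frames and 2 video frames (different rates), dimension 2.
audio = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
video = [[1.0, 0.0], [0.0, 1.0]]
audio_out, video_out = mutual_iterative_attention(audio, video)
print(len(audio_out), len(video_out))  # prints: 4 2
```

Both output sequences keep their original lengths, which is the contrast with frame-by-frame alignment by resampling. A real system would add learned query, key, and value projections and the remaining Transformer encoder sublayers.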
Pages: 5348-5353
Number of pages: 6