Mutual Alignment between Audiovisual Features for End-to-End Audiovisual Speech Recognition

Cited by: 5
Authors
Liu, Hong [1]
Wang, Yawei [1]
Yang, Bing [1]
Affiliations
[1] Peking Univ, Key Lab Machine Percept, Shenzhen Grad Sch, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
multimodal alignment; audio-visual speech recognition; mutual iterative attention
DOI
10.1109/ICPR48806.2021.9412349
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The asynchrony between modalities is one of the major problems in audio-visual speech recognition (AVSR) research. Most AVSR systems merely rely on upsampling the video or downsampling the audio to align the audio and visual features, assuming that the two feature sequences correspond frame by frame. Such pre-processing oversimplifies the asynchrony between the acoustic signal and lip motion, lacks flexibility, and impairs system performance. Systems that do model the asynchrony between the modalities sometimes fail to align speech and video precisely under some, or even all, noisy conditions. In this paper, we propose a mutual feature alignment method for AVSR that makes full use of cross-modal information to address the asynchrony issue by introducing a Mutual Iterative Attention (MIA) mechanism. Our method automatically learns an alignment in a mutual way by iteratively performing mutual attention between the audio and visual features, building on a modified Transformer encoder structure. Experimental results show that the proposed method obtains absolute improvements of up to 20.42% over the audio modality alone, depending on the signal-to-noise ratio (SNR) level. It also achieves better recognition performance than the traditional feature-concatenation method under both clean and noisy conditions. We expect that the proposed mutual feature alignment method can be easily generalized to other multimodal tasks with semantically correlated information.
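The record contains no implementation details beyond the abstract, but the mechanism it describes (cross-attention iterated between the two feature streams inside a modified Transformer encoder) is concrete enough to sketch. Below is a minimal, hypothetical PyTorch sketch of such a mutual iterative attention block; the class name MutualIterativeAttention and the hyperparameters d_model, n_heads, and n_iters are illustrative assumptions, not the authors' published code.

```python
import torch
import torch.nn as nn

class MutualIterativeAttention(nn.Module):
    """Sketch of a mutual iterative attention (MIA) block: the audio and
    visual feature sequences alternately attend to each other for a fixed
    number of refinement iterations. Hyperparameters are assumptions."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_iters: int = 3):
        super().__init__()
        self.n_iters = n_iters
        # Cross-attention in both directions (batch_first: tensors are B x T x D).
        self.audio_attends_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_attends_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio: (B, T_a, D), video: (B, T_v, D); T_a and T_v may differ,
        # so no frame-level up-/down-sampling pre-alignment is required.
        for _ in range(self.n_iters):
            # Audio queries attend over visual keys/values, and vice versa;
            # residual connections preserve each stream's own information.
            a_ctx, _ = self.audio_attends_video(audio, video, video)
            v_ctx, _ = self.video_attends_audio(video, audio, audio)
            audio = self.norm_a(audio + a_ctx)
            video = self.norm_v(video + v_ctx)
        return audio, video


if __name__ == "__main__":
    mia = MutualIterativeAttention()
    a = torch.randn(2, 120, 256)  # e.g. 100 Hz acoustic frames
    v = torch.randn(2, 30, 256)   # e.g. 25 fps lip-region frames
    a_out, v_out = mia(a, v)
    print(a_out.shape, v_out.shape)
```

Because the attention is computed across sequences of different lengths, neither stream needs frame-level resampling before fusion, which is the flexibility the abstract argues for over frame-by-frame pre-alignment.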
Pages: 5348 - 5353
Page count: 6
Related Papers (50 items in total)
• [1] Petridis, Stavros; Stafylakis, Themos; Ma, Pingchuan; Cai, Feipeng; Tzimiropoulos, Georgios; Pantic, Maja. END-TO-END AUDIOVISUAL SPEECH RECOGNITION. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018: 6548 - 6552.
• [2] Tao, Fei; Busso, Carlos. End-to-End Audiovisual Speech Recognition System With Multitask Learning. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23: 1 - 11.
• [3] Wand, Michael; Ngoc Thang Vu; Schmidhuber, Juergen. INVESTIGATIONS ON END-TO-END AUDIOVISUAL FUSION. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018: 3041 - 3045.
• [4] Tao, Fei; Busso, Carlos. ALIGNING AUDIOVISUAL FEATURES FOR AUDIOVISUAL SPEECH RECOGNITION. 2018 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2018.
• [5] Tao, Fei; Busso, Carlos. End-to-end audiovisual speech activity detection with bimodal recurrent neural models. SPEECH COMMUNICATION, 2019, 113: 25 - 35.
• [6] Dresvyanskiy, Denis; Ryumina, Elena; Kaya, Heysem; Markitantov, Maxim; Karpov, Alexey; Minker, Wolfgang. End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild. MULTIMODAL TECHNOLOGIES AND INTERACTION, 2022, 6 (02).
• [7] Masumura, Ryo; Ihori, Mana; Takashima, Akihiko; Tanaka, Tomohiro; Ashihara, Takanori. End-to-End Automatic Speech Recognition with Deep Mutual Learning. 2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020: 632 - 637.
• [8] Tian, Ying; Li, Zerui; Liu, Min; Ouchi, Kazushige; Yan, Long; Zhao, Dan. End-to-end speech recognition with Alignment RNN-Transducer. 2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021.
• [9] El Haddad, Kevin; Rizk, Yara; Heron, Louise; Hajj, Nadine; Zhao, Yong; Kim, Jaebok; Ngo Trong Trung; Lee, Minha; Doumit, Marwan; Lin, Payton; Kim, Yelin; Cakmak, Huseyin. End-to-End Listening Agent for Audiovisual Emotional and Naturalistic Interactions. JOURNAL OF SCIENCE AND TECHNOLOGY OF THE ARTS, 2018, 10 (02): 49 - 61.
• [10] Jiang, Dongcheng; Zhang, Chao; Woodland, Philip C. A Neural Time Alignment Module for End-to-End Automatic Speech Recognition. INTERSPEECH 2023, 2023: 1374 - 1378.