Continuous Phoneme Recognition based on Audio-Visual Modality Fusion

Cited by: 1
Authors
Richter, Julius [1]
Liebold, Jeanine [1, 2]
Gerkmann, Timo [1]
Affiliations
[1] Univ Hamburg, Signal Proc SP, Hamburg, Germany
[2] Univ Hamburg, ZBH Ctr Bioinformat, Hamburg, Germany
Keywords
phoneme recognition; modality fusion; audiovisual; feature extraction; deep learning; speech recognition
DOI
10.1109/IJCNN55064.2022.9892053
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
While state-of-the-art audio-only phoneme recognition already performs at a high standard, the robustness of existing methods still drops in very noisy environments. To mitigate this limitation, visual information can be incorporated into the recognition system, so that the problem is formulated in a multi-modal setting. To this end, we develop a continuous, audio-visual phoneme classifier that takes raw audio waveforms and video frames as input. Each modality is processed by its own feature extraction model before a fusion model exploits the correlations between them. Audio features are extracted with a residual neural network, while video features are obtained with a convolutional neural network. Furthermore, we model temporal dependencies with gated recurrent units. For modality fusion, we compare simple concatenation, attention-based methods, and squeeze-and-excitation to learn a joint representation. We train our models on the NTCD-TIMIT dataset, using distinct noise types from the QUT dataset for testing. By first pre-training the feature extraction models on the individual modalities, we achieve the best performance for the audio-visual model that is trained end-to-end. In the experiments, we show that including the video modality increases the accuracy of phoneme prediction by 9% in very noisy acoustic environments. The results indicate that in such environments our approach remains more robust than existing methods. The code and pre-trained models are available online.
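To make two of the fusion strategies named in the abstract concrete, the following is a minimal NumPy sketch of simple concatenation and a squeeze-and-excitation-style gating of the joint features. It is an illustration under stated assumptions, not the authors' implementation: the function names, the feature dimensions, the random weights, and the assumption that audio and video features are already aligned to the same number of time steps are all hypothetical.

```python
import numpy as np


def concat_fusion(audio, video):
    """Concatenate per-frame audio and video features along the feature axis."""
    return np.concatenate([audio, video], axis=-1)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_fusion(audio, video, w1, w2):
    """Squeeze-and-excitation-style fusion (illustrative only).

    Squeeze: average the joint features over time.
    Excite:  a two-layer bottleneck produces one gate in (0, 1) per channel.
    Scale:   every time step of the joint features is multiplied by the gates.
    """
    joint = np.concatenate([audio, video], axis=-1)  # (T, Da + Dv)
    squeezed = joint.mean(axis=0)                    # (Da + Dv,)
    hidden = np.maximum(0.0, squeezed @ w1)          # ReLU bottleneck, (H,)
    gates = sigmoid(hidden @ w2)                     # (Da + Dv,)
    return joint * gates                             # broadcast over time


# Hypothetical sizes: T time steps, Da audio dims, Dv video dims, H bottleneck.
rng = np.random.default_rng(0)
T, Da, Dv, H = 5, 8, 4, 3
audio = rng.standard_normal((T, Da))
video = rng.standard_normal((T, Dv))
w1 = rng.standard_normal((Da + Dv, H)) * 0.1  # random stand-ins for learned weights
w2 = rng.standard_normal((H, Da + Dv)) * 0.1

fused_concat = concat_fusion(audio, video)
fused_se = se_fusion(audio, video, w1, w2)
```

Both variants return a joint representation of shape `(T, Da + Dv)`. The gated variant differs only in that each channel is rescaled by a weight computed from the whole sequence, which gives a trained model a way to down-weight an unreliable modality, for example noisy audio.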
Pages: 8