Continuous Phoneme Recognition based on Audio-Visual Modality Fusion

Cited by: 1

Authors:

Richter, Julius [1]

Liebold, Jeanine [1,2]

Gerkmann, Timo [1]

Affiliations:

[1] Univ Hamburg, Signal Proc SP, Hamburg, Germany

[2] Univ Hamburg, ZBH Ctr Bioinformat, Hamburg, Germany

Source:

2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2022

Keywords:

phoneme recognition; modality fusion; audiovisual; feature extraction; deep learning; speech recognition

DOI:

10.1109/IJCNN55064.2022.9892053

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

While state-of-the-art audio-only phoneme recognition already performs well, the robustness of existing methods still drops in very noisy environments. To mitigate this limitation, visual information can be incorporated into the recognition system, so that the problem is formulated in a multi-modal setting. To this end, we develop a continuous audio-visual phoneme classifier that takes raw audio waveforms and video frames as input. Each modality is processed by its own feature extraction model before a fusion model exploits the correlations between them. Audio features are extracted with a residual neural network, while video features are obtained with a convolutional neural network. Furthermore, we model temporal dependencies with gated recurrent units. For modality fusion, we compare simple concatenation, attention-based methods, and squeeze-and-excitation to learn a joint representation. We train our models on the NTCD-TIMIT dataset and test them with distinct noise types from the QUT dataset. By first pre-training the feature extraction models on the individual modalities, we achieve the best performance for the audio-visual model that is trained end-to-end. In the experiments, we show that including the video modality increases the accuracy of phoneme prediction by 9% in very noisy acoustic environments. The results indicate that in such environments our approach remains more robust than existing methods. The code and pre-trained models are available online.
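To make two of the fusion strategies named in the abstract concrete, the following is a minimal NumPy sketch of concatenation and of a squeeze-and-excitation style gate over the joint features. It is not the authors' implementation: the function names, feature dimensions, random weights, and the choice to squeeze by averaging over time are all assumptions made for illustration, and the residual network, convolutional network, and gated recurrent units from the abstract are replaced here by random per-frame feature matrices.

```python
import numpy as np


def concat_fusion(audio, video):
    """Join per-frame audio and video features along the feature axis."""
    return np.concatenate([audio, video], axis=-1)


def se_fusion(audio, video, w1, w2):
    """Squeeze-and-excitation style gating of the concatenated features.

    Assumption: the squeeze step averages over time, and the excitation
    step is a two-layer bottleneck (ReLU, then sigmoid).
    """
    joint = concat_fusion(audio, video)           # shape (T, Da + Dv)
    squeezed = joint.mean(axis=0)                 # shape (Da + Dv,)
    hidden = np.maximum(0.0, squeezed @ w1)       # shape (H,), ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # shape (Da + Dv,), in (0, 1)
    return joint * gate                           # rescale each feature channel


# Hypothetical sizes: 5 frames, 8 audio features, 4 video features, bottleneck 3.
rng = np.random.default_rng(0)
T, Da, Dv, H = 5, 8, 4, 3
audio = rng.standard_normal((T, Da))
video = rng.standard_normal((T, Dv))
w1 = rng.standard_normal((Da + Dv, H))
w2 = rng.standard_normal((H, Da + Dv))

fused = se_fusion(audio, video, w1, w2)
print(fused.shape)  # (5, 12)
```

In this sketch the gate can only attenuate a channel, since each sigmoid output lies strictly between 0 and 1. In a trained model the weights `w1` and `w2` would be learned, so the network could down-weight the audio channels when the acoustic input is noisy.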

Pages: 8
