Audio-Visual Cross-Modal Generation with Multimodal Variational Generative Model

Authors
Xu, Zhubin [2]
Wang, Tianlei [1, 2]
Liu, Dekang [1, 2]
Hu, Dinghan [1, 2]
Zeng, Huanqiang [3, 4]
Cao, Jiuwen [1, 2]
Affiliations
[1] Machine Learning & I Hlth Int Cooperat Base Zheji, Hangzhou, Zhejiang, Peoples R China
[2] Hangzhou Dianzi Univ, Artificial Intelligence Inst, Hangzhou, Zhejiang, Peoples R China
[3] Huaqiao Univ, Sch Engn, Fujian, Peoples R China
[4] Huaqiao Univ, Sch Informat Sci & Engn, Fujian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Audio-Visual fusion; Cross-modal generation; Variational autoencoder; Adversarial learning;
DOI
10.1109/ISCAS58744.2024.10557902
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Audio and visual signals are two important modalities in video content understanding. In practical applications, however, one modality may be missing because of real-world environmental factors, which leads to information loss. Audio-visual fusion therefore uses the shared and complementary information between modalities to recover the missing modality from the available ones. In this paper, an Adversarial Hierarchical Variational AutoEncoder (Adv-HVAE) model is proposed to address this modality loss problem. A multimodal representation is first learned with a hierarchical Variational Autoencoder (VAE) that enables generation of the missing modality from any subset of available modalities. To obtain a more robust multimodal representation, a feature generation network is then used to approximate the latent distribution of the missing modalities. Finally, an adversarial training network is shown to improve the quality of the data generated by the Adv-HVAE framework. Experimental results demonstrate that Adv-HVAE achieves the best generation results on two benchmark datasets, avMNIST and Sub-URMP.
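The abstract does not state how the hierarchical VAE combines the modalities that happen to be available. Purely as an illustration, the sketch below uses product-of-experts fusion of diagonal Gaussian posteriors, a technique common in multimodal VAEs; it is an assumption made here and is not claimed to be the Adv-HVAE mechanism. The function name `product_of_experts` and its arguments are hypothetical.

```python
import math

def product_of_experts(experts, dim):
    """Fuse diagonal-Gaussian posteriors from any subset of modalities.

    ASSUMPTION: product-of-experts fusion is used only for illustration;
    the record above does not say how Adv-HVAE fuses modalities.

    experts: list of (mu, logvar) pairs, one per available modality,
             each a list of `dim` floats.
    A standard-normal prior expert N(0, 1) is always included, so the
    function still returns a valid Gaussian when `experts` is empty.
    """
    fused_mu, fused_logvar = [], []
    for d in range(dim):
        precision = 1.0        # precision of the N(0, 1) prior expert
        weighted_mean = 0.0    # prior mean is 0, so it adds nothing here
        for mu, logvar in experts:
            p = math.exp(-logvar[d])   # precision = 1 / variance
            precision += p
            weighted_mean += p * mu[d]
        fused_mu.append(weighted_mean / precision)
        fused_logvar.append(-math.log(precision))
    return fused_mu, fused_logvar

# Hypothetical one-dimensional encoder outputs for each modality.
audio = ([2.0], [0.0])
visual = ([4.0], [0.0])

both = product_of_experts([audio, visual], dim=1)
audio_only = product_of_experts([audio], dim=1)
```

With both modalities present, the fused variance is smaller than with audio alone, which reflects the intuition that each extra modality adds evidence about the shared latent variable.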
Pages: 5
Related papers
  • [1] LEARNING AUDIO-VISUAL CORRELATIONS FROM VARIATIONAL CROSS-MODAL GENERATION
    Zhu, Ye
    Wu, Yu
    Latapie, Hugo
    Yang, Yi
    Yan, Yan
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 4300 - 4304
  • [2] Deep Cross-Modal Audio-Visual Generation
    Chen, Lele
    Srivastava, Sudhanshu
    Duan, Zhiyao
    Xu, Chenliang
    [J]. PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017, : 349 - 357
  • [3] Variational Autoencoder with CCA for Audio-Visual Cross-modal Retrieval
    Zhang, Jiwei
    Yu, Yi
    Tang, Suhua
    Wu, Jianming
    Li, Wei
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (03)
  • [4] Cross-modal prediction in audio-visual communication
    Rao, RR
    Chen, TH
    [J]. 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 2056 - 2059
  • [5] Cross-Modal Analysis of Audio-Visual Film Montage
    Zeppelzauer, Matthias
    Mitrovic, Dalibor
    Breiteneder, Christian
    [J]. 2011 20TH INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATIONS AND NETWORKS (ICCCN), 2011,
  • [6] Audio-Visual Instance Discrimination with Cross-Modal Agreement
    Morgado, Pedro
    Vasconcelos, Nuno
    Misra, Ishan
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 12470 - 12481
  • [7] Cross-Modal learning for Audio-Visual Video Parsing
    Lamba, Jatin
    Abhishek
    Akula, Jayaprakash
    Dabral, Rishabh
    Jyothi, Preethi
    Ramakrishnan, Ganesh
    [J]. INTERSPEECH 2021, 2021, : 1937 - 1941
  • [8] Temporal Cross-Modal Attention for Audio-Visual Event Localization
    Nagasaki Y.
    Hayashi M.
    Kaneko N.
    Aoki Y.
    [J]. Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03): 263 - 268
  • [9] Effect of Uncertainty in Audio-Visual Cross-Modal Statistical Learning
    Nagy, Marton
    Reguly, Helga
    Markus, Benjamin
    Fiser, Jozsef
    [J]. PERCEPTION, 2019, 48 : 109 - 109
  • [10] Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
    Liu, Chen
    Li, Peike Patrick
    Qi, Xingqun
    Zhang, Hu
    Li, Lincheng
    Wang, Dadong
    Yu, Xin
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7590 - 7598