Deep Cross-Modal Audio-Visual Generation

Cited by: 292
Authors
Chen, Lele [1]
Srivastava, Sudhanshu [1]
Duan, Zhiyao [2]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Comp Sci, Rochester, NY 14627 USA
[2] Univ Rochester, Elect & Comp Engn, Rochester, NY 14627 USA
Keywords
cross-modal generation; audio-visual; generative adversarial networks; perception
DOI
10.1145/3126686.3126723
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.
Pages: 349-357
Number of pages: 9