Deep Cross-Modal Audio-Visual Generation

Citations: 292
Authors
Chen, Lele [1 ]
Srivastava, Sudhanshu [1 ]
Duan, Zhiyao [2 ]
Xu, Chenliang [1 ]
Affiliations
[1] Univ Rochester, Comp Sci, Rochester, NY 14627 USA
[2] Univ Rochester, Elect & Comp Engn, Rochester, NY 14627 USA
Keywords
cross-modal generation; audio-visual; generative adversarial networks; perception
DOI
10.1145/3126686.3126723
CLC Classification Number
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem by leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we construct two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model can generate one modality (audio or visual) from the other (visual or audio) to a good extent. Our experiments on various design choices, together with the datasets, will facilitate future research in this new problem space.
Pages: 349-357
Number of pages: 9
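The abstract describes conditioning a generative adversarial network on an encoding of one modality in order to generate the other (e.g., synthesizing an instrument image from a sound encoding). The sketch below illustrates that idea in PyTorch; it is a minimal, hypothetical reconstruction rather than the authors' implementation, and the class names (SoundToImageGenerator, ConditionalDiscriminator), the 128-dimensional audio embedding, and the 64x64 output resolution are all assumed for illustration.

# Minimal PyTorch sketch of sound-to-image generation with a conditional GAN,
# in the spirit of the paper's instrument-oriented scenario. Layer sizes and
# names are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn

class SoundToImageGenerator(nn.Module):
    """Generate a 64x64 RGB image conditioned on an audio embedding."""
    def __init__(self, noise_dim=100, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            # Project the concatenated [noise | audio embedding] vector to a 4x4 map.
            nn.ConvTranspose2d(noise_dim + audio_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1),
            nn.Tanh(),  # output image values in [-1, 1]
        )

    def forward(self, noise, audio_emb):
        # Concatenate noise and the audio condition, reshape to a 1x1 feature map.
        z = torch.cat([noise, audio_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)

class ConditionalDiscriminator(nn.Module):
    """Score (image, audio embedding) pairs as real or generated."""
    def __init__(self, audio_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),
        )
        # Fuse the image features with the audio condition before the final score.
        self.fc = nn.Sequential(nn.Linear(512 * 4 * 4 + audio_dim, 1), nn.Sigmoid())

    def forward(self, image, audio_emb):
        h = self.conv(image).flatten(1)
        return self.fc(torch.cat([h, audio_emb], dim=1))

# Usage: generate fake images for a batch of (assumed) audio embeddings.
G = SoundToImageGenerator()
D = ConditionalDiscriminator()
noise = torch.randn(8, 100)
audio_emb = torch.randn(8, 128)   # stands in for a learned sound encoding
fake = G(noise, audio_emb)        # shape (8, 3, 64, 64)
score = D(fake, audio_emb)        # shape (8, 1), probability of being real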
Related Papers
50 records in total
  • [21] Audio-Visual Event Localization based on Cross-Modal Interacting Guidance
    Yue, Qiurui
    Wu, Xiaoyu
    Gao, Jiayi
    [J]. 2021 IEEE FOURTH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE 2021), 2021, : 104 - 107
  • [22] Perfect Match: Improved Cross-Modal Embeddings for Audio-Visual Synchronisation
    Chung, Soo-Whan
    Chung, Joon Son
    Kang, Hong-Goo
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 3965 - 3969
  • [23] Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation
    Yang, Chih-Chun
    Fan, Wan-Cyuan
    Yang, Cheng-Fu
    Wang, Yu-Chiang Frank
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3036 - 3044
  • [24] SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding
    Sun, Chao
    Chen, Min
    Cheng, Jialiang
    Liang, Han
    Zhu, Chuanbo
    Chen, Jincai
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 261 - 270
  • [25] A Novel Distance Learning for Elastic Cross-Modal Audio-Visual Matching
    Wang, Rui
    Huang, Huaibo
    Zhang, Xufeng
    Ma, Jixin
    Zheng, Aihua
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2019, : 300 - 305
  • [26] Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization
    Bao, Peijun
    Yang, Wenhan
    Ng, Boon Poh
    Er, Meng Hwa
    Kot, Alex C.
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 215 - 222
  • [27] Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization
    Xu, Haoming
    Zeng, Runhao
    Wu, Qingyao
    Tan, Mingkui
    Gan, Chuang
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3893 - 3901
  • [28] Cross-Modal Matching of Audio-Visual German and French Fluent Speech in Infancy
    Kubicek, Claudia
    de Boisferon, Anne Hillairet
    Dupierrix, Eve
    Pascalis, Olivier
    Loevenbruck, Helene
    Gervain, Judit
    Schwarzer, Gudrun
    [J]. PLOS ONE, 2014, 9 (02):
  • [29] Cross-modal selection of multiple features: ERPs to audio-visual compound stimuli
    Balazs, L
    Czigler, I
    [J]. JOURNAL OF PSYCHOPHYSIOLOGY, 1998, 12 (01) : 78 - 78
  • [30] Audio-visual cross-modal concept of familiar persons in dogs (Canis familiaris)
    Ogura, Tadatoshi
    Izumi, Shoko
    Imai, Miku
    Nagano, Sakurako
    Matsuura, Akihiro
    [J]. INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2016, 51 : 261 - 261