Deep Cross-Modal Audio-Visual Generation

Cited by: 299
Authors
Chen, Lele [1]
Srivastava, Sudhanshu [1]
Duan, Zhiyao [2]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Comp Sci, Rochester, NY 14627 USA
[2] Univ Rochester, Elect & Comp Engn, Rochester, NY 14627 USA
Source
PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17) | 2017
Keywords
cross-modal generation; audio-visual; generative adversarial networks; perception
DOI
10.1145/3126686.3126723
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.
Pages: 349-357
Page count: 9