Deep Cross-Modal Audio-Visual Generation

Cited by: 299

Authors:

Chen, Lele ^{[1]}

Srivastava, Sudhanshu ^{[1]}

Duan, Zhiyao ^{[2]}

Xu, Chenliang ^{[1]}

Affiliations:

[1] Univ Rochester, Comp Sci, Rochester, NY 14627 USA

[2] Univ Rochester, Elect & Comp Engn, Rochester, NY 14627 USA

Source:

PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17) | 2017

Keywords:

cross-modal generation; audio-visual; generative adversarial networks; PERCEPTION;

DOI:

10.1145/3126686.3126723

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.

Pages: 349-357

Page count: 9

50 records in total

[1] Cross-modal prediction in audio-visual communication
Rao, RR
Chen, TH
1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996: 2056-2059
[2] Audio-Visual Cross-Modal Generation with Multimodal Variational Generative Model
Xu, Zhubin
Wang, Tianlei
Liu, Dekang
Hu, Dinghan
Zeng, Huanqiang
Cao, Jiuwen
2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024
[3] LEARNING AUDIO-VISUAL CORRELATIONS FROM VARIATIONAL CROSS-MODAL GENERATION
Zhu, Ye
Wu, Yu
Latapie, Hugo
Yang, Yi
Yan, Yan
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 4300-4304
[4] Cross-Modal Analysis of Audio-Visual Film Montage
Zeppelzauer, Matthias
Mitrovic, Dalibor
Breiteneder, Christian
2011 20TH INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATIONS AND NETWORKS (ICCCN), 2011
[5] Audio-Visual Instance Discrimination with Cross-Modal Agreement
Morgado, Pedro
Vasconcelos, Nuno
Misra, Ishan
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 12470-12481
[6] Cross-Modal learning for Audio-Visual Video Parsing
Lamba, Jatin
Abhishek
Akula, Jayaprakash
Dabral, Rishabh
Jyothi, Preethi
Ramakrishnan, Ganesh
INTERSPEECH 2021, 2021: 1937-1941
[7] Variational Autoencoder with CCA for Audio-Visual Cross-modal Retrieval
Zhang, Jiwei
Yu, Yi
Tang, Suhua
Wu, Jianming
Li, Wei
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (03)
[8] Temporal Cross-Modal Attention for Audio-Visual Event Localization
Nagasaki Y.
Hayashi M.
Kaneko N.
Aoki Y.
Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03): 263-268
[9] Effect of Uncertainty in Audio-Visual Cross-Modal Statistical Learning
Nagy, Marton
Reguly, Helga
Markus, Benjamin
Fiser, Jozsef
PERCEPTION, 2019, 48: 109
[10] Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
Liu, Chen
Li, Peike Patrick
Qi, Xingqun
Zhang, Hu
Li, Lincheng
Wang, Dadong
Yu, Xin
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 7590-7598
