Deep Cross-Modal Audio-Visual Generation

Citations: 292
Authors
Chen, Lele [1 ]
Srivastava, Sudhanshu [1 ]
Duan, Zhiyao [2 ]
Xu, Chenliang [1 ]
Affiliations
[1] Univ Rochester, Comp Sci, Rochester, NY 14627 USA
[2] Univ Rochester, Elect & Comp Engn, Rochester, NY 14627 USA
Keywords
cross-modal generation; audio-visual; generative adversarial networks; perception
DOI
10.1145/3126686.3126723
CLC Classification Number
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem by leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we construct two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model can generate one modality (audio or visual) from the other (visual or audio) to a good extent. Our experiments on various design choices, together with the datasets, will facilitate future research in this new problem space.
Pages: 349-357
Number of pages: 9
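The abstract describes conditioning a generative adversarial network on an encoding of one modality in order to generate the other (e.g., synthesizing an instrument image from a sound encoding). The sketch below illustrates that idea in PyTorch; it is a minimal, hypothetical reconstruction rather than the authors' implementation, and the class names (SoundToImageGenerator, ConditionalDiscriminator), the 128-dimensional audio embedding, and the 64x64 output resolution are all assumed for illustration.

# Minimal PyTorch sketch of sound-to-image generation with a conditional GAN,
# in the spirit of the paper's instrument-oriented scenario. Layer sizes and
# names are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn

class SoundToImageGenerator(nn.Module):
    """Generate a 64x64 RGB image conditioned on an audio embedding."""
    def __init__(self, noise_dim=100, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            # Project the concatenated [noise | audio embedding] vector to a 4x4 map.
            nn.ConvTranspose2d(noise_dim + audio_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1),
            nn.Tanh(),  # output image values in [-1, 1]
        )

    def forward(self, noise, audio_emb):
        # Concatenate noise and the audio condition, reshape to a 1x1 feature map.
        z = torch.cat([noise, audio_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)

class ConditionalDiscriminator(nn.Module):
    """Score (image, audio embedding) pairs as real or generated."""
    def __init__(self, audio_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),
        )
        # Fuse the image features with the audio condition before the final score.
        self.fc = nn.Sequential(nn.Linear(512 * 4 * 4 + audio_dim, 1), nn.Sigmoid())

    def forward(self, image, audio_emb):
        h = self.conv(image).flatten(1)
        return self.fc(torch.cat([h, audio_emb], dim=1))

# Usage: generate fake images for a batch of (assumed) audio embeddings.
G = SoundToImageGenerator()
D = ConditionalDiscriminator()
noise = torch.randn(8, 100)
audio_emb = torch.randn(8, 128)   # stands in for a learned sound encoding
fake = G(noise, audio_emb)        # shape (8, 3, 64, 64)
score = D(fake, audio_emb)        # shape (8, 1), probability of being real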
Related Papers
50 records in total
  • [21] Audio-Visual Event Localization based on Cross-Modal Interacting Guidance
    Yue, Qiurui
    Wu, Xiaoyu
    Gao, Jiayi
    [J]. 2021 IEEE FOURTH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE 2021), 2021, : 104 - 107
  • [22] Perfect Match: Improved Cross-Modal Embeddings for Audio-Visual Synchronisation
    Chung, Soo-Whan
    Chung, Joon Son
    Kang, Hong-Goo
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 3965 - 3969
  • [23] Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation
    Yang, Chih-Chun
    Fan, Wan-Cyuan
    Yang, Cheng-Fu
    Wang, Yu-Chiang Frank
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3036 - 3044
  • [24] SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding
    Sun, Chao
    Chen, Min
    Cheng, Jialiang
    Liang, Han
    Zhu, Chuanbo
    Chen, Jincai
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 261 - 270
  • [25] A Novel Distance Learning for Elastic Cross-Modal Audio-Visual Matching
    Wang, Rui
    Huang, Huaibo
    Zhang, Xufeng
    Ma, Jixin
    Zheng, Aihua
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2019, : 300 - 305
  • [26] Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization
    Bao, Peijun
    Yang, Wenhan
    Ng, Boon Poh
    Er, Meng Hwa
    Kot, Alex C.
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 215 - 222
  • [27] Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization
    Xu, Haoming
    Zeng, Runhao
    Wu, Qingyao
    Tan, Mingkui
    Gan, Chuang
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3893 - 3901
  • [28] Cross-Modal Matching of Audio-Visual German and French Fluent Speech in Infancy
    Kubicek, Claudia
    de Boisferon, Anne Hillairet
    Dupierrix, Eve
    Pascalis, Olivier
    Loevenbruck, Helene
    Gervain, Judit
    Schwarzer, Gudrun
    [J]. PLOS ONE, 2014, 9 (02):
  • [29] Cross-modal selection of multiple features: ERPs to audio-visual compound stimuli
    Balazs, L
    Czigler, I
    [J]. JOURNAL OF PSYCHOPHYSIOLOGY, 1998, 12 (01) : 78 - 78
  • [30] Audio-visual cross-modal concept of familiar persons in dogs (Canis familiaris)
    Ogura, Tadatoshi
    Izumi, Shoko
    Imai, Miku
    Nagano, Sakurako
    Matsuura, Akihiro
    [J]. INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2016, 51 : 261 - 261