Learning Bimodal Structure in Audio-Visual Data

Cited by: 25
Authors
Monaci, Gianluca [1]
Vandergheynst, Pierre [2]
Sommer, Friedrich T. [1]
Affiliations
[1] Univ Calif Berkeley, Redwood Ctr Theoret Neurosci, Berkeley, CA 94720 USA
[2] Ecole Polytech Fed Lausanne, Inst Elect Engn, CH-1015 Lausanne, Switzerland
Source
IEEE TRANSACTIONS ON NEURAL NETWORKS | 2009 / Vol. 20 / No. 12
Funding
Swiss National Science Foundation; US National Science Foundation;
Keywords
Audio-visual source localization; dictionary learning; matching pursuit (MP); multimodal data processing; sparse representation; SOURCE SEPARATION; SPARSE; REPRESENTATIONS; APPROXIMATION; RECOGNITION; EXTRACTION; SEQUENCES; SOUNDS; LEVEL;
DOI
10.1109/TNN.2009.2032182
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
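The sparse-sum representation described in the abstract can be illustrated with a small greedy decomposition. The following is a minimal sketch under stated assumptions, not the authors' algorithm: the function name `bimodal_matching_pursuit`, the toy video layout (time by a single spatial axis), the shared sampling rate for audio and video, and the selection score (sum of squared audio and video projections) are all hypothetical choices made for illustration.

```python
import numpy as np

def bimodal_matching_pursuit(audio, video, kernels, n_atoms):
    """Greedy sparse decomposition with shiftable bimodal kernels (toy sketch).

    audio:   1-D array, shape (T,)
    video:   2-D array, shape (T, X), time by one spatial axis (assumption)
    kernels: list of (ka, kv) pairs; ka has shape (L,), kv has shape (L, W),
             and each part is assumed to have unit norm
    Returns the selected atoms (kernel, time, position, audio coef, video coef)
    and the audio and video residuals.
    """
    res_a = audio.astype(float).copy()
    res_v = video.astype(float).copy()
    T, X = res_v.shape
    atoms = []
    for _ in range(n_atoms):
        best = None
        for k, (ka, kv) in enumerate(kernels):
            L, W = kv.shape
            for t in range(T - L + 1):
                # Audio projection depends only on the time shift.
                ca = float(res_a[t:t + L] @ ka)
                for x in range(X - W + 1):
                    # Video projection depends on time and spatial shift.
                    cv = float(np.sum(res_v[t:t + L, x:x + W] * kv))
                    score = ca ** 2 + cv ** 2  # assumed joint selection rule
                    if best is None or score > best[0]:
                        best = (score, k, t, x, ca, cv)
        _, k, t, x, ca, cv = best
        ka, kv = kernels[k]
        L, W = kv.shape
        # Subtract the selected atom from both residuals.
        res_a[t:t + L] -= ca * ka
        res_v[t:t + L, x:x + W] -= cv * kv
        atoms.append((k, t, x, ca, cv))
    return atoms, res_a, res_v

# Synthetic check: plant one bimodal kernel and try to recover it.
rng = np.random.default_rng(0)
ka = rng.standard_normal(8)
ka /= np.linalg.norm(ka)
kv = rng.standard_normal((8, 3))
kv /= np.linalg.norm(kv)
audio = np.zeros(40)
video = np.zeros((40, 10))
audio[12:20] += 2.0 * ka
video[12:20, 5:8] += 3.0 * kv
atoms, res_a, res_v = bimodal_matching_pursuit(audio, video, [(ka, kv)], n_atoms=1)
print(atoms[0][:3])
```

In this toy setting, the spatial index of each selected atom plays the role of a source location estimate, which loosely mirrors the localization use described in the abstract. Dictionary learning itself (updating the kernels from data) is not shown.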
Pages: 1898 - 1910
Page count: 13