Learning Bimodal Structure in Audio-Visual Data

Cited by: 25
Authors
Monaci, Gianluca [1]
Vandergheynst, Pierre [2]
Sommer, Friedrich T. [1]
Affiliations
[1] Univ Calif Berkeley, Redwood Ctr Theoret Neurosci, Berkeley, CA 94720 USA
[2] Ecole Polytech Fed Lausanne, Inst Elect Engn, CH-1015 Lausanne, Switzerland
Source
IEEE TRANSACTIONS ON NEURAL NETWORKS | 2009 / Vol. 20 / No. 12
Funding
Swiss National Science Foundation; US National Science Foundation;
Keywords
Audio-visual source localization; dictionary learning; matching pursuit (MP); multimodal data processing; sparse representation; SOURCE SEPARATION; SPARSE; REPRESENTATIONS; APPROXIMATION; RECOGNITION; EXTRACTION; SEQUENCES; SOUNDS; LEVEL;
DOI
10.1109/TNN.2009.2032182
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
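The sparse representation described in the abstract can be illustrated with a toy greedy matching pursuit over time-shiftable bimodal kernels. This is a minimal sketch under simplifying assumptions that are not taken from the paper: both modalities are 1-D sequences of equal length (so there is no spatial shift of the visual part), each atom uses a single shared coefficient for audio and video, and the dictionary is fixed by hand rather than learned. All function and variable names are hypothetical.

```python
# Toy sketch: greedy matching pursuit with time-shiftable bimodal kernels.
# Assumptions (not from the paper): both modalities are 1-D sequences of
# equal length, and one coefficient is shared by the audio and video parts.

def normalize(audio_part, video_part):
    """Scale a bimodal kernel so that its joint (audio + video) norm is 1."""
    norm = sum(x * x for x in audio_part + video_part) ** 0.5
    return [x / norm for x in audio_part], [x / norm for x in video_part]

def inner(signal, kernel, shift):
    """Inner product of a kernel with the signal window starting at shift."""
    return sum(signal[shift + i] * k for i, k in enumerate(kernel))

def matching_pursuit(audio, video, kernels, n_atoms):
    """Return the selected atoms (kernel index, shift, coefficient) and residuals."""
    res_a, res_v = list(audio), list(video)
    atoms = []
    for _ in range(n_atoms):
        best = None
        # Search every kernel at every time shift for the largest projection.
        for k, (ka, kv) in enumerate(kernels):
            for shift in range(len(res_a) - len(ka) + 1):
                c = inner(res_a, ka, shift) + inner(res_v, kv, shift)
                if best is None or abs(c) > abs(best[2]):
                    best = (k, shift, c)
        k, shift, c = best
        ka, kv = kernels[k]
        # Subtract the selected atom from both modalities of the residual.
        for i in range(len(ka)):
            res_a[shift + i] -= c * ka[i]
            res_v[shift + i] -= c * kv[i]
        atoms.append(best)
    return atoms, res_a, res_v

# Hand-made dictionary of two bimodal kernels (audio snippet, video snippet).
kernels = [
    normalize([1.0, -1.0, 0.0], [0.0, 1.0, 1.0]),
    normalize([1.0, 1.0, 1.0], [1.0, 0.0, -1.0]),
]

# Synthetic signal: two non-overlapping atoms planted at known positions.
audio, video = [0.0] * 10, [0.0] * 10
for k, shift, c in [(0, 2, 2.0), (1, 6, -1.5)]:
    ka, kv = kernels[k]
    for i in range(len(ka)):
        audio[shift + i] += c * ka[i]
        video[shift + i] += c * kv[i]

atoms, res_a, res_v = matching_pursuit(audio, video, kernels, n_atoms=2)
for k, shift, c in atoms:
    print(f"kernel {k} at t={shift}, coefficient {c:.3f}")
```

On this synthetic input the pursuit recovers the two planted atoms and leaves a residual that is zero up to rounding. Extending the sketch toward the setting in the abstract would require a spatio-temporal visual part that can also be shifted in space, plus a learning step that updates the kernels from data.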
Pages: 1898-1910
Page count: 13