Learning Bimodal Structure in Audio-Visual Data

Cited by: 25
Authors
Monaci, Gianluca [1]
Vandergheynst, Pierre [2]
Sommer, Friedrich T. [1]
Affiliations
[1] Univ Calif Berkeley, Redwood Ctr Theoret Neurosci, Berkeley, CA 94720 USA
[2] Ecole Polytech Fed Lausanne, Inst Elect Engn, CH-1015 Lausanne, Switzerland
Source
IEEE TRANSACTIONS ON NEURAL NETWORKS | 2009 / Vol. 20 / No. 12
Funding
Swiss National Science Foundation; US National Science Foundation;
Keywords
Audio-visual source localization; dictionary learning; matching pursuit (MP); multimodal data processing; sparse representation; SOURCE SEPARATION; SPARSE; REPRESENTATIONS; APPROXIMATION; RECOGNITION; EXTRACTION; SEQUENCES; SOUNDS; LEVEL;
DOI
10.1109/TNN.2009.2032182
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
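The sparse-sum representation described in the abstract can be illustrated with a toy matching pursuit (MP, listed among the keywords). The sketch below is not the authors' algorithm: it assumes a one-dimensional stand-in for the video track, kernels whose audio and visual parts share a single time shift, and a fixed, hand-written dictionary instead of a learned one. All names (`normalize`, `place`, `matching_pursuit`) are hypothetical.

```python
import math


def normalize(audio_part, visual_part):
    """Scale a bimodal kernel so that its joint (audio + visual) energy is 1."""
    norm = math.sqrt(sum(x * x for x in audio_part) + sum(x * x for x in visual_part))
    return [x / norm for x in audio_part], [x / norm for x in visual_part]


def place(audio, visual, kernel, shift, coef):
    """Add coef * kernel to the signal at the given time shift (in place)."""
    ka, kv = kernel
    for i in range(len(ka)):
        audio[shift + i] += coef * ka[i]
        visual[shift + i] += coef * kv[i]


def matching_pursuit(audio, visual, kernels, n_atoms):
    """Greedy sparse decomposition of a toy bimodal signal.

    audio, visual: equal-length lists of floats (visual is a 1-D toy track).
    kernels: non-empty list of (audio_part, visual_part) pairs of equal
             length, each with unit joint energy.
    Returns (atoms, residual_audio, residual_visual), where each atom is a
    tuple (kernel_index, shift, coefficient).
    """
    res_a, res_v = list(audio), list(visual)
    atoms = []
    for _ in range(n_atoms):
        best = None
        # Search every kernel at every admissible shift for the largest
        # joint correlation with the current residual.
        for k, (ka, kv) in enumerate(kernels):
            length = len(ka)
            for shift in range(len(res_a) - length + 1):
                coef = sum(
                    res_a[shift + i] * ka[i] + res_v[shift + i] * kv[i]
                    for i in range(length)
                )
                if best is None or abs(coef) > abs(best[2]):
                    best = (k, shift, coef)
        k, shift, coef = best
        # Subtract the selected atom from both modalities of the residual.
        place(res_a, res_v, kernels[k], shift, -coef)
        atoms.append(best)
    return atoms, res_a, res_v


# Toy dictionary of two bimodal kernels (values chosen arbitrarily).
kernels = [
    normalize([1.0, -1.0], [1.0, 1.0]),
    normalize([1.0, 1.0], [0.0, 1.0]),
]

# Synthesize a signal from two non-overlapping atoms, then decompose it.
audio, visual = [0.0] * 8, [0.0] * 8
place(audio, visual, kernels[0], 5, 2.0)
place(audio, visual, kernels[1], 1, 1.0)

atoms, res_a, res_v = matching_pursuit(audio, visual, kernels, n_atoms=2)
residual_energy = sum(x * x for x in res_a) + sum(x * x for x in res_v)
```

Because the two planted atoms do not overlap in time, the greedy search recovers them in order of decreasing amplitude and leaves a residual that is zero up to rounding. With overlapping or noisy atoms, plain MP gives only an approximation.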
Pages: 1898-1910
Number of pages: 13