Learning Bimodal Structure in Audio-Visual Data

Cited by: 25

Authors:

Monaci, Gianluca [1]

Vandergheynst, Pierre [2]

Sommer, Friedrich T. [1]

Affiliations:

[1] Univ Calif Berkeley, Redwood Ctr Theoret Neurosci, Berkeley, CA 94720 USA

[2] Ecole Polytech Fed Lausanne, Inst Elect Engn, CH-1015 Lausanne, Switzerland

Source:

IEEE TRANSACTIONS ON NEURAL NETWORKS | 2009 / Vol. 20 / No. 12

Funding:

Swiss National Science Foundation; U.S. National Science Foundation

Keywords:

Audio-visual source localization; dictionary learning; matching pursuit (MP); multimodal data processing; sparse representation; SOURCE SEPARATION; SPARSE; REPRESENTATIONS; APPROXIMATION; RECOGNITION; EXTRACTION; SEQUENCES; SOUNDS; LEVEL;

DOI:

10.1109/TNN.2009.2032182

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
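The abstract describes representing a signal as a sparse sum of kernels that can be placed at arbitrary positions, and the keywords name matching pursuit (MP) as the decomposition method. The sketch below illustrates that idea in a deliberately reduced setting: a single 1-D (audio-only) signal and a fixed, hand-made dictionary. It is not the paper's algorithm. The bimodal kernels (audio waveform plus spatio-temporal visual function) and the dictionary learning step are omitted, and the function name `matching_pursuit`, the kernel shapes, and all parameters are assumptions made for illustration.

```python
import numpy as np

def matching_pursuit(signal, kernels, n_atoms):
    """Greedy shift-invariant matching pursuit on a 1-D signal.

    kernels: list of unit-norm 1-D arrays, each shorter than the signal.
    Returns a list of (kernel_index, shift, coefficient) and the residual.
    """
    residual = np.asarray(signal, dtype=float).copy()
    atoms = []
    for _ in range(n_atoms):
        best = None
        for k, kernel in enumerate(kernels):
            # Inner product of the residual with this kernel at every shift.
            corr = np.correlate(residual, kernel, mode="valid")
            shift = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[shift]) > abs(best[2]):
                best = (k, shift, float(corr[shift]))
        k, shift, coeff = best
        # Subtract the selected kernel, placed at its best shift.
        residual[shift:shift + len(kernels[k])] -= coeff * kernels[k]
        atoms.append(best)
    return atoms, residual

# Hypothetical two-kernel dictionary: windowed sinusoids, normalized to unit norm.
n = np.arange(16)
kernels = []
for freq in (2, 4):
    g = np.hanning(16) * np.sin(2 * np.pi * freq * n / 16)
    kernels.append(g / np.linalg.norm(g))

# Synthetic signal: two kernels placed at well-separated shifts.
signal = np.zeros(100)
signal[10:26] += 2.0 * kernels[0]
signal[60:76] += -1.5 * kernels[1]

atoms, residual = matching_pursuit(signal, kernels, n_atoms=2)
for k, shift, coeff in atoms:
    print(f"kernel {k} at shift {shift} with coefficient {coeff:.3f}")
```

Because the two synthetic atoms do not overlap, two greedy steps recover them and leave a residual close to zero. On real data the kernels would overlap and the decomposition would only be approximate.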


Pages: 1898-1910

Page count: 13

