A developmental model of audio-visual attention (MAVA) for bimodal language learning in infants and robots

Cited by: 0

Authors:

Bergoin, Raphael [1]

Boucenna, Sofiane [1]

D'Urso, Raphael [1]

Cohen, David [2,3]

Pitti, Alexandre [1]

Affiliations:

[1] CY Cergy Paris Univ, ENSEA, CNRS, ETIS, UMR 8051, Cergy Pontoise, France

[2] Hop La Pitie Salpetriere, AP HP, Serv Psychiat Enfant & Adolescent, Paris, France

[3] Univ Pierre & Marie Curie Paris, Inst Syst Intelligents & Robot, Paris, France

Source:

SCIENTIFIC REPORTS | 2024 / Vol. 14 / Issue 01

Keywords:

VISUAL-ATTENTION; TALKING-FACE; SYNCHRONY; PERCEPTION; SPEECH; OBJECT; EYES;

DOI:

10.1038/s41598-024-69245-2

Chinese Library Classification (CLC) number:

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];

Discipline classification code:

07 ; 0710 ; 09 ;

Abstract:

A social individual needs to effectively manage the amount of complex information in his or her environment relative to his or her own purpose to obtain relevant information. This paper presents a neural architecture that aims to reproduce, in robots, the attention mechanisms (alerting/orienting/selecting) that are efficient in humans during audiovisual tasks. We evaluated the system based on its ability to identify relevant sources of information on faces of subjects emitting vowels. We propose a developmental model of audio-visual attention (MAVA) combining Hebbian learning and a competition between saliency maps based on visual movement and audio energy. MAVA effectively combines bottom-up and top-down information to orient the system toward pertinent areas. The system has several advantages, including online and autonomous learning abilities, low computation time and robustness to environmental noise. MAVA outperforms other artificial models for detecting speech sources under various noise conditions.
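
The abstract names two ingredients, Hebbian learning and a competition between saliency maps driven by visual movement and audio energy. The Python sketch below is only an illustration of that general idea and is not the authors' implementation: the function names, the 8x8 grid, the frame-difference saliency, the learning rate and decay, the `(1 + weights) * saliency` combination rule, and the toy stimulus (a "mouth" pixel that moves only while a tone plays, plus a distractor pixel that moves on an unrelated schedule) are all assumptions made for the example.

```python
import numpy as np

def motion_saliency(prev_frame, frame):
    """Bottom-up visual map: absolute frame difference scaled to [0, 1] (assumed form)."""
    diff = np.abs(np.asarray(frame, dtype=float) - np.asarray(prev_frame, dtype=float))
    peak = diff.max()
    return diff / peak if peak > 0 else diff

def audio_energy(samples):
    """Mean squared amplitude of one audio window."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(samples ** 2))

def hebbian_update(weights, saliency, energy, lr=0.1, decay=0.01):
    """Hebbian rule with passive decay: weights grow where motion co-occurs with sound."""
    return weights + lr * energy * saliency - decay * weights

def attend(weights, saliency):
    """Winner-take-all over bottom-up saliency modulated by the learned (top-down) weights."""
    combined = (1.0 + weights) * saliency  # hypothetical combination rule
    row, col = np.unravel_index(np.argmax(combined), combined.shape)
    return int(row), int(col)

def run_demo(steps=120):
    """Toy stimulus: pixel (2, 2) flickers only during sound, pixel (5, 5) every third step."""
    weights = np.zeros((8, 8))
    frame = np.zeros((8, 8))
    tone = np.sin(2 * np.pi * 8 * np.arange(256) / 256)
    for t in range(steps):
        speaking = t % 4 < 2                   # two steps of sound, two of near-silence
        prev = frame.copy()
        if speaking:
            frame[2, 2] = 1.0 - frame[2, 2]    # "mouth" moves in synchrony with audio
        if t % 3 == 0:
            frame[5, 5] = 1.0 - frame[5, 5]    # distractor moves independently of audio
        audio = tone if speaking else 0.05 * tone
        saliency = motion_saliency(prev, frame)
        weights = hebbian_update(weights, saliency, audio_energy(audio))
    return weights

if __name__ == "__main__":
    learned = run_demo()
    probe = np.zeros((8, 8))
    probe[2, 2] = probe[5, 5] = 1.0            # both locations move equally in the probe
    print("attended location:", attend(learned, probe))
```

In this toy setting the location whose motion coincides with audio energy accumulates the larger weight, so it wins the competition when both locations move equally. A plain Hebbian rule like this one would not reject a distractor that moves continuously; handling that case would need a different rule, which the sketch does not attempt.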


Pages: 9
