A developmental model of audio-visual attention (MAVA) for bimodal language learning in infants and robots

Cited: 0
Authors
Bergoin, Raphael [1 ]
Boucenna, Sofiane [1 ]
D'Urso, Raphael [1 ]
Cohen, David [2 ,3 ]
Pitti, Alexandre [1 ]
Affiliations
[1] CY Cergy Paris Univ, ENSEA, CNRS, ETIS, UMR 8051, Cergy Pontoise, France
[2] Hop La Pitie Salpetriere, AP HP, Serv Psychiat Enfant & Adolescent, Paris, France
[3] Univ Pierre & Marie Curie Paris, Inst Syst Intelligents & Robot, Paris, France
Source
SCIENTIFIC REPORTS | 2024, Vol. 14, No. 1
Keywords
VISUAL-ATTENTION; TALKING-FACE; SYNCHRONY; PERCEPTION; SPEECH; OBJECT; EYES;
DOI
10.1038/s41598-024-69245-2
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biosciences]; N [General Natural Sciences];
Subject classification
07; 0710; 09;
Abstract
A social individual needs to manage the large amount of complex information in his or her environment effectively, relative to his or her own goals, in order to extract the relevant part of it. This paper presents a neural architecture that aims to reproduce in robots the attention mechanisms (alerting/orienting/selecting) that humans deploy efficiently during audio-visual tasks. We evaluated the system on its ability to identify relevant sources of information on the faces of subjects emitting vowels. We propose a developmental model of audio-visual attention (MAVA) that combines Hebbian learning with a competition between saliency maps based on visual movement and audio energy. MAVA effectively combines bottom-up and top-down information to orient the system toward pertinent areas. The system has several advantages, including online and autonomous learning, low computation time, and robustness to environmental noise. MAVA outperforms other artificial models at detecting speech sources under various noise conditions.
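The record contains no equations, but the abstract names two concrete mechanisms: a Hebbian audio-visual association and a competition between a saliency map driven by visual movement and one driven by audio energy. The Python sketch below is a hypothetical toy illustration of how such a loop could be wired, not the authors' implementation; the map resolution, the FFT-band audio features, the element-wise max competition rule, and the learning rate are all assumptions.

import numpy as np

rng = np.random.default_rng(0)

H, W = 32, 32         # assumed saliency-map resolution
N_AUDIO = 16          # assumed number of audio energy bands
LEARNING_RATE = 0.05  # assumed Hebbian learning rate

def visual_motion_saliency(prev_frame, frame):
    """Bottom-up visual saliency: frame-to-frame movement, normalised."""
    sal = np.abs(frame - prev_frame)
    return sal / (sal.sum() + 1e-9)

def audio_energy(signal, n_bands=N_AUDIO):
    """Bottom-up audio saliency: energy per frequency band (toy FFT version)."""
    spectrum = np.abs(np.fft.rfft(signal))[:n_bands]
    return spectrum / (spectrum.sum() + 1e-9)

def step(prev_frame, frame, signal, weights):
    """One attention step: compete between the maps, orient, then learn."""
    v_sal = visual_motion_saliency(prev_frame, frame).ravel()
    a_feat = audio_energy(signal)

    # Top-down map: where the learned audio-visual associations predict
    # the sound source to be, given the current audio features.
    a_sal = a_feat @ weights
    a_sal = a_sal / (a_sal.sum() + 1e-9)

    # Competition between the two maps; a simple element-wise max stands
    # in for the competition mechanism named in the abstract.
    combined = np.maximum(v_sal, a_sal)
    focus = np.unravel_index(np.argmax(combined), (H, W))

    # Hebbian update (online, no labels): co-active audio bands and
    # visual locations strengthen their connection, in place.
    weights += LEARNING_RATE * np.outer(a_feat, v_sal)
    return focus

# Hebbian weights linking audio bands to visual locations (top-down path).
W_av = np.zeros((N_AUDIO, H * W))

# Toy usage: a "moving mouth" patch co-occurring with a pure tone.
prev = rng.random((H, W))
frame = prev.copy()
frame[10:14, 10:14] += 1.0                            # movement at one spot
tone = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 64))  # 64-sample tone
print("attended location:", step(prev, frame, tone, W_av))

After repeated co-occurrences of lip movement and vowel energy, the learned weights would bias attention toward the speaker's mouth even when the motion cue alone is ambiguous, which is the bottom-up/top-down combination the abstract describes.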
Pages: 9
Related papers
50 items in total
  • [21] Audio-visual integration during overt visual attention
    Quigley, Cliodhna
    Onat, Selim
    Harding, Sue
    Cooke, Martin
    Koenig, Peter
    JOURNAL OF EYE MOVEMENT RESEARCH, 2007, 1 (02)
  • [22] Audio-Visual Attention Networks for Emotion Recognition
    Lee, Jiyoung
    Kim, Sunok
    Kim, Seungryong
    Sohn, Kwanghoon
    AVSU'18: PROCEEDINGS OF THE 2018 WORKSHOP ON AUDIO-VISUAL SCENE UNDERSTANDING FOR IMMERSIVE MULTIMEDIA, 2018, : 27 - 32
  • [23] Masked co-attention model for audio-visual event localization
    Liu, Hengwei
    Gu, Xiaodong
    APPLIED INTELLIGENCE, 2024, 54 (02) : 1691 - 1705
  • [25] VIDEO CODING BASED ON AUDIO-VISUAL ATTENTION
    Lee, Jong-Seok
    De Simone, Francesca
    Ebrahimi, Touradj
    ICME: 2009 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-3, 2009, : 57 - 60
  • [26] AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
    Rouditchenko, Andrew
    Boggust, Angie
    Harwath, David
    Chen, Brian
    Joshi, Dhiraj
    Thomas, Samuel
    Audhkhasi, Kartik
    Kuehne, Hilde
    Panda, Rameswar
    Feris, Rogerio
    Kingsbury, Brian
    Picheny, Michael
    Torralba, Antonio
    Glass, James
    INTERSPEECH 2021, 2021, : 1584 - 1588
  • [27] Bimodal audio-visual training enhances auditory adaptation process
    Kawase, Tetsuaki
    Sakamoto, Shuichi
    Hori, Yoko
    Maki, Atsuko
    Suzuki, Yoiti
    Kobayashi, Toshimitsu
    NEUROREPORT, 2009, 20 (14) : 1231 - 1234
  • [28] Speaker Localization Based on Audio-Visual Bimodal Fusion
    Zhu, Ying-Xin
    Jin, Hao-Ran
    JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2021, 25 (03) : 375 - 382
  • [29] Bimodal Perception of Audio-Visual Material Properties for Virtual Environments
    Bonneel, Nicolas
    Suied, Clara
    Viaud-Delmon, Isabelle
    Drettakis, George
    ACM TRANSACTIONS ON APPLIED PERCEPTION, 2010, 7 (01)
  • [30] A Biologically Plausible Audio-Visual Integration Model for Continual Learning
    Chen, Wenjie
    Du, Fengtong
    Wang, Ye
    Cao, Lihong
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,