A Neural Network Architecture for Children's Audio-Visual Emotion Recognition

Cited: 1
Authors
Matveev, Anton [1 ]
Matveev, Yuri [1 ]
Frolova, Olga [1 ]
Nikolaev, Aleksandr [1 ]
Lyakso, Elena [1 ]
Affiliations
[1] St Petersburg Univ, Dept Higher Nervous Act & Psychophysiol, Child Speech Res Grp, St Petersburg 199034, Russia
Funding
Russian Science Foundation
Keywords
audio-visual speech; emotion recognition; children; multimodal fusion; speech; age
DOI
10.3390/math11224573
Chinese Library Classification
O1 [Mathematics]
Discipline codes
0701; 070101
Abstract
Detecting and understanding emotions is critical for our daily activities. As emotion recognition (ER) systems develop, we move beyond acted adult audio-visual speech to more difficult cases. In this work, we investigate the automatic classification of children's audio-visual emotional speech, which presents several challenges, including the lack of publicly available annotated datasets and the low performance of state-of-the-art audio-visual ER systems. We present a new corpus of children's audio-visual emotional speech that we collected. We then propose a neural network solution that improves the use of temporal relationships between the audio and video modalities in cross-modal fusion for children's audio-visual emotion recognition. We select a state-of-the-art neural network architecture as a baseline and present several modifications focused on deeper learning of cross-modal temporal relationships using attention. In experiments with the proposed approach and the selected baseline model, we observe a relative performance improvement of 2%. We conclude that a stronger focus on cross-modal temporal relationships may benefit ER systems for child-machine communication and for environments where qualified professionals work with children.
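The abstract describes attention-based cross-modal fusion of temporally structured audio and video features. The paper's actual architecture is not reproduced here; the sketch below only illustrates the general idea of one modality attending over the other's time steps via scaled dot-product attention. All names, dimensions, and the random toy features are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: the query modality attends
    over the key/value modality's time steps (a generic sketch,
    not the authors' exact fusion block)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Tq, Tk) temporal alignment scores
    weights = softmax(scores, axis=-1)       # each query step sums to 1 over key steps
    return weights @ values                  # (Tq, d) fused representation

# Toy features: hypothetical 10 audio frames and 6 video frames, 16-dim each.
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 16))
video = rng.standard_normal((6, 16))

# Audio attends to video and vice versa; the two outputs could then
# be pooled or concatenated for a joint audio-visual embedding.
a2v = cross_modal_attention(audio, video, video)  # shape (10, 16)
v2a = cross_modal_attention(video, audio, audio)  # shape (6, 16)
```

Because each modality's query sequence keeps its own temporal length while mixing in the other modality's content, this kind of block lets the network align audio frames with the video frames most relevant to them, which is the temporal relationship the paper's modifications aim to exploit more deeply.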
Pages: 17