Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Cited: 47
Authors
Lu, Cheng [1 ]
Zheng, Wenming [2 ]
Li, Chaolong [3 ]
Tang, Chuangao [3 ]
Liu, Suyuan [3 ]
Yan, Simeng [3 ]
Zong, Yuan [3 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Jiangsu, Peoples R China
[2] Southeast Univ, Sch Biol Sci & Med Engn, Minist Educ, Key Lab Child Dev & Learning Sci, Nanjing, Jiangsu, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Emotion Recognition; Spatio-Temporal Information; Convolutional Neural Networks (CNN); Long Short-Term Memory (LSTM); 3D Convolutional Neural Networks (3D CNN); CLASSIFICATION;
DOI
10.1145/3242969.3264992
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
The difficulty of emotion recognition in the wild (EmotiW) lies in training a robust model that can handle diverse scenarios and anomalies. The Audio-video Sub-challenge in EmotiW provides short audio-video clips, each annotated with one of several emotion labels, and the task is to predict the label of each video. For better emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework that depicts emotional information in the spatial and temporal dimensions more accurately by exploiting two mutually complementary sources: facial images and audio. The framework consists of two parts: a facial image model and an audio model. For the facial image model, three architectures of spatio-temporal neural networks are employed to extract features that discriminate between emotions in facial expression images. First, high-level spatial features are obtained by pre-trained convolutional neural networks (CNNs), VGG-Face and ResNet-50, each fed with the face images extracted from each video. Then, the per-frame features are input sequentially to a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variation of facial appearance textures across a video. In addition to this CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) that extends the 2D convolution kernel to 3D, is applied to capture evolving emotional information encoded in multiple adjacent frames. For the audio model, spectrogram images generated by preprocessing the speech are likewise modeled in a VGG-BLSTM framework to characterize affective fluctuation. Finally, a fusion strategy over the score matrices produced by the different spatio-temporal networks is proposed to boost recognition performance in a complementary manner. Extensive experiments show that the overall accuracy of the proposed MSFF reaches 60.64%, a large improvement over the baseline that also outperforms the 2017 champion team's result.
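As a concrete illustration of the CNN-BLSTM branch described in the abstract, the following minimal PyTorch sketch (not from the paper) extracts per-frame features with a pre-trained ResNet-50 and feeds the frame sequence to a bidirectional LSTM. The ImageNet-pretrained torchvision ResNet-50 stands in for the face-pretrained VGG-Face/ResNet-50 the authors use, and the 7 output classes assume the EmotiW/AFEW emotion labels; all names here are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnBlstm(nn.Module):
    """Hypothetical CNN-BLSTM sketch: per-frame CNN features -> BLSTM -> emotion logits.

    Stand-in for the paper's VGG-Face/ResNet-50 + BLSTM branch; torchvision's
    ImageNet weights replace the face-pretrained CNNs used by the authors.
    """
    def __init__(self, num_classes=7, hidden=128):  # 7 classes assumed (AFEW labels)
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Drop the final fc layer; keep the 2048-d pooled feature extractor.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.blstm = nn.LSTM(2048, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, 224, 224)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))      # (B*T, 2048, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)   # (B, T, 2048)
        out, _ = self.blstm(feats)                # (B, T, 2*hidden)
        return self.fc(out[:, -1])                # logits from the last time step

logits = CnnBlstm()(torch.randn(2, 16, 3, 224, 224))  # two 16-frame face clips
print(logits.shape)  # torch.Size([2, 7])
```

The audio branch follows the same pattern, with spectrogram images in place of face crops and a VGG-style backbone in place of ResNet-50.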
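The final score-level fusion can be sketched in the same spirit. The abstract only states that the score matrices of the different networks are fused; the uniform weighting below is an illustrative assumption, not the paper's reported strategy.

```python
import numpy as np

def fuse_scores(score_mats, weights=None):
    """Late-fusion sketch: weighted average of per-branch score matrices.

    score_mats: list of (N, C) arrays, each row a softmax distribution over
    the C emotion classes from one branch (e.g. CNN-BLSTM, 3D CNN, audio
    VGG-BLSTM). Uniform weights are an assumption for illustration.
    """
    if weights is None:
        weights = [1.0 / len(score_mats)] * len(score_mats)
    fused = sum(w * s for w, s in zip(weights, score_mats))
    return fused.argmax(axis=1)   # predicted emotion label per clip

# Usage: fuse three branches' scores for 5 clips over 7 classes.
preds = fuse_scores([np.random.dirichlet(np.ones(7), size=5) for _ in range(3)])
print(preds)  # e.g. array([3, 0, 6, 2, 1])
```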
Pages: 646-652
Number of pages: 7