A spatio-temporal integrated model based on local and global features for video expression recognition

Cited by: 7
Authors
Hu, Min [1, 2]
Ge, Peng [1, 2]
Wang, Xiaohua [1, 2]
Lin, Hui [3]
Ren, Fuji [2, 4]
Affiliations
[1] Hefei Univ Technol, Key Lab Knowledge Engn Big Data, Minist Educ, Hefei 230601, Peoples R China
[2] Hefei Univ Technol, Sch Comp & Informat, Anhui Prov Key Lab Affect Comp & Adv Intelligent, Hefei 230601, Peoples R China
[3] Hefei Univ Technol, Sch Elect Sci & Applicat Phys, Hefei 230601, Peoples R China
[4] Univ Tokushima, Grad Sch Adv Technol & Sci, Tokushima 7708502, Japan
Source
VISUAL COMPUTER | 2022 / Vol. 38 / Issue 08
Funding
National Natural Science Foundation of China
Keywords
Video expression recognition; Local and global features; Attention mechanism; Feature recalibration; Network integration; NETWORK; SCALE;
DOI
10.1007/s00371-021-02136-z
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202 ; 0835 ;
Abstract
Facial expressions can be represented largely by the dynamic variations of the key expressive parts of the face, i.e., the eyebrows, eyes, nose, and mouth. The features of these parts are regarded as local features. However, global facial information is also useful for recognition because it is a necessary complement to local features. In this paper, a spatio-temporal integrated model that jointly learns local and global features is proposed for video expression recognition. First, to capture the action of facial key units, a spatio-temporal attention part-gradient-based hierarchical bidirectional recurrent neural network (spatio-temporal attention PGHRNN) is constructed. It captures the dynamic variations of gradients around facial landmark points. In addition, a new spatial attention mechanism is introduced to adaptively recalibrate the features of the various facial parts. Second, to complement the local features extracted by the spatio-temporal attention PGHRNN, a 50-layer squeeze-and-excitation residual network with a long short-term memory network (SE-ResNet-50-LSTM) is used as a global feature extractor and classifier. Finally, to integrate the local and global features and improve the performance of facial expression recognition, a joint adaptive fine-tuning method (JAFTM) is proposed to combine the two networks; it adaptively adjusts the network weights. Extensive experiments demonstrate that the proposed model achieves a recognition accuracy of 98.95% on CK+ for 7-class facial expressions and 85.40% on the MMI database, outperforming other state-of-the-art methods.
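The abstract names squeeze-and-excitation (SE) as the recalibration component of the global branch. As an illustration only, the following is a minimal pure-Python sketch of a generic SE channel-recalibration step (global average pooling, a two-layer gating function, then channel-wise scaling). It is not the authors' SE-ResNet-50-LSTM implementation; the function name `se_recalibrate`, the weight matrices `w1` and `w2`, and the toy inputs are assumptions introduced here, and biases are omitted for brevity.

```python
import math


def sigmoid(x):
    """Logistic function used to squash gate values into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def se_recalibrate(feature_maps, w1, w2):
    """Generic squeeze-and-excitation step (illustrative, hypothetical names).

    feature_maps: list of C channels, each an H x W nested list of floats.
    w1: (C // r) x C weight matrix of the first fully connected layer.
    w2: C x (C // r) weight matrix of the second fully connected layer.
    Returns (gates, scaled_maps).
    """
    # Squeeze: global average pooling collapses each channel to one scalar.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_maps
    ]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: every value in a channel is multiplied by that channel's gate.
    scaled = [
        [[v * g for v in row] for row in ch]
        for ch, g in zip(feature_maps, gates)
    ]
    return gates, scaled


# Toy input: two 2x2 channels with made-up weights (reduction ratio r = 2).
maps = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
gates, scaled = se_recalibrate(maps, [[1.0, -1.0]], [[1.0], [-1.0]])
```

In a trained network the gates would differ per channel, so informative channels are emphasized and less useful ones are suppressed; the toy weights above are chosen only to keep the arithmetic easy to follow.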
Pages: 2617-2634
Page count: 18