Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition

Cited by: 1
Authors
Li, Nianfeng [1]
Huang, Yongyuan [1]
Wang, Zhenyan [1]
Fan, Ziyao [1]
Li, Xinyuan [1]
Xiao, Zhiguo [1,2]
Affiliations
[1] Changchun Univ, Coll Food Sci & Engn, 6543 Satellite Rd, Changchun 130022, Peoples R China
[2] Beijing Inst Technol, Sch Comp Sci Technol, Beijing 100811, Peoples R China
Keywords
facial expression recognition; lightweight network; attention module; transformer; CONVOLUTIONAL NEURAL-NETWORK; ATTENTION;
DOI
10.3390/s24134153
Chinese Library Classification
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Convolutional neural networks (CNNs) have made significant progress in facial expression recognition (FER). However, owing to challenges such as occlusion, lighting variations, and changes in head pose, facial expression recognition in real-world environments remains highly challenging. In addition, methods based solely on CNNs rely heavily on local spatial features, lack global information, and struggle to balance computational complexity against recognition accuracy; consequently, CNN-based models still fall short of addressing FER adequately. To address these issues, we propose a lightweight facial expression recognition method based on a hybrid vision transformer. The method captures multi-scale facial features through an improved attention module, achieving richer feature integration, enhancing the network's perception of key facial expression regions, and improving its feature extraction capability. To further improve performance, we also design a patch dropping (PD) module, which emulates the attention allocation mechanism of the human visual system for local features, guiding the network to focus on the most discriminative features, reducing the influence of irrelevant features, and directly lowering computational cost. Extensive experiments demonstrate that our approach significantly outperforms other methods, achieving an accuracy of 86.51% on RAF-DB and nearly 70% on FER2013, with a model size of only 3.64 MB. These results show that our method offers a new perspective for the field of facial expression recognition.
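To illustrate the patch dropping idea described in the abstract, the following is a minimal, hypothetical PyTorch sketch rather than the authors' implementation. It assumes patch tokens of shape (batch, num_patches, dim) and a per-patch importance score (e.g., class-token attention averaged over heads); only the top-k most salient tokens are kept before the subsequent transformer blocks, which is what reduces the attention computation.

```python
# Hypothetical sketch of a patch dropping (PD) step, not the paper's released code.
# Assumption: each patch token already has an importance score (e.g., the CLS-token
# attention averaged over heads); we keep the top-k tokens and drop the rest.
import torch

def drop_patches(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.7):
    """tokens: (B, N, D) patch embeddings; scores: (B, N) per-patch importance."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))                # number of patches to keep
    topk = scores.topk(k, dim=1).indices           # (B, k) indices of salient patches
    topk, _ = topk.sort(dim=1)                     # preserve the original patch order
    idx = topk.unsqueeze(-1).expand(-1, -1, D)     # (B, k, D) gather indices
    return tokens.gather(dim=1, index=idx)         # (B, k, D) retained tokens

# Usage: fewer tokens means less attention computation in later transformer blocks.
if __name__ == "__main__":
    x = torch.randn(2, 196, 192)       # e.g., 14x14 patches, 192-dim embeddings
    s = torch.rand(2, 196)             # stand-in importance scores
    print(drop_patches(x, s).shape)    # torch.Size([2, 137, 192]) with keep_ratio=0.7
```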
Pages: 18
Related Papers
50 records in total (entries 21–30 shown)
  • [21] A fish counting model based on pyramid vision transformer with multi-scale feature enhancement. Xin, Jiaming; Wang, Yiying; Li, Dashe; Xiang, Zhongliang. ECOLOGICAL INFORMATICS, 2025, 86
  • [22] Matching Multi-Scale Feature Sets in Vision Transformer for Few-Shot Classification. Song, Mingchen; Yao, Fengqin; Zhong, Guoqiang; Ji, Zhong; Zhang, Xiaowei. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (12): 12638-12651
  • [23] Facial Expression Recognition by Multi-Scale CNN with Regularized Center Loss. Li, Zhenghao; Wu, Song; Xiao, Guoqiang. 2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018: 3384-3389
  • [24] Multi-Scale Integrated Attention Mechanism for Facial Expression Recognition Network. Luo, Sishi; Li, Maojun; Chen, Man. Computer Engineering and Applications, 2023, 59 (01): 199-206
  • [25] Facial Expression Recognition Based on Squeeze Vision Transformer. Kim, Sangwon; Nam, Jaeyeal; Ko, Byoung Chul. SENSORS, 2022, 22 (10)
  • [26] Facial Expression Recognition Method Based on Multi-scale Detail Enhancement. Tan, Xiaohui; Li, Zhaowei; Fan, Yachun. Dianzi Yu Xinxi Xuebao / Journal of Electronics & Information Technology, 2019, 41 (11): 2752-2759
  • [27] Facial Expression Recognition Method Based on Multi-scale Detail Enhancement. Tan, Xiaohui; Li, Zhaowei; Fan, Yachun. JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2019, 41 (11): 2752-2759
  • [28] Multi-Scale Coordinate Attention Pyramid Convolution for Facial Expression Recognition. Ni, Jinyuan; Zhang, Jianxun. Computer Engineering and Applications, 2023, 59 (22): 242-250
  • [29] Facial Micro-Expression Recognition Enhanced by Score Fusion and a Hybrid Model from Convolutional LSTM and Vision Transformer. Zheng, Yufeng; Blasch, Erik. SENSORS, 2023, 23 (12)
  • [30] Transformer-Based Multi-Scale Feature Integration Network for Video Saliency Prediction. Zhou, Xiaofei; Wu, Songhe; Shi, Ran; Zheng, Bolun; Wang, Shuai; Yin, Haibing; Zhang, Jiyong; Yan, Chenggang. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (12): 7696-7707