A joint hierarchical cross-attention graph convolutional network for multi-modal facial expression recognition

Cited by: 1
Authors
Xu, Chujie [1 ]
Du, Yong [1 ]
Wang, Jingzi [2 ]
Zheng, Wenjie [1 ]
Li, Tiejun [1 ]
Yuan, Zhansheng [1 ]
Affiliations
[1] Jimei Univ, Sch Ocean Informat Engn, Xiamen, Peoples R China
[2] Natl Chengchi Univ, Dept Comp Sci, Chengchi, Taiwan
Keywords
cross-attention mechanism; emotional recognition in conversations; graph convolution network; IoT; multi-modal fusion; transformer; emotion recognition; valence
DOI
10.1111/coin.12607
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Emotional recognition in conversations (ERC) is increasingly being applied in various IoT devices. Deep learning-based multimodal ERC has achieved great success by leveraging diverse and complementary modalities. Although most existing methods adopt attention mechanisms to fuse information from different modalities, they ignore the complementarity between modalities. The joint cross-attention model was introduced to alleviate this issue; however, it does not exploit multi-scale feature information across modalities. Moreover, contextual relationships play an important role in feature extraction for the expression recognition task. In this paper, we propose a novel joint hierarchical graph convolution network (JHGCN) that exploits features from different layers together with contextual relationships for facial expression recognition based on audio-visual (A-V) information. Specifically, we adopt separate deep networks to extract features from each modality. For the visual modality, we construct graph data from the patch embeddings produced by the transformer encoder, and we embed a graph convolution, which captures intra-modality relationships, into the transformer encoder. The deep features from different layers are then fed into a hierarchical fusion module to enhance the feature representation. Finally, a joint cross-attention mechanism exploits the complementary inter-modality relationships. To validate the proposed model, we conducted extensive experiments on the AffWild2 and CMU-MOSI datasets. The results confirm that our model achieves highly promising performance compared to the joint cross-attention model and other methods.
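The abstract's final fusion step can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch implementation of joint A-V cross-attention, assuming both modality encoders emit sequences of the same feature dimension; the class and parameter names (JointCrossAttentionFusion, dim, attend) are illustrative and are not taken from the paper's actual code.

```python
# Illustrative sketch (not the authors' implementation) of joint A-V
# cross-attention fusion: each modality attends to the joint
# (concatenated) audio-visual representation, as described in the abstract.
import torch
import torch.nn as nn


class JointCrossAttentionFusion(nn.Module):
    """Fuse audio and visual sequences via attention over their joint representation."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Query projections per modality; key/value projections for the joint features.
        self.q_audio = nn.Linear(dim, dim)
        self.q_visual = nn.Linear(dim, dim)
        self.k_joint = nn.Linear(dim, dim)
        self.v_joint = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def attend(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Standard scaled dot-product attention.
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        return torch.matmul(torch.softmax(scores, dim=-1), v)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, seq_len, dim) features from the modality encoders.
        joint = torch.cat([audio, visual], dim=1)               # (batch, 2*seq_len, dim)
        k, v = self.k_joint(joint), self.v_joint(joint)
        audio_att = self.attend(self.q_audio(audio), k, v)      # audio attends to joint
        visual_att = self.attend(self.q_visual(visual), k, v)   # visual attends to joint
        # Residual connections preserve the original per-modality information.
        return torch.cat([audio + audio_att, visual + visual_att], dim=-1)


if __name__ == "__main__":
    fusion = JointCrossAttentionFusion(dim=512)
    a = torch.randn(2, 16, 512)   # dummy audio features
    v = torch.randn(2, 16, 512)   # dummy visual features
    print(fusion(a, v).shape)     # torch.Size([2, 16, 1024])
```

In this sketch the fused output simply concatenates the attended modalities; the paper's hierarchical fusion of multi-layer features and the graph convolution embedded in the transformer encoder are outside the scope of this example.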
Pages: 18