Husformer: A Multimodal Transformer for Multimodal Human State Recognition

Times Cited: 0
Authors
Wang, Ruiqi [1 ]
Jo, Wonse [1 ]
Zhao, Dezhong [2 ]
Wang, Weizheng [1 ]
Gupte, Arjun [1 ]
Yang, Baijian [1 ]
Chen, Guohua [2 ]
Min, Byung-Cheol [1 ]
Affiliations
[1] Purdue Univ, Dept Comp & Informat Technol, W Lafayette, IN 47907 USA
[2] Beijing Univ Chem Technol, Coll Mech & Elect Engn, Beijing 100029, Peoples R China
Funding
National Science Foundation (USA)
Keywords
Cognitive load recognition; cross-modal attention; emotion prediction; multimodal fusion; transformer
DOI
10.1109/TCDS.2024.3357618
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Human state recognition is a critical topic with pervasive and important applications in human-machine systems. Multimodal fusion, which integrates metrics from multiple data sources, has proven to be a potent method for boosting recognition performance. Although recent multimodal models have shown promising results, they often fail to fully exploit the sophisticated fusion strategies needed to model adequate cross-modal dependencies in the fusion representation, relying instead on costly and inconsistent feature crafting and alignment. To address this limitation, we propose Husformer, an end-to-end multimodal transformer framework for human state recognition. Specifically, we use cross-modal transformers, in which one modality reinforces itself by directly attending to the latent relevance revealed in other modalities, to fuse the modalities with full awareness of their cross-modal interactions. A self-attention transformer then further prioritizes contextual information in the fusion representation. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive load datasets (the Multimodal Dataset for Objective Cognitive Workload Assessment on Simultaneous Tasks, MOCAS, and CogLoad) demonstrate that Husformer outperforms both state-of-the-art multimodal baselines and single-modality recognition by a large margin, especially when dealing with raw multimodal features. An ablation study demonstrates the benefit of each component of Husformer. Experimental details and source code are available at https://github.com/SMARTlab-Purdue/Husformer.
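To make the two-stage fusion scheme described above concrete, the following minimal PyTorch sketch is reconstructed from the abstract alone: cross-modal attention blocks let each target modality draw queries against keys and values from every other modality, and a self-attention transformer then refines the concatenated fusion representation. This is an illustrative sketch, not the authors' implementation (see the linked repository for the real code); all class names, dimensions, and hyperparameters here are hypothetical.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One cross-modal transformer step (sketch): queries come from the
    target modality, keys/values from a source modality, so the target
    reinforces itself with the source's latent relevance."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target, source):
        # target: (batch, T_tgt, d_model); source: (batch, T_src, d_model)
        reinforced, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + reinforced)  # residual connection + norm

class HusformerSketch(nn.Module):
    """Hypothetical end-to-end sketch: every ordered modality pair gets a
    cross-modal block; the concatenated result is refined by a
    self-attention transformer and pooled for classification."""
    def __init__(self, n_modalities: int, d_model: int, n_classes: int):
        super().__init__()
        n_pairs = n_modalities * (n_modalities - 1)
        self.cross = nn.ModuleList(CrossModalBlock(d_model) for _ in range(n_pairs))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.self_attn = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, feats):
        # feats[i]: (batch, T_i, d_model), already projected to a shared width
        reinforced, k = [], 0
        for i, tgt in enumerate(feats):
            for j, src in enumerate(feats):
                if i != j:
                    reinforced.append(self.cross[k](tgt, src))
                    k += 1
        fused = torch.cat(reinforced, dim=1)   # fusion representation
        fused = self.self_attn(fused)          # contextual self-attention
        return self.head(fused.mean(dim=1))    # pooled class logits

# Usage: three modalities with different sequence lengths.
model = HusformerSketch(n_modalities=3, d_model=32, n_classes=4)
x = [torch.randn(8, t, 32) for t in (50, 30, 20)]
print(model(x).shape)  # torch.Size([8, 4])

The pairwise layout above (one block per ordered modality pair) is one plausible reading of "each modality attends to the others"; the published model may organize its cross-modal transformers differently.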
Pages: 1374-1390
Page Count: 17
Related Papers (50 total)
  • [1] Sun, Yaohui; Xu, Weiyao; Gao, Ju; Yu, Xiaoyi. Multimodal Fusion for Human Action Recognition via Spatial Transformer. 2023 35th Chinese Control and Decision Conference (CCDC), 2023: 1638-1641.
  • [2] Ijaz, Momal; Diaz, Renato; Chen, Chen. Multimodal Transformer for Nursing Activity Recognition. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022: 2064-2073. (Also available as an arXiv preprint, 2022.)
  • [3] Li, Jingcheng; Yao, Lina; Li, Binghao; Wang, Xianzhi; Sammut, Claude. Multi-agent Transformer Networks for Multimodal Human Activity Recognition. Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM), 2022: 1135-1145.
  • [4] Liu, Zhen; Cheng, Qin; Song, Chengqun; Cheng, Jun. Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters, 2023, 168: 17-23.
  • [5] Huang, Jian; Tao, Jianhua; Liu, Bin; Lian, Zheng; Niu, Mingyue. Multimodal Transformer Fusion for Continuous Emotion Recognition. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020: 3507-3511.
  • [6] Liu, Dan; Song, Wei; Zhao, Xiaobing. Pedestrian Attribute Recognition Based on Multimodal Transformer. Pattern Recognition and Computer Vision (PRCV 2023), Part I, 2024, 14425: 422-433.
  • [7] Yao, Shaowei; Wan, Xiaojun. Multimodal Transformer for Multimodal Machine Translation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020: 4346-4350.
  • [8] Zong, Daoming; Ding, Chaoyue; Li, Baoxiang; Zhou, Dinghao; Li, Jiakui; Zheng, Ken; Zhou, Qunyan. Building Robust Multimodal Sentiment Recognition via a Simple yet Effective Multimodal Transformer. Proceedings of the 31st ACM International Conference on Multimedia (MM), 2023: 9596-9600.
  • [9] Feghoul, Kevin; Maia, Deise Santana; El Amrani, Mehdi; Daoudi, Mohamed; Amad, Ali. MGRFormer: A Multimodal Transformer Approach for Surgical Gesture Recognition. 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), 2024.