Multimodal Vision Transformers with Forced Attention for Behavior Analysis

Cited by: 4
Authors
Agrawal, Tanay [1 ]
Balazia, Michal [1 ]
Muller, Philipp [2 ]
Bremond, Francois [1 ]
Affiliations
[1] INRIA, Valbonne, France
[2] DFKI, Saarbrucken, Germany
Keywords
PERSONALITY; JUDGMENTS;
DOI
10.1109/WACV56688.2023.00339
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Human behavior understanding requires looking at minute details in the large context of a scene containing multiple input modalities. Such understanding is necessary because it enables the design of more human-like machines. While transformer approaches have shown great improvements, they face multiple challenges such as lack of data and background noise. To tackle these, we introduce the Forced Attention (FAt) Transformer, which applies forced attention with a modified backbone for input encoding and makes use of additional inputs. In addition to improving performance on different tasks and inputs, the modification requires less time and memory. We provide a model for generalised feature extraction for tasks concerning social signals and behavior analysis. Our focus is on understanding behavior in videos where people are interacting with each other or talking into the camera, which simulates the first-person point of view in social interaction. FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition. We achieve state-of-the-art results on the UDIVA v0.5, First Impressions v2 and MPII Group Interaction datasets. We further provide an extensive ablation study of the proposed architecture.
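The abstract does not spell out the mechanism, but "forced attention" of this kind is commonly realised by biasing the attention logits toward tokens inside a region of interest (e.g. a person or face mask) before the softmax. A minimal NumPy sketch under that assumption — the function name `forced_attention`, the additive `bias`, and the `region_mask` argument are illustrative, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forced_attention(q, k, v, region_mask, bias=2.0):
    """Scaled dot-product attention with a region bias.

    q, k, v: arrays of shape (tokens, dim).
    region_mask: shape (tokens,), 1.0 for key tokens inside the
    region of interest (e.g. a person segmentation mask), else 0.0.
    A positive bias added to those columns "forces" each query to
    attend more to in-region tokens.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # standard attention logits
    scores = scores + bias * region_mask  # boost in-region key columns
    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

Setting `bias=0.0` recovers plain scaled dot-product attention, so the region bias is a strict generalisation rather than a replacement of the usual mechanism.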
Pages: 3381 - 3391
Page count: 11
Related Papers
50 records in total
  • [1] Vision Transformers with Hierarchical Attention
    Liu, Yun
    Wu, Yu-Huan
    Sun, Guolei
    Zhang, Le
    Chhatkuli, Ajad
    Van Gool, Luc
    MACHINE INTELLIGENCE RESEARCH, 2024, 21 (04) : 670 - 683
  • [2] Constituent Attention for Vision Transformers
    Li, Haoling
    Xue, Mengqi
    Song, Jie
    Zhang, Haofei
    Huang, Wenqi
    Liang, Lingyu
    Song, Mingli
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 237
  • [3] Multimodal Token Fusion for Vision Transformers
    Wang, Yikai
    Chen, Xinghao
    Cao, Lele
    Huang, Wenbing
    Sun, Fuchun
    Wang, Yunhe
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12176 - 12185
  • [4] Robustifying Token Attention for Vision Transformers
    Guo, Yong
    Stutz, David
    Schiele, Bernt
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 17511 - 17522
  • [5] Efficient Vision Transformers with Partial Attention
    Vo, Xuan-Thuy
    Nguyen, Duy-Linh
    Priadana, Adri
    Jo, Kang-Hyun
    COMPUTER VISION - ECCV 2024, PT LXXXIII, 2025, 15141 : 298 - 317
  • [6] Fast Vision Transformers with HiLo Attention
    Pan, Zizheng
    Cai, Jianfei
    Zhuang, Bohan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022
  • [7] DaViT: Dual Attention Vision Transformers
    Ding, Mingyu
    Xiao, Bin
    Codella, Noel
    Luo, Ping
    Wang, Jingdong
    Yuan, Lu
    COMPUTER VISION, ECCV 2022, PT XXIV, 2022, 13684 : 74 - 92
  • [8] ViTDroid: Vision Transformers for Efficient, Explainable Attention to Malicious Behavior in Android Binaries
    Syed, Toqeer Ali
    Nauman, Mohammad
    Khan, Sohail
    Jan, Salman
    Zuhairi, Megat F.
    SENSORS, 2024, 24 (20)
  • [9] Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers
    Sahiner, Arda
    Ergen, Tolga
    Ozturkler, Batu
    Pauly, John
    Mardani, Morteza
    Pilanci, Mert
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022, : 19050 - 19088
  • [10] AttnZero: Efficient Attention Discovery for Vision Transformers
    Li, Lujun
    Wei, Zimian
    Dong, Peijie
    Luo, Wenhan
    Xue, Wei
    Liu, Qifeng
    Guo, Yike
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 20 - 37