Multimodal Vision Transformers with Forced Attention for Behavior Analysis

Cited: 4
Authors
Agrawal, Tanay [1]
Balazia, Michal [1]
Muller, Philipp [2]
Bremond, Francois [1]
Affiliations
[1] INRIA, Valbonne, France
[2] DFKI, Saarbrucken, Germany
Keywords
PERSONALITY; JUDGMENTS;
DOI
10.1109/WACV56688.2023.00339
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Human behavior understanding requires looking at minute details within the large context of a scene containing multiple input modalities. Such understanding is necessary because it enables the design of more human-like machines. While transformer approaches have shown great improvements, they face multiple challenges such as lack of data and background noise. To tackle these, we introduce the Forced Attention (FAt) Transformer, which uses forced attention with a modified backbone for input encoding and makes use of additional inputs. Besides improving performance on different tasks and inputs, the modification requires less time and memory. We provide a model for generalised feature extraction for tasks concerning social signals and behavior analysis. Our focus is on understanding behavior in videos where people are interacting with each other or talking into the camera, which simulates the first-person point of view in social interaction. FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition. We achieve state-of-the-art results on the UDIVA v0.5, First Impressions v2 and MPII Group Interaction datasets. We further provide an extensive ablation study of the proposed architecture.
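For readers unfamiliar with the mechanism named in the abstract: "forced attention" can be read as biasing a transformer's self-attention toward regions of interest (e.g., the target person) and away from background clutter. The sketch below illustrates that reading in PyTorch by masking attention logits outside a region-of-interest mask before the softmax; the class name `ForcedAttention` and the `region_mask` argument are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of "forced attention" as masked self-attention:
# queries may only attend to key positions inside a region-of-interest
# mask (e.g., a person segmentation map downsampled to the token grid).
# This is an interpretation of the abstract, not the paper's exact API.
import torch
import torch.nn as nn

class ForcedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
        # x:           (B, N, C) token embeddings (N spatial tokens)
        # region_mask: (B, N) boolean, True where attention is allowed
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        # Force attention: block logits for key tokens outside the region.
        attn = attn.masked_fill(~region_mask[:, None, None, :], float("-inf"))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Usage example: a 4x4 patch grid where only the first 8 tokens
# (the hypothetical "person" region) may receive attention.
tokens = torch.randn(2, 16, 64)
mask = torch.zeros(2, 16, dtype=torch.bool)
mask[:, :8] = True
out = ForcedAttention(dim=64, num_heads=8)(tokens, mask)
print(out.shape)  # torch.Size([2, 16, 64])
```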
Pages: 3381-3391
Number of pages: 11
Related Papers
50 records in total
  • [21] Twins: Revisiting the Design of Spatial Attention in Vision Transformers
    Chu, Xiangxiang
    Tian, Zhi
    Wang, Yuqing
    Zhang, Bo
    Ren, Haibing
    Wei, Xiaolin
    Xia, Huaxia
    Shen, Chunhua
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [22] ResViT: Residual Vision Transformers for Multimodal Medical Image Synthesis
    Dalmaz, Onat
    Yurt, Mahmut
    Cukur, Tolga
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2022, 41 (10) : 2598 - 2614
  • [23] From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation
    Agarwal, Dhruv
    Agrawal, Tanay
    Ferrari, Laura M.
    Bremond, Francois
    2021 17TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS 2021), 2021,
  • [24] Are You Paying Attention? Multimodal Linear Attention Transformers for Affect Prediction in Video Conversations
    Poh, Jia Qing
    See, John
    El Gayar, Neamat
    Wong, Lai-Kuan
    PROCEEDINGS OF THE 2ND INTERNATIONAL WORKSHOP ON MULTIMODAL AND RESPONSIBLE AFFECTIVE COMPUTING, MRAC 2024, 2024, : 15 - 23
  • [25] Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets
    Chen, Xiangyu
    Hu, Qinghao
    Li, Kaidong
    Zhong, Cuncong
    Wang, Guanghui
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 3973 - 3981
  • [26] An Attention-Based Token Pruning Method for Vision Transformers
    Luo, Kaicheng
    Li, Huaxiong
    Zhou, Xianzhong
    Huang, Bing
    ROUGH SETS, IJCRS 2022, 2022, 13633 : 274 - 288
  • [27] RAWAtten: Reconfigurable Accelerator for Window Attention in Hierarchical Vision Transformers
    Li, Wantong
    Luo, Yandong
    Yu, Shimeng
    2023 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, DATE, 2023,
  • [28] Focal Attention for Long-Range Interactions in Vision Transformers
    Yang, Jianwei
    Li, Chunyuan
    Zhang, Pengchuan
    Dai, Xiyang
    Xiao, Bin
    Yuan, Lu
    Gao, Jianfeng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [29] Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective
    Salin, Emmanuelle
    Farah, Badreddine
    Ayache, Stephane
    Favre, Benoit
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 11248 - 11257
  • [30] Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
    Frank, Stella
    Bugliarello, Emanuele
    Elliott, Desmond
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 9847 - 9857