50 records in total
- [21] Twins: Revisiting the Design of Spatial Attention in Vision Transformers. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021.
- [23] From Multimodal to Unimodal Attention in Transformers Using Knowledge Distillation. 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS 2021), 2021.
- [24] Are You Paying Attention? Multimodal Linear Attention Transformers for Affect Prediction in Video Conversations. Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing (MRAC 2024), 2024: 15-23.
- [25] Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023: 3973-3981.
- [26] An Attention-Based Token Pruning Method for Vision Transformers. Rough Sets (IJCRS 2022), 2022, 13633: 274-288.
- [27] RAWAtten: Reconfigurable Accelerator for Window Attention in Hierarchical Vision Transformers. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2023.
- [28] Focal Attention for Long-Range Interactions in Vision Transformers. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021, 34.
- [29] Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective. Thirty-Sixth AAAI Conference on Artificial Intelligence / Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence / Twelfth Symposium on Educational Advances in Artificial Intelligence, 2022: 11248-11257.
- [30] Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers. 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), 2021: 9847-9857.