Multimodal Vision Transformers with Forced Attention for Behavior Analysis

Cited by: 4
Authors
Agrawal, Tanay [1 ]
Balazia, Michal [1 ]
Muller, Philipp [2 ]
Bremond, Francois [1 ]
Affiliations
[1] INRIA, Valbonne, France
[2] DFKI, Saarbrucken, Germany
Keywords
PERSONALITY; JUDGMENTS;
DOI
10.1109/WACV56688.2023.00339
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Human behavior understanding requires looking at minute details in the large context of a scene containing multiple input modalities. Such understanding is necessary because it enables the design of more human-like machines. While transformer approaches have shown great improvements, they face multiple challenges such as lack of data and background noise. To tackle these, we introduce the Forced Attention (FAt) Transformer, which applies forced attention with a modified backbone for input encoding and makes use of additional inputs. In addition to improving performance on different tasks and inputs, the modification requires less time and memory. We provide a model for generalised feature extraction for tasks concerning social signals and behavior analysis. Our focus is on understanding behavior in videos where people are interacting with each other or talking into the camera, which simulates the first-person point of view in social interaction. FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition. We achieve state-of-the-art results on the UDIVA v0.5, First Impressions v2 and MPII Group Interaction datasets. We further provide an extensive ablation study of the proposed architecture.
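The abstract does not spell out the mechanism, but "forced attention" of this kind is commonly realised by biasing the attention logits toward tokens inside a region of interest (e.g. a person or face mask) before the softmax. A minimal NumPy sketch under that assumption — the function name `forced_attention`, the additive `bias`, and the `region_mask` argument are illustrative, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forced_attention(q, k, v, region_mask, bias=2.0):
    """Scaled dot-product attention with a region bias.

    q, k, v: arrays of shape (tokens, dim).
    region_mask: shape (tokens,), 1.0 for key tokens inside the
    region of interest (e.g. a person segmentation mask), else 0.0.
    A positive bias added to those columns "forces" each query to
    attend more to in-region tokens.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # standard attention logits
    scores = scores + bias * region_mask  # boost in-region key columns
    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

Setting `bias=0.0` recovers plain scaled dot-product attention, so the region bias is a strict generalisation rather than a replacement of the usual mechanism.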
Pages: 3381 - 3391
Page count: 11
Related Papers
50 records in total
  • [1] Vision Transformers with Hierarchical Attention
    Liu, Yun
    Wu, Yu-Huan
    Sun, Guolei
    Zhang, Le
    Chhatkuli, Ajad
    Van Gool, Luc
    MACHINE INTELLIGENCE RESEARCH, 2024, 21 (04) : 670 - 683
  • [2] Constituent Attention for Vision Transformers
    Li, Haoling
    Xue, Mengqi
    Song, Jie
    Zhang, Haofei
    Huang, Wenqi
    Liang, Lingyu
    Song, Mingli
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 237
  • [3] Multimodal Token Fusion for Vision Transformers
    Wang, Yikai
    Chen, Xinghao
    Cao, Lele
    Huang, Wenbing
    Sun, Fuchun
    Wang, Yunhe
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12176 - 12185
  • [4] Robustifying Token Attention for Vision Transformers
    Guo, Yong
    Stutz, David
    Schiele, Bernt
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 17511 - 17522
  • [5] Efficient Vision Transformers with Partial Attention
    Vo, Xuan-Thuy
    Nguyen, Duy-Linh
    Priadana, Adri
    Jo, Kang-Hyun
    COMPUTER VISION - ECCV 2024, PT LXXXIII, 2025, 15141 : 298 - 317
  • [6] Fast Vision Transformers with HiLo Attention
    Pan, Zizheng
    Cai, Jianfei
    Zhuang, Bohan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022
  • [7] DaViT: Dual Attention Vision Transformers
    Ding, Mingyu
    Xiao, Bin
    Codella, Noel
    Luo, Ping
    Wang, Jingdong
    Yuan, Lu
    COMPUTER VISION, ECCV 2022, PT XXIV, 2022, 13684 : 74 - 92
  • [8] ViTDroid: Vision Transformers for Efficient, Explainable Attention to Malicious Behavior in Android Binaries
    Syed, Toqeer Ali
    Nauman, Mohammad
    Khan, Sohail
    Jan, Salman
    Zuhairi, Megat F.
    SENSORS, 2024, 24 (20)
  • [9] Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers
    Sahiner, Arda
    Ergen, Tolga
    Ozturkler, Batu
    Pauly, John
    Mardani, Morteza
    Pilanci, Mert
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022, : 19050 - 19088
  • [10] AttnZero: Efficient Attention Discovery for Vision Transformers
    Li, Lujun
    Wei, Zimian
    Dong, Peijie
    Luo, Wenhan
    Xue, Wei
    Liu, Qifeng
    Guo, Yike
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 20 - 37