Action Recognition with Bootstrapping based Long-range Temporal Context Attention

Cited by: 7

Authors:

Liu, Ziming [1]

Gao, Guangyu [1]

Qin, A. K. [2]

Wu, Tong [1]

Liu, Chi Harold [1]

Affiliations:

[1] Beijing Institute of Technology, Beijing, People's Republic of China

[2] Swinburne University of Technology, Melbourne, Victoria, Australia

Source:

PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19) | 2019

Funding:

Australian Research Council; National Natural Science Foundation of China;

Keywords:

Action recognition; Context; Self-attention; Bootstrapping attention;

DOI:

10.1145/3343031.3350916

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

Actions involve complex visual variations across a long, redundant video sequence. Instead of focusing on a limited temporal range, i.e., convolution over adjacent frames, this paper proposes an action recognition approach with bootstrapping-based long-range temporal context attention. Specifically, because local regions vary visually across frames, we capture temporal context with the proposed Temporal Pixels based Parallel-head Attention (TPPA) block. In TPPA, we apply the self-attention mechanism between local regions at the same position across frames to capture their interactions. Meanwhile, to deal with video redundancy and capture long-range context, TPPA is extended to the Random Frames based Bootstrapping Attention (RFBA) framework. Because the bootstrap-sampled frames follow the same distribution as the whole video sequence, RFBA not only captures longer temporal context with only a few sampled frames but also obtains a comprehensive representation through multiple rounds of sampling. Furthermore, we apply this temporal context attention to image-based action recognition by transforming an image into a "pseudo video" through spatial shifts. Finally, we conduct extensive experiments and empirical evaluations on two popular datasets: UCF101 for videos and Stanford40 for images. In particular, our approach achieves a top-1 accuracy of 91.7% on UCF101 and an mAP of 90.9% on Stanford40.
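
The abstract describes two ideas that a small example may make more concrete: self-attention across frames at a fixed spatial position, and repeated random frame sampling to cover a long sequence cheaply. The following is only an illustrative sketch under stated assumptions, not the paper's TPPA block or RFBA framework, whose details are not given in this record. The function names (`temporal_attention`, `bootstrap_frame_indices`, `bootstrapped_context`) are hypothetical, queries, keys, and values are the raw features with no learned projections or parallel heads, and averaging over rounds is an assumed way to combine the samples.

```python
import math
import random


def temporal_attention(features):
    """Self-attention over time for ONE spatial position.

    features: list of T vectors (one per frame), each a list of C floats.
    Returns T context vectors, each a softmax-weighted mix of all frames.
    Assumption: raw features act as queries, keys, and values.
    """
    dim = len(features[0])
    out = []
    for query in features:
        # Scaled dot-product score of this frame against every frame.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, features))
                    for c in range(dim)])
    return out


def bootstrap_frame_indices(num_frames, sample_size, rng):
    """Draw frame indices uniformly with replacement, kept in temporal order."""
    return sorted(rng.randrange(num_frames) for _ in range(sample_size))


def bootstrapped_context(video, sample_size, rounds, seed=0):
    """Average attended features over several bootstrap rounds.

    video: list of T per-frame feature vectors for one spatial position.
    Assumption: rounds are combined by a plain average.
    """
    rng = random.Random(seed)
    dim = len(video[0])
    pooled = [0.0] * dim
    for _ in range(rounds):
        idx = bootstrap_frame_indices(len(video), sample_size, rng)
        attended = temporal_attention([video[i] for i in idx])
        for vec in attended:
            for c in range(dim):
                pooled[c] += vec[c] / (sample_size * rounds)
    return pooled
```

Each round attends over only `sample_size` frames, so the cost per round is quadratic in the sample size rather than in the full video length, while repeated rounds touch frames from across the whole sequence.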


Pages: 583-591

Page count: 9
