SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition

Cited by: 6
Authors
Zhang, Xing [1 ]
Wu, Zuxuan [2 ]
Jiang, Yu-Gang [2 ]
Affiliations
[1] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Video recognition; scene; object; feature fusion; semantics attention; LATE FUSION;
DOI
10.1109/TMM.2021.3050058
Chinese Library Classification
TP [automation technology; computer technology];
Discipline code
0812;
Abstract
Video recognition aims at understanding semantic contents that normally involve the interactions of humans and related objects under certain scenes. A common practice for improving recognition accuracy is to combine object, scene and action features directly for classification, assuming that they are explicitly complementary. In this paper, we break down the fusion of the three features into two pairwise feature-relation modeling processes, which mitigates the difficulty of learning correlations in high-dimensional features. Towards this goal, we introduce a Semantics Attention Module (SAM) that captures the relations of a pair of features by refining the relatively "weak" feature with guidance from the "strong" feature using attention mechanisms. The refined representation is further combined with the "strong" feature through a residual design for downstream tasks. Two SAMs are applied in a Semantics Attention Network (SAN) to improve video recognition. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet v1.3; the proposed approach achieves better results while requiring much less computational effort than alternative methods.
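The abstract's core idea, refining a "weak" feature under the guidance of a "strong" feature via attention and then fusing the result through a residual connection, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact architecture: the sigmoid gating, feature dimensions, and function name are all assumptions.

```python
import numpy as np

def semantics_attention_module(strong, weak):
    """Hedged sketch of a SAM-style pairwise fusion.

    Attention weights derived from the "strong" feature re-weight the
    "weak" feature; the refined result is added back to the strong
    feature through a residual connection. The sigmoid gate is an
    illustrative assumption, not the published design.
    """
    # Attention weights computed from the strong feature (sigmoid gate).
    attn = 1.0 / (1.0 + np.exp(-strong))
    # Refine the weak feature under the strong feature's guidance.
    refined = attn * weak
    # Residual fusion: strong feature plus the refined weak feature.
    return strong + refined

# Toy example: an action feature guiding a scene feature.
rng = np.random.default_rng(0)
action = rng.standard_normal(8)  # hypothetical "strong" feature
scene = rng.standard_normal(8)   # hypothetical "weak" feature
fused = semantics_attention_module(action, scene)
```

In the full SAN, two such modules would run in sequence so that all three feature types (scene, object, action) are fused pairwise rather than concatenated at once.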
Pages: 313-322
Page count: 10
Related papers
50 records in total
  • [41] View-Invariant Object Category Learning, Attention, Recognition, Search, and Scene Understanding
    Grossberg, Stephen
    [J]. IJCNN: 2009 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-6, 2009, : 3507 - 3509
  • [42] Symbiotic Attention for Egocentric Action Recognition With Object-Centric Alignment
    Wang, Xiaohan
    Zhu, Linchao
    Wu, Yu
    Yang, Yi
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 6605 - 6617
  • [43] An attention mechanism based convolutional LSTM network for video action recognition
    Ge, Hongwei
    Yan, Zehang
    Yu, Wenhao
    Sun, Liang
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (14) : 20533 - 20556
  • [44] CHANNEL-WISE TEMPORAL ATTENTION NETWORK FOR VIDEO ACTION RECOGNITION
    Lei, Jianjun
    Jia, Yalong
    Peng, Bo
    Huang, Qingming
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 562 - 567
  • [45] Imperceptible Adversarial Attack With Multigranular Spatiotemporal Attention for Video Action Recognition
    Wu, Guoming
    Xu, Yangfan
    Li, Jun
    Shi, Zhiping
    Liu, Xianglong
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2023, 10 (20) : 17785 - 17796
  • [46] Video action recognition method based on attention residual network and LSTM
    Zhang, Yu
    Dong, Pengyue
    [J]. PROCEEDINGS OF THE 33RD CHINESE CONTROL AND DECISION CONFERENCE (CCDC 2021), 2021, : 3611 - 3616
  • [47] Two-stream Graph Attention Convolutional for Video Action Recognition
    Zhang, Deyuan
    Gao, Hongwei
    Dai, Hailong
    Shi, Xiangbin
    [J]. 2021 IEEE 15TH INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING (BIGDATASE 2021), 2021, : 23 - 27
  • [49] CAST: Cross-Attention in Space and Time for Video Action Recognition
    Lee, Dongho
    Lee, Jongseo
    Choi, Jinwoo
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [50] Metric-Based Attention Feature Learning for Video Action Recognition
    Kim, Dae Ha
    Anvarov, Fazliddin
    Lee, Jun Min
    Song, Byung Cheol
    [J]. IEEE ACCESS, 2021, 9 : 39218 - 39228