SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition

Cited by: 6

Authors:

Zhang, Xing [1]

Wu, Zuxuan [2]

Jiang, Yu-Gang [2]

Affiliations:

[1] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China

[2] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2022 / Vol. 24

Funding:

National Key Research and Development Program of China;

Keywords:

Video recognition; scene; object; feature fusion; semantics attention; LATE FUSION;

DOI:

10.1109/TMM.2021.3050058

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Video recognition aims at understanding semantic contents that normally involve the interactions of humans and related objects under certain scenes. A common practice to improve recognition accuracy is to combine object, scene and action features for classification directly, assuming that they are explicitly complementary. In this paper, we break down the fusion of three features into two pairwise feature relation modeling processes, which mitigates the difficulty of correlation learning in high dimensional features. Towards this goal, we introduce a Semantics Attention Module (SAM) that captures the relations of a pair of features by refining the relatively "weak" feature with the guidance from the "strong" feature using attention mechanisms. The refined representation is further combined with the "strong" feature using a residual design for downstream tasks. Two SAMs are applied in a Semantics Attention Network (SAN) for improving video recognition. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet v1.3. The proposed approach achieves better results while requiring much less computational effort than alternative methods.
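
The pairwise refinement described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the form of the attention, so a sigmoid gate computed from the "strong" feature is assumed here. The function names, the weight matrices `w_gate` and `w_proj`, and the choice of the action feature as the "strong" one in both modules are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def semantics_attention(strong, weak, w_gate, w_proj):
    """Hypothetical SAM-style block: refine `weak` under the guidance of `strong`.

    strong: (d_s,) feature treated as the reliable one
    weak:   (d_w,) feature to be refined
    w_gate: (d_w, d_s) assumed weights that turn `strong` into attention scores
    w_proj: (d_s, d_w) assumed weights that map the refined feature to d_s dims
    """
    attn = sigmoid(w_gate @ strong)   # one weight in (0, 1) per weak-feature channel
    refined = attn * weak             # re-weight the weak feature
    return strong + w_proj @ refined  # residual combination with the strong feature


def semantics_attention_network(action, obj, scene, params):
    """Two pairwise steps instead of one three-way fusion (ordering is assumed)."""
    fused = semantics_attention(action, obj, *params["object"])
    return semantics_attention(fused, scene, *params["scene"])


rng = np.random.default_rng(0)
action, obj, scene = rng.normal(size=8), rng.normal(size=5), rng.normal(size=3)
params = {
    "object": (rng.normal(size=(5, 8)), rng.normal(size=(8, 5))),
    "scene": (rng.normal(size=(3, 8)), rng.normal(size=(8, 3))),
}
video_feature = semantics_attention_network(action, obj, scene, params)
print(video_feature.shape)
```

Because of the residual connection, the output keeps the dimensionality of the "strong" feature, which is what allows the second module to be stacked on the output of the first in this sketch.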


Pages: 313-322

Number of pages: 10

