SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition

Cited by: 6
Authors
Zhang, Xing [1]
Wu, Zuxuan [2]
Jiang, Yu-Gang [2]
Affiliations
[1] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
Video recognition; scene; object; feature fusion; semantics attention; late fusion
DOI
10.1109/TMM.2021.3050058
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Video recognition aims at understanding semantic content that normally involves the interactions of humans and related objects in certain scenes. A common practice for improving recognition accuracy is to combine object, scene and action features directly for classification, assuming that they are explicitly complementary. In this paper, we break down the fusion of the three features into two pairwise feature relation modeling processes, which mitigates the difficulty of learning correlations among high-dimensional features. Towards this goal, we introduce a Semantics Attention Module (SAM) that captures the relations between a pair of features by refining the relatively "weak" feature under the guidance of the "strong" feature using attention mechanisms. The refined representation is then combined with the "strong" feature using a residual design for downstream tasks. Two SAMs are applied in a Semantics Attention Network (SAN) to improve video recognition. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet v1.3. The proposed approach achieves better results while requiring much less computation than alternative methods.
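The abstract gives the module only in outline, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes a sigmoid gate computed from the "strong" feature, a linear projection of the gated "weak" feature, and a residual sum. The class name `SemanticsAttentionSketch`, the weights `W_gate` and `W_proj`, the choice of the action feature as the "strong" one, and the chaining order of the two modules are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class SemanticsAttentionSketch:
    """Hypothetical pairwise module: a strong feature guides a weak one."""

    def __init__(self, strong_dim, weak_dim, rng):
        # Assumed form: the gate is a linear map of the strong feature.
        self.W_gate = rng.normal(0.0, 0.01, size=(strong_dim, weak_dim))
        # Assumed projection of the refined weak feature to strong_dim,
        # so that the residual sum below is well defined.
        self.W_proj = rng.normal(0.0, 0.01, size=(weak_dim, strong_dim))

    def __call__(self, strong, weak):
        # Attention weights in (0, 1), one per weak-feature channel.
        gate = sigmoid(strong @ self.W_gate)      # (batch, weak_dim)
        # Refine the weak feature under guidance of the strong one.
        refined = (gate * weak) @ self.W_proj     # (batch, strong_dim)
        # Residual design: the strong feature passes through unchanged.
        return strong + refined


def semantics_attention_network(action, obj, scene, sam_obj, sam_scene):
    """Two pairwise steps instead of one three-way fusion (assumed order)."""
    fused = sam_obj(action, obj)      # action guides the object feature
    return sam_scene(fused, scene)    # the result guides the scene feature


rng = np.random.default_rng(0)
action = rng.normal(size=(4, 16))   # toy dimensions, not from the paper
obj = rng.normal(size=(4, 10))
scene = rng.normal(size=(4, 6))
sam_obj = SemanticsAttentionSketch(16, 10, rng)
sam_scene = SemanticsAttentionSketch(16, 6, rng)
video_repr = semantics_attention_network(action, obj, scene, sam_obj, sam_scene)
```

In this sketch the residual sum means that a module whose projection weights are zero returns the "strong" feature unchanged, so each pairwise step can only add information on top of it. A classifier would then consume `video_repr`; that step is omitted here.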
Pages: 313-322
Page count: 10