SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition

Cited by: 6
Authors
Zhang, Xing [1 ]
Wu, Zuxuan [2 ]
Jiang, Yu-Gang [2 ]
Affiliations
[1] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Video recognition; scene; object; feature fusion; semantics attention; LATE FUSION;
DOI
10.1109/TMM.2021.3050058
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Video recognition aims at understanding semantic contents that normally involve the interactions of humans and related objects under certain scenes. A common practice to improve recognition accuracy is to combine object, scene and action features for classification directly, assuming that they are explicitly complementary. In this paper, we break down the fusion of the three features into two pairwise feature relation modeling processes, which mitigates the difficulty of correlation learning in high-dimensional features. Towards this goal, we introduce a Semantics Attention Module (SAM) that captures the relations of a pair of features by refining the relatively "weak" feature with guidance from the "strong" feature using attention mechanisms. The refined representation is further combined with the "strong" feature using a residual design for downstream tasks. Two SAMs are applied in a Semantics Attention Network (SAN) for improving video recognition. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet v1.3; the proposed approach achieves better results while requiring much less computational effort than alternative methods.
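The pairwise refinement the abstract describes can be sketched roughly as follows: an attention gate is computed from the "strong" feature, used to reweight the "weak" feature, and the result is added back residually. This is a minimal illustrative sketch only; the gating form, the single projection `W`, and the use of plain NumPy are assumptions, and the paper's actual attention design may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantics_attention_module(strong, weak, W):
    """Illustrative sketch of a SAM-style refinement (assumed form):
    compute channel-wise attention weights from the strong feature,
    reweight the weak feature, then combine with a residual add."""
    gate = sigmoid(W @ strong)   # attention weights derived from the strong feature
    refined = gate * weak        # weak feature reweighted under that guidance
    return strong + refined     # residual combination with the strong feature

# Toy usage with random feature vectors (dimension chosen arbitrarily)
rng = np.random.default_rng(0)
d = 8
strong = rng.standard_normal(d)   # e.g. an action feature
weak = rng.standard_normal(d)     # e.g. a scene or object feature
W = 0.1 * rng.standard_normal((d, d))
out = semantics_attention_module(strong, weak, W)
print(out.shape)  # → (8,)
```

A network in this spirit would apply two such modules (one per feature pair) and feed the combined representation to a classifier; in practice `W` would be learned and the features would come from pretrained backbones.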
Pages: 313 - 322
Page count: 10
Related Papers
50 records
  • [21] Recurrent Region Attention and Video Frame Attention Based Video Action Recognition Network Design
    Zhao, Zi-Yu
    [J]. Chinese Institute of Electronics, 48 : 1052 - 1061
  • [22] Content-Attention Representation by Factorized Action-Scene Network for Action Recognition
    Hou, Jingyi
    Wu, Xinxiao
    Sun, Yuchao
    Jia, Yunde
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (06) : 1537 - 1547
  • [23] Object, Scene and Actions: Combining Multiple Features for Human Action Recognition
    Ikizler-Cinbis, Nazli
    Sclaroff, Stan
    [J]. COMPUTER VISION-ECCV 2010, PT I, 2010, 6311 : 494 - 507
  • [24] FactorNet: Holistic Actor, Object, and Scene Factorization for Action Recognition in Videos
    Nigam, Nitika
    Dutta, Tanima
    Gupta, Hari Prabhat
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (03) : 976 - 991
  • [25] MORAN: A Multi-Object Rectified Attention Network for scene text recognition
    Luo, Canjie
    Jin, Lianwen
    Sun, Zenghui
    [J]. PATTERN RECOGNITION, 2019, 90 : 109 - 118
  • [26] Temporal U-Nets for Video Summarization with Scene and Action Recognition
    Kwon, Heeseung
    Shim, Woohyun
    Cho, Minsu
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1541 - 1544
  • [27] Attention based consistent semantic learning for micro-video scene recognition
    Guo, Jie
    Nie, Xiushan
    Ma, Yuling
    Shaheed, Kashif
    Ullah, Inam
    Yin, Yilong
    [J]. INFORMATION SCIENCES, 2021, 543 : 504 - 516
  • [28] Interpretable Spatio-temporal Attention for Video Action Recognition
    Meng, Lili
    Zhao, Bo
    Chang, Bo
    Huang, Gao
    Sun, Wei
    Tung, Frederich
    Sigal, Leonid
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1513 - 1522
  • [29] Human Skeleton Graph Attention Convolutional for Video Action Recognition
    Zhang, Deyuan
    Gao, Hongwei
    Dai, Hailong
    Shi, Xiangbin
    [J]. 2020 5TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE, COMPUTER TECHNOLOGY AND TRANSPORTATION (ISCTT 2020), 2020, : 183 - 187
  • [30] Multipath Attention and Adaptive Gating Network for Video Action Recognition
    Haiping Zhang
    Zepeng Hu
    Dongjin Yu
    Liming Guan
    Xu Liu
    Conghao Ma
    [J]. Neural Processing Letters, 56