SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition

Cited by: 6
Authors
Zhang, Xing [1 ]
Wu, Zuxuan [2 ]
Jiang, Yu-Gang [2 ]
Affiliations
[1] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Video recognition; scene; object; feature fusion; semantics attention; LATE FUSION;
DOI
10.1109/TMM.2021.3050058
Chinese Library Classification
TP [automation technology; computer technology];
Discipline code
0812;
Abstract
Video recognition aims at understanding semantic contents that normally involve the interactions of humans and related objects under certain scenes. A common practice for improving recognition accuracy is to combine object, scene and action features directly for classification, assuming that they are explicitly complementary. In this paper, we break down the fusion of the three features into two pairwise feature-relation modeling processes, which mitigates the difficulty of learning correlations in high-dimensional features. Towards this goal, we introduce a Semantics Attention Module (SAM) that captures the relations of a pair of features by refining the relatively "weak" feature with guidance from the "strong" feature using attention mechanisms. The refined representation is further combined with the "strong" feature through a residual design for downstream tasks. Two SAMs are applied in a Semantics Attention Network (SAN) to improve video recognition. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet v1.3; the proposed approach achieves better results while requiring much less computational effort than alternative methods.
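The abstract's core idea, refining a "weak" feature under the guidance of a "strong" feature via attention and then fusing the result through a residual connection, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact architecture: the sigmoid gating, feature dimensions, and function name are all assumptions.

```python
import numpy as np

def semantics_attention_module(strong, weak):
    """Hedged sketch of a SAM-style pairwise fusion.

    Attention weights derived from the "strong" feature re-weight the
    "weak" feature; the refined result is added back to the strong
    feature through a residual connection. The sigmoid gate is an
    illustrative assumption, not the published design.
    """
    # Attention weights computed from the strong feature (sigmoid gate).
    attn = 1.0 / (1.0 + np.exp(-strong))
    # Refine the weak feature under the strong feature's guidance.
    refined = attn * weak
    # Residual fusion: strong feature plus the refined weak feature.
    return strong + refined

# Toy example: an action feature guiding a scene feature.
rng = np.random.default_rng(0)
action = rng.standard_normal(8)  # hypothetical "strong" feature
scene = rng.standard_normal(8)   # hypothetical "weak" feature
fused = semantics_attention_module(action, scene)
```

In the full SAN, two such modules would run in sequence so that all three feature types (scene, object, action) are fused pairwise rather than concatenated at once.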
Pages: 313-322
Page count: 10
Related papers
50 records in total
  • [41] View-Invariant Object Category Learning, Attention, Recognition, Search, and Scene Understanding
    Grossberg, Stephen
    [J]. IJCNN: 2009 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-6, 2009, : 3507 - 3509
  • [42] Symbiotic Attention for Egocentric Action Recognition With Object-Centric Alignment
    Wang, Xiaohan
    Zhu, Linchao
    Wu, Yu
    Yang, Yi
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 6605 - 6617
  • [43] An attention mechanism based convolutional LSTM network for video action recognition
    Ge, Hongwei
    Yan, Zehang
    Yu, Wenhao
    Sun, Liang
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (14) : 20533 - 20556
  • [44] CHANNEL-WISE TEMPORAL ATTENTION NETWORK FOR VIDEO ACTION RECOGNITION
    Lei, Jianjun
    Jia, Yalong
    Peng, Bo
    Huang, Qingming
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 562 - 567
  • [45] Imperceptible Adversarial Attack With Multigranular Spatiotemporal Attention for Video Action Recognition
    Wu, Guoming
    Xu, Yangfan
    Li, Jun
    Shi, Zhiping
    Liu, Xianglong
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2023, 10 (20) : 17785 - 17796
  • [46] Video action recognition method based on attention residual network and LSTM
    Zhang, Yu
    Dong, Pengyue
    [J]. PROCEEDINGS OF THE 33RD CHINESE CONTROL AND DECISION CONFERENCE (CCDC 2021), 2021, : 3611 - 3616
  • [47] Two-stream Graph Attention Convolutional for Video Action Recognition
    Zhang, Deyuan
    Gao, Hongwei
    Dai, Hailong
    Shi, Xiangbin
    [J]. 2021 IEEE 15TH INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING (BIGDATASE 2021), 2021, : 23 - 27
  • [49] CAST: Cross-Attention in Space and Time for Video Action Recognition
    Lee, Dongho
    Lee, Jongseo
    Choi, Jinwoo
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [50] Metric-Based Attention Feature Learning for Video Action Recognition
    Kim, Dae Ha
    Anvarov, Fazliddin
    Lee, Jun Min
    Song, Byung Cheol
    [J]. IEEE ACCESS, 2021, 9 : 39218 - 39228