SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition

Times Cited: 6
Authors
Zhang, Xing [1 ]
Wu, Zuxuan [2 ]
Jiang, Yu-Gang [2 ]
Affiliations
[1] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Funding
National Key R&D Program of China;
Keywords
Video recognition; scene; object; feature fusion; semantics attention; LATE FUSION;
DOI
10.1109/TMM.2021.3050058
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Video recognition aims at understanding semantic contents that typically involve interactions between humans and related objects under certain scenes. A common practice for improving recognition accuracy is to combine object, scene and action features directly for classification, assuming they are explicitly complementary. In this paper, we break the fusion of the three features down into two pairwise feature-relation modeling processes, which mitigates the difficulty of learning correlations among high-dimensional features. Towards this goal, we introduce a Semantics Attention Module (SAM) that captures the relations of a pair of features by refining the relatively "weak" feature under the guidance of the "strong" feature using attention mechanisms. The refined representation is further combined with the "strong" feature through a residual design for downstream tasks. Two SAMs are applied in a Semantics Attention Network (SAN) to improve video recognition. Extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet v1.3, show that the proposed approach achieves better results while requiring much less computation than alternative methods.
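The abstract's pairwise refine-and-fuse idea can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch rendering, not the authors' implementation: the feature dimensions, the channel-gating form of the attention, and the scene-object-action pairing order are all assumptions made for illustration.

    import torch
    import torch.nn as nn

    class SemanticsAttentionModule(nn.Module):
        # Refines a "weak" feature under guidance from a "strong" feature,
        # then fuses the result with the strong feature via a residual path.
        # All layer shapes here are illustrative assumptions.
        def __init__(self, weak_dim, strong_dim, hidden_dim=512):
            super().__init__()
            self.proj_weak = nn.Linear(weak_dim, hidden_dim)
            self.proj_strong = nn.Linear(strong_dim, hidden_dim)
            # Channel-wise attention weights computed from the strong feature.
            self.attn = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())

        def forward(self, weak, strong):
            w = self.proj_weak(weak)      # (B, hidden_dim)
            s = self.proj_strong(strong)  # (B, hidden_dim)
            refined = w * self.attn(s)    # strong feature gates the weak one
            return s + refined            # residual fusion with the strong feature

    # Hypothetical usage mirroring the "two SAMs" composition: the scene feature
    # is refined by the object feature first, then the fused result is refined
    # under action guidance.
    B = 4
    scene = torch.randn(B, 365)   # assumed scene-feature dimension
    obj = torch.randn(B, 1000)    # assumed object-feature dimension
    action = torch.randn(B, 400)  # assumed action-feature dimension

    sam1 = SemanticsAttentionModule(weak_dim=365, strong_dim=1000)
    sam2 = SemanticsAttentionModule(weak_dim=512, strong_dim=400)

    fused = sam2(sam1(scene, obj), action)
    print(fused.shape)  # torch.Size([4, 512])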
Pages: 313-322
Number of Pages: 10
Related Papers
50 records in total
  • [31] Efficient dual attention SlowFast networks for video action recognition
    Wei, Dafeng
    Tian, Ye
    Wei, Liqing
    Zhong, Hong
    Chen, Siqian
    Pu, Shiliang
    Lu, Hongtao
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2022, 222
  • [32] Alignment-guided Temporal Attention for Video Action Recognition
    Zhao, Yizhou
    Li, Zhenyang
    Guo, Xun
    Lu, Yan
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [33] Multipath Attention and Adaptive Gating Network for Video Action Recognition
    Zhang, Haiping
    Hu, Zepeng
    Yu, Dongjin
    Guan, Liming
    Liu, Xu
    Ma, Conghao
    [J]. NEURAL PROCESSING LETTERS, 2024, 56 (02)
  • [34] SDAN: Stacked Diverse Attention Network for Video Action Recognition
    Zhu, Xiaoguang
    Huang, Siran
    Fan, Wenjing
    Cheng, Yuhao
    Shao, Huaqing
    Liu, Peilin
    [J]. 2021 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2021
  • [35] Composite Object Relation Modeling for Few-Shot Scene Recognition
    Song, Xinhang
    Liu, Chenlong
    Zeng, Haitao
    Zhu, Yaohui
    Chen, Gongwei
    Qin, Xiaorong
    Jiang, Shuqiang
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32: 5678-5691
  • [36] Inter-object discriminative graph modeling for indoor scene recognition
    Song, Chuanxin
    Wu, Hanbo
    Ma, Xin
    [J]. KNOWLEDGE-BASED SYSTEMS, 2024, 302
  • [37] Animation Scene Object Recognition and Modeling Based on Computer Vision Technology
    Shen, Zhengzhong
    Zhang, Wei
    [J]. COMPUTER-AIDED DESIGN AND APPLICATIONS, 2024, 21 (S15): 16-34
  • [38] Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling
    Fang, Shancheng
    Xie, Hongtao
    Zha, Zheng-Jun
    Sun, Nannan
    Tan, Jianlong
    Zhang, Yongdong
    [J]. PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018: 248-256
  • [39] Modeling Scene and Object Contexts for Human Action Retrieval with Few Examples
    Jiang, Yu-Gang
    Li, Zhenguo
    Chang, Shih-Fu
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2011, 21 (05): 674-681
  • [40] MODELING RESEARCH FOR INTERESTED CHARACTER RECOGNITION AND SCENE TRACKING IN VIDEO IMAGE
    Wang, Senhua
    Li, Xiangzhong
    Wang, Weijia
    Bai, Zhenxun
    [J]. 2013 10TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2013: 135-138