Transformer-Based Multi-Scale Feature Integration Network for Video Saliency Prediction

Cited by: 22
Authors
Zhou, Xiaofei [1]
Wu, Songhe [1]
Shi, Ran [2]
Zheng, Bolun [1]
Wang, Shuai [3, 4]
Yin, Haibing [4, 5]
Zhang, Jiyong [1]
Yan, Chenggang [1]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Automat, Hangzhou 310018, Peoples R China
[2] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[3] Hangzhou Dianzi Univ, Sch Cyberspace, Hangzhou 310018, Peoples R China
[4] Hangzhou Dianzi Univ, Lishui Inst, Lishui 323000, Peoples R China
[5] Hangzhou Dianzi Univ, Sch Commun Engn, Hangzhou 310018, Peoples R China
Keywords
Video saliency prediction; transformer; semantic guidance; hierarchical decoder; attention; FUSION; MODEL;
DOI
10.1109/TCSVT.2023.3278410
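Chinese Library Classification number
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline classification code
0808; 0809;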
Abstract
Most cutting-edge video saliency prediction models rely on spatiotemporal features extracted by 3D convolutions, which are effective at capturing local contextual cues. However, 3D convolutions cannot effectively capture long-term spatiotemporal dependencies in videos. To address this limitation, we propose a novel Transformer-based Multi-scale Feature Integration Network (TMFI-Net) for video saliency prediction, which consists of a semantic-guided encoder and a hierarchical decoder. First, starting from Transformer-based multi-level spatiotemporal features, the semantic-guided encoder enhances them by injecting the high-level feature into the feature at each level via a top-down pathway and a longitudinal connection, which endows the multi-level spatiotemporal features with rich contextual information. In this way, the features are steered to pay more attention to salient regions. Second, the hierarchical decoder employs a multi-dimensional attention (MA) module to enhance features jointly along the channel, temporal, and spatial dimensions. The hierarchical decoder then deploys a progressive decoding block to produce an initial saliency prediction, which provides a coarse localization of salient regions. Finally, considering the complementarity of the different saliency predictions, we integrate all initial saliency predictions into the final saliency map. Comprehensive experiments on four video saliency datasets demonstrate that our model outperforms state-of-the-art video saliency models. The code is available at https://github.com/wusonghe/TMFI-Net.
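The idea of reweighting features jointly along the channel, temporal, and spatial dimensions can be illustrated with a small sketch. This is not the authors' MA module, which presumably uses learned layers; it is a parameter-free stand-in that assumes a feature tensor of shape (C, T, H, W) and derives each attention weight from a sigmoid over a mean-pooled descriptor. The function name `multi_dim_attention` and the pooling choice are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def multi_dim_attention(feat):
    """Reweight a (C, T, H, W) feature tensor along three axes at once.

    Hypothetical, parameter-free stand-in for a learned attention module:
    each weight is a sigmoid of the mean over the remaining axes.
    """
    if feat.ndim != 4:
        raise ValueError("expected a tensor of shape (C, T, H, W)")
    # Channel weights: one scalar per channel, shape (C, 1, 1, 1).
    channel_w = sigmoid(feat.mean(axis=(1, 2, 3), keepdims=True))
    # Temporal weights: one scalar per frame, shape (1, T, 1, 1).
    temporal_w = sigmoid(feat.mean(axis=(0, 2, 3), keepdims=True))
    # Spatial weights: one scalar per location, shape (1, 1, H, W).
    spatial_w = sigmoid(feat.mean(axis=(0, 1), keepdims=True))
    # Broadcasting applies all three weightings jointly.
    return feat * channel_w * temporal_w * spatial_w


# Usage: random features with 8 channels, 4 frames, 6x6 spatial size.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 6, 6))
refined = multi_dim_attention(features)
print(refined.shape)
```

Because every weight lies strictly between 0 and 1, this sketch can only attenuate activations; a learned module would typically add trainable projections before the sigmoid.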
Pages: 7696-7707
Page count: 12