Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism

Cited by: 5
Authors
Guo, Dashan [1 ]
Li, Wei [1 ]
Fang, Xiangzhong [1 ]
Institutions
[1] Shanghai Jiao Tong Univ, Inst Image Commun & Informat Proc, Shanghai 200240, Peoples R China
Keywords
Video captioning; Recurrent convolution networks; Spatio-temporal contexts; Channel attention mechanism; RECOGNITION;
DOI
10.1007/s11063-017-9591-9
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To generate natural language descriptions for videos, there has been tremendous interest in developing deep neural networks that integrate temporal structures of different categories. Considering the spatial and temporal domains inherent in video frames, we contend that both the video dynamics and the spatio-temporal contexts are important for captioning, and that they correspond to two different temporal structures. However, while the video dynamics have been well investigated, the spatio-temporal contexts have not received sufficient attention. In this paper, we take both structures into account and propose a novel recurrent convolution model for captioning. First, for a comprehensive and detailed representation, we propose to aggregate the local and global spatio-temporal contexts in the recurrent convolution networks. Second, to capture subtler temporal dynamics, we introduce a channel attention mechanism, which helps to reveal how the frame feature maps are involved in the captioning process. Finally, a qualitative comparison with several variants of our model demonstrates the effectiveness of incorporating these two structures. Moreover, experiments on the YouTube2Text dataset show that the proposed method achieves performance competitive with other state-of-the-art methods.
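The channel attention mechanism summarized in the abstract reweights each channel of a frame's feature maps by a learned scalar gate. A minimal sketch of one common formulation (squeeze-and-excitation style; the record does not give the paper's exact design, so `channel_attention` and the projections `w1`/`w2` are illustrative assumptions):

```python
import numpy as np

def channel_attention(feature_maps, w1, w2):
    """Gate each channel of (C, H, W) feature maps by a learned weight.

    w1: (C//r, C) squeeze projection, w2: (C, C//r) excitation projection.
    Returns reweighted feature maps of the same shape.
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = feature_maps.mean(axis=(1, 2))
    # Excitation: bottleneck MLP with ReLU, then a sigmoid gate in (0, 1)
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))
    # Rescale every channel by its attention weight
    return feature_maps * s[:, None, None]

# Toy usage on random "frame features"
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))       # C=8 channels, 4x4 spatial grid
w1 = rng.standard_normal((2, 8)) * 0.1   # reduction ratio r = 4
w2 = rng.standard_normal((8, 2)) * 0.1
y = channel_attention(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

Because the gate lies in (0, 1), every channel is attenuated rather than amplified; in the captioning setting such gates would be conditioned on the decoder state so that different channels are emphasized at different generation steps.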
Pages: 313-328 (16 pages)
Related papers
50 records
  • [1] Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism
    Dashan Guo
    Wei Li
    Xiangzhong Fang
    [J]. Neural Processing Letters, 2017, 46 : 313 - 328
  • [2] Diverse Video Captioning by Adaptive Spatio-temporal Attention
    Ghaderi, Zohreh
    Salewski, Leonard
    Lensch, Hendrik P. A.
    [J]. PATTERN RECOGNITION, DAGM GCPR 2022, 2022, 13485 : 409 - 425
  • [3] Spatio-Temporal Attention Models for Grounded Video Captioning
    Zanfir, Mihai
    Marinoiu, Elisabeta
    Sminchisescu, Cristian
    [J]. COMPUTER VISION - ACCV 2016, PT IV, 2017, 10114 : 104 - 119
  • [4] Video Captioning via Sentence Augmentation and Spatio-Temporal Attention
    Chen, Tseng-Hung
    Zeng, Kuo-Hao
    Hsu, Wan-Ting
    Sun, Min
    [J]. COMPUTER VISION - ACCV 2016 WORKSHOPS, PT I, 2017, 10116 : 269 - 286
  • [5] Spatio-Temporal Ranked-Attention Networks for Video Captioning
    Cherian, Anoop
    Wang, Jue
    Hori, Chiori
    Marks, Tim K.
    [J]. 2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1606 - 1615
  • [6] Spatio-Temporal Memory Attention for Image Captioning
    Ji, Junzhong
    Xu, Cheng
    Zhang, Xiaodan
    Wang, Boyue
    Song, Xinhang
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 7615 - 7628
  • [7] Spatio-Temporal Video Denoising Based on Attention Mechanism
    Ji, Kai
    Lei, Weimin
    Zhang, Wei
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2023, 37 (06)
  • [8] Exploring the Spatio-Temporal Aware Graph for video captioning
    Xue, Ping
    Zhou, Bing
    [J]. IET COMPUTER VISION, 2022, 16 (05) : 456 - 467
  • [9] KSF-ST: Video Captioning Based on Key Semantic Frames Extraction and Spatio-Temporal Attention Mechanism
    Qu, Zhaowei
    Zhang, Luhan
    Wang, Xiaoru
    Cao, Bingyu
    Li, Yueli
    Li, Fu
    [J]. 2020 16TH INTERNATIONAL WIRELESS COMMUNICATIONS & MOBILE COMPUTING CONFERENCE, IWCMC, 2020, : 1388 - 1393
  • [10] Capturing the spatio-temporal continuity for video semantic segmentation
    Chen, Xin
    Wu, Aming
    Han, Yahong
    [J]. IET IMAGE PROCESSING, 2019, 13 (14) : 2813 - 2820