Spatio-Temporal Ranked-Attention Networks for Video Captioning

Cited by: 0
Authors
Cherian, Anoop [1 ]
Wang, Jue [2 ]
Hori, Chiori [1 ]
Marks, Tim K. [1 ]
Affiliations
[1] Mitsubishi Elect Res Labs, Cambridge, MA 02139 USA
[2] Australian Natl Univ, Canberra, ACT, Australia
Keywords
DOI
N/A
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal evolutions, an effective captioning model should be able to attend to these different cues selectively. To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatiotemporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame. We propose a novel LSTM-based temporal ranking function, which we call ranked attention, for the ST model to capture action dynamics. Our entire framework is trained end-to-end. We provide experiments on two benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
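The two attention orders described in the abstract can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names, dot-product scoring against the language state, and soft (rather than hard) frame selection in the TS branch are all simplifying assumptions, and the paper's LSTM-based ranked attention for temporal pooling is replaced here by plain softmax pooling.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def st_attention(features, lang_state):
    """Spatio-temporal (ST) order: attend to regions within each frame
    first, then pool the resulting per-frame vectors over time.
    features: (T, R, D) frames x regions x feature dim; lang_state: (D,)."""
    region_scores = features @ lang_state                       # (T, R)
    region_w = softmax(region_scores, axis=1)                   # spatial attention per frame
    frame_vecs = (region_w[..., None] * features).sum(axis=1)   # (T, D)
    temporal_w = softmax(frame_vecs @ lang_state)               # (T,) temporal pooling weights
    return (temporal_w[:, None] * frame_vecs).sum(axis=0)       # (D,)

def ts_attention(features, lang_state):
    """Temporo-spatial (TS) order: decide which frame to attend to first
    (soft selection over mean-pooled frames), then attend spatially
    within that selection."""
    frame_means = features.mean(axis=1)                         # (T, D)
    temporal_w = softmax(frame_means @ lang_state)              # (T,) frame selection
    pooled = (temporal_w[:, None, None] * features).sum(axis=0) # (R, D) soft-selected frame
    region_w = softmax(pooled @ lang_state)                     # (R,) spatial attention
    return (region_w[:, None] * pooled).sum(axis=0)             # (D,)
```

In the full model, both branch outputs would condition the language decoder at each word step, so the captioner can weigh the ST and TS cues selectively as the abstract describes.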
Pages: 1606-1615
Page Count: 10
Related Papers (50 total)
  • [1] Diverse Video Captioning by Adaptive Spatio-temporal Attention
    Ghaderi, Zohreh
    Salewski, Leonard
    Lensch, Hendrik P. A.
    [J]. PATTERN RECOGNITION, DAGM GCPR 2022, 2022, 13485 : 409 - 425
  • [2] Spatio-Temporal Attention Models for Grounded Video Captioning
    Zanfir, Mihai
    Marinoiu, Elisabeta
    Sminchisescu, Cristian
    [J]. COMPUTER VISION - ACCV 2016, PT IV, 2017, 10114 : 104 - 119
  • [3] Video Captioning via Sentence Augmentation and Spatio-Temporal Attention
    Chen, Tseng-Hung
    Zeng, Kuo-Hao
    Hsu, Wan-Ting
    Sun, Min
    [J]. COMPUTER VISION - ACCV 2016 WORKSHOPS, PT I, 2017, 10116 : 269 - 286
  • [4] Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism
    Dashan Guo
    Wei Li
    Xiangzhong Fang
    [J]. Neural Processing Letters, 2017, 46 : 313 - 328
  • [5] Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism
    Guo, Dashan
    Li, Wei
    Fang, Xiangzhong
    [J]. NEURAL PROCESSING LETTERS, 2017, 46 (01) : 313 - 328
  • [6] Spatio-Temporal Memory Attention for Image Captioning
    Ji, Junzhong
    Xu, Cheng
    Zhang, Xiaodan
    Wang, Boyue
    Song, Xinhang
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 7615 - 7628
  • [7] Exploring the Spatio-Temporal Aware Graph for video captioning
    Xue, Ping
    Zhou, Bing
    [J]. IET COMPUTER VISION, 2022, 16 (05) : 456 - 467
  • [8] Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
    Zhao, Zhou
    Yang, Qifan
    Cai, Deng
    He, Xiaofei
    Zhuang, Yueting
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3518 - 3524
  • [9] Video Captioning With Object-Aware Spatio-Temporal Correlation and Aggregation
    Zhang, Junchao
    Peng, Yuxin
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 (29) : 6209 - 6222
  • [10] Spatio-temporal video error concealment using priority-ranked
    Chen, Y
    Sun, XY
    Wu, F
    Lin, ZK
    Li, SP
    [J]. 2005 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), VOLS 1-5, 2005, : 1741 - 1744