Spatio-Temporal Ranked-Attention Networks for Video Captioning

Cited by: 0
Authors
Cherian, Anoop [1 ]
Wang, Jue [2 ]
Hori, Chiori [1 ]
Marks, Tim K. [1 ]
Affiliations
[1] Mitsubishi Elect Res Labs, Cambridge, MA 02139 USA
[2] Australian Natl Univ, Canberra, ACT, Australia
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal evolutions, an effective captioning model should be able to attend to these different cues selectively. To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first selects a single frame to attend to, then applies spatial attention within that frame. We propose a novel LSTM-based temporal ranking function, which we call ranked attention, for the ST model to capture action dynamics. Our entire framework is trained end-to-end. We provide experiments on two benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
Pages: 1606-1615
Page count: 10
Related Papers
50 records in total
  • [31] Spatio-temporal Sampling for Video
    Shankar, Mohan
Pitsianis, Nikos P.
    Brady, David
    [J]. IMAGE RECONSTRUCTION FROM INCOMPLETE DATA V, 2008, 7076
  • [32] Spatio-Temporal Self-Attention Network for Video Saliency Prediction
    Wang, Ziqiang
    Liu, Zhi
    Li, Gongyang
    Wang, Yang
    Zhang, Tianhong
    Xu, Lihua
    Wang, Jijun
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 1161 - 1174
  • [33] Unified Spatio-Temporal Attention Networks for Action Recognition in Videos
    Li, Dong
    Yao, Ting
    Duan, Ling-Yu
    Mei, Tao
    Rui, Yong
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (02) : 416 - 428
  • [34] Attention Embedded Spatio-Temporal Network for Video Salient Object Detection
    Huang, Lili
    Yan, Pengxiang
    Li, Guanbin
    Wang, Qing
    Lin, Liang
    [J]. IEEE ACCESS, 2019, 7 : 166203 - 166213
  • [35] Temporal Attention Feature Encoding for Video Captioning
    Kim, Nayoung
    Ha, Seong Jong
    Kang, Je-Won
    [J]. 2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 1279 - 1282
  • [36] STSI: Efficiently Mine Spatio-Temporal Semantic Information between Different Multimodal for Video Captioning
    Xiong, Huiyu
    Wang, Lanxiao
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2022,
  • [37] Attention modulates spatio-temporal grouping
    Aydin, Murat
    Herzog, Michael H.
    Oegmen, Haluk
    [J]. VISION RESEARCH, 2011, 51 (04) : 435 - 446
  • [38] Neurocomputational approaches to spatio-temporal attention
    Simione, Luca
    Gigliotta, Onofrio
    [J]. COGNITIVE PROCESSING, 2015, 16 : S30 - S30
  • [39] ATTENTION NETWORKS IN THE LEFT AND RIGHT HEMISPHERES: A SPATIO-TEMPORAL EEG STUDY
    Sasin, Edyta
    Szumska, Izabela
    Jaskowski, Piotr
    [J]. PSYCHOPHYSIOLOGY, 2009, 46 : S43 - S43
  • [40] A video object detector with Spatio-Temporal Attention Module for micro UAV detection
    Xu, Haozhi
    Ling, Zhigang
    Yuan, Xiaofang
    Wang, Yaonan
    [J]. NEUROCOMPUTING, 2024, 597