Spotting and Aggregating Salient Regions for Video Captioning

Cited by: 18
Authors
Wang, Huiyun [1 ]
Xu, Youjiang [1 ]
Han, Yahong [1 ]
Affiliations
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
Keywords
Video Captioning; Salient Regions; Spatio-Temporal Representation;
DOI
10.1145/3240508.3240677
CLC Classification Number
TP301 [Theory and Methods];
Subject Classification Number
081202;
Abstract
Towards an interpretable video captioning process, we aim to locate salient regions of video objects in step with the sequentially uttered words. This paper proposes a new framework to automatically spot salient regions in each video frame and simultaneously learn a discriminative spatio-temporal representation for video captioning. First, in a Spot Module, we automatically learn the saliency value of each location to separate salient regions from video content as the foreground and the rest as background, by two operations of 'hard separation' and 'soft separation', respectively. Then, in an Aggregate Module, to aggregate the foreground/background descriptors into a discriminative spatio-temporal representation, we devise a trainable video VLAD process to learn the aggregation parameters. Finally, we utilize the attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most of the state-of-the-art methods in terms of the BLEU@4, METEOR, and CIDEr metrics for the task of video captioning. Qualitative examples further show that our method successfully grounds the uttered words in the sequentially salient regions of video objects.
Pages: 1519-1526 (8 pages)
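The two stages named in the abstract can be sketched in code. The following is a minimal NumPy illustration, not the authors' implementation: the function names are hypothetical, the saliency values and cluster centers are taken as given (in the paper they are learned), and a standard NetVLAD-style soft-assignment encoding stands in for the paper's trainable video VLAD.

```python
import numpy as np

def spot_separate(features, saliency, tau=0.5):
    """Spot step: split per-location descriptors into foreground/background.

    features: (N, D) local descriptors for one frame.
    saliency: (N,) saliency value per location, assumed in [0, 1].
    'Hard separation' thresholds the saliency; 'soft separation' reweights
    every descriptor by it. tau is an assumed threshold, not from the paper.
    """
    fg_mask = saliency >= tau
    hard_fg = features[fg_mask]                    # salient regions (foreground)
    hard_bg = features[~fg_mask]                   # the rest (background)
    soft_fg = features * saliency[:, None]         # saliency-weighted foreground
    soft_bg = features * (1.0 - saliency)[:, None] # complement-weighted background
    return hard_fg, hard_bg, soft_fg, soft_bg

def vlad_aggregate(features, centers, alpha=10.0):
    """Aggregate step: NetVLAD-style soft-assignment VLAD encoding.

    features: (N, D) descriptors; centers: (K, D) cluster centers
    (learnable parameters in the paper). Returns a normalized (K*D,) vector.
    """
    # Soft-assign each descriptor to each center (softmax over K distances).
    dists = -alpha * ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = np.exp(dists - dists.max(axis=1, keepdims=True))
    assign /= assign.sum(axis=1, keepdims=True)             # (N, K)
    # Accumulate assignment-weighted residuals to each center.
    residuals = features[:, None, :] - centers[None, :, :]  # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)     # (K, D)
    # Intra-normalize per center, then L2-normalize the flattened vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)
```

In the full framework, the foreground and background descriptors from the Spot step would each be aggregated this way per frame, and the resulting spatio-temporal representations attended over by the caption decoder.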