Spotting and Aggregating Salient Regions for Video Captioning

Cited by: 18
Authors
Wang, Huiyun [1]
Xu, Youjiang [1]
Han, Yahong [1]
Affiliations
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
Keywords
Video Captioning; Salient Regions; Spatio-Temporal Representation
DOI
10.1145/3240508.3240677
Chinese Library Classification
TP301 [Theory, Methods]
Discipline Classification Code
081202
Abstract
Towards an interpretable video captioning process, we aim to locate salient regions of video objects in step with the sequentially uttered words. This paper proposes a new framework that automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning. First, in a Spot Module, we automatically learn the saliency value of each location and separate video content into salient regions as the foreground and the rest as the background, via two operations of 'hard separation' and 'soft separation', respectively. Then, in an Aggregate Module, we devise a trainable video VLAD process that learns aggregation parameters to pool the foreground/background descriptors into a discriminative spatio-temporal representation. Finally, we utilize an attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most state-of-the-art methods on the BLEU@4, METEOR, and CIDEr metrics for video captioning. Qualitative examples further show that our method can successfully align the uttered words with sequentially salient regions of video objects.
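To make the two core components of the abstract concrete, the following PyTorch sketch pairs a soft-separation saliency mask (the Spot Module's 'soft separation') with a NetVLAD-style trainable VLAD aggregation (the Aggregate Module). This is a minimal sketch under assumed conventions: the module names, tensor shapes, and cluster count are illustrative choices, not the authors' released implementation.

# Minimal sketch (not the authors' code): soft saliency separation plus a
# trainable NetVLAD-style aggregation. Shapes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpotSoftSeparation(nn.Module):
    """Predict a per-location saliency value and softly split frame
    features into foreground (salient) and background descriptors."""
    def __init__(self, channels):
        super().__init__()
        self.saliency = nn.Conv2d(channels, 1, kernel_size=1)  # per-location score

    def forward(self, feats):                     # feats: (B, C, H, W)
        s = torch.sigmoid(self.saliency(feats))   # saliency map in [0, 1]
        fg = feats * s                            # 'soft' foreground
        bg = feats * (1.0 - s)                    # complementary background
        # 'hard separation' could instead threshold s, e.g. (s > 0.5).float()
        return fg, bg

class VideoVLAD(nn.Module):
    """Trainable VLAD aggregation with NetVLAD-style soft assignment."""
    def __init__(self, channels, num_clusters=32):
        super().__init__()
        self.assign = nn.Conv2d(channels, num_clusters, kernel_size=1)
        self.centroids = nn.Parameter(torch.randn(num_clusters, channels))

    def forward(self, feats):                     # feats: (B, C, H, W)
        B, C, H, W = feats.shape
        a = F.softmax(self.assign(feats), dim=1)  # (B, K, H, W) soft assignment
        x = feats.view(B, C, -1)                  # (B, C, N) local descriptors
        a = a.view(B, -1, H * W)                  # (B, K, N)
        # assignment-weighted residuals of each descriptor to each centroid
        vlad = torch.einsum('bkn,bcn->bkc', a, x) \
             - a.sum(dim=2, keepdim=True) * self.centroids.unsqueeze(0)
        vlad = F.normalize(vlad, dim=2)           # intra-normalize per cluster
        return F.normalize(vlad.flatten(1), dim=1)  # (B, K*C) representation

In the full framework, the foreground and background streams would each be aggregated this way across frames, and the resulting spatio-temporal representations decoded into captions by an attention-based language model, as the abstract describes.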
Pages: 1519-1526
Page count: 8