Spotting and Aggregating Salient Regions for Video Captioning

Cited by: 18
Authors
Wang, Huiyun [1 ]
Xu, Youjiang [1 ]
Han, Yahong [1 ]
Affiliations
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
Keywords
Video Captioning; Salient Regions; Spatio-Temporal Representation
DOI
10.1145/3240508.3240677
CLC Number
TP301 [Theory, Methods]
Subject Classification Code
081202
Abstract
Towards an interpretable video captioning process, we aim to locate salient regions of video objects in step with the sequentially uttered words. This paper proposes a new framework that automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning. First, in a Spot Module, we automatically learn a saliency value for each location, separating salient regions of the video content as foreground from the rest as background via two operations, 'hard separation' and 'soft separation', respectively. Then, in an Aggregate Module, we devise a trainable video VLAD process that learns aggregation parameters to combine the foreground/background descriptors into a discriminative spatio-temporal representation. Finally, we utilize an attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most state-of-the-art methods on the Bleu@4, METEOR, and CIDEr metrics for video captioning. Qualitative examples also show that our method successfully grounds the uttered words in sequentially salient regions of video objects.
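The abstract's Spot and Aggregate stages can be illustrated with a minimal sketch. This is not the paper's implementation: the saliency scorer parameters (`w`, `b`), the softmax temperature `alpha`, and the cluster centers are all hypothetical stand-ins; the sketch only assumes a sigmoid per-location saliency score for 'soft separation' and a NetVLAD-style soft-assignment aggregation, as the abstract's "trainable video VLAD" suggests.

```python
import numpy as np

def spot_soft_separation(features, w, b):
    """'Soft separation' sketch: a learned saliency score in (0, 1) per
    location weights each descriptor into foreground and background parts.
    `w` and `b` are hypothetical parameters of a one-layer saliency scorer."""
    # features: (N, D) local descriptors of one frame
    scores = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # sigmoid saliency
    fg = scores[:, None] * features            # foreground: saliency-weighted
    bg = (1.0 - scores)[:, None] * features    # background: complement-weighted
    return fg, bg

def vlad_aggregate(descriptors, centers, alpha=1.0):
    """NetVLAD-style aggregation: softly assign descriptors to cluster
    centers, accumulate residuals, then intra-normalize and L2-normalize."""
    # descriptors: (N, D), centers: (K, D)
    sq_dist = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    logits = -alpha * sq_dist                         # (N, K) assignment logits
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)       # soft assignment weights
    residuals = descriptors[:, None, :] - centers[None, :, :]    # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)          # (K, D)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-norm
    flat = vlad.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)      # global L2 normalization
```

Note that the soft foreground and background parts sum back to the original features, so no content is discarded; only the subsequent aggregation treats the two streams differently.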
Pages: 1519 - 1526
Page count: 8
Related Papers
50 records in total
  • [1] Video content representation as salient regions of activity
    Moënne-Loccoz, N
    Bruno, E
    Marchand-Maillet, S
    IMAGE AND VIDEO RETRIEVAL, PROCEEDINGS, 2004, 3115 : 384 - 392
  • [2] Catching the Temporal Regions-of-Interest for Video Captioning
    Yang, Ziwei
    Han, Yahong
    Wang, Zheng
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 146 - 153
  • [3] Decoding Task States by Spotting Salient Patterns at Time Points and Brain Regions
    Chan, Yi Hao
    Gupta, Sukrit
    Kasun, L. L. Chamara
    Rajapakse, Jagath C.
    MACHINE LEARNING IN CLINICAL NEUROIMAGING AND RADIOGENOMICS IN NEURO-ONCOLOGY, MLCN 2020, RNO-AI 2020, 2020, 12449 : 88 - 97
  • [4] Staying at the Technology Forefront by Spotting the Reverse Salient: The Case of Digital Video Broadcasting
    Dedehayir, O.
    Hornsby, A.
    2008 IEEE INTERNATIONAL CONFERENCE ON MANAGEMENT OF INNOVATION AND TECHNOLOGY, VOLS 1-3, 2008, : 189 - +
  • [5] ROBUST VIDEO FINGERPRINTS USING POSITIONS OF SALIENT REGIONS
    Ouali, Chahid
    Dumouchel, Pierre
    Gupta, Vishwa
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 3041 - 3045
  • [6] Multi-scale Spatial-Temporal Feature Aggregating for Video Salient Object Segmentation
    Mu, Changhong
    Yuan, Zebin
    Ouyang, Xiuqin
    Wang, Bo
    2019 IEEE 4TH INTERNATIONAL CONFERENCE ON SIGNAL AND IMAGE PROCESSING (ICSIP 2019), 2019, : 224 - 229
  • [7] Salient Feature Extraction Mechanism for Image Captioning
    Wang X.
    Song Y.-H.
    Zhang Y.-L.
    Zidonghua Xuebao/Acta Automatica Sinica, 2022, 48 (03): 735 - 746
  • [8] Video captioning – a survey
    Vaishnavi J.
    Narmatha V.
    Multimedia Tools and Applications, 2025, 84 (2) : 947 - 978
  • [9] Image/video captioning
    Ushiku, Yoshitaka
    2018, Inst. of Image Information and Television Engineers (72): 650 - 654
  • [10] Salient object detection by aggregating contextual information
    Liu, Yan
    Zhang, Yunzhou
    Liu, Shichang
    Coleman, Sonya
    Wang, Zhenyu
    Qiu, Feng
    PATTERN RECOGNITION LETTERS, 2022, 153 : 190 - 199