Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Cited by: 110
Authors
Wang, Jingwen [1, 2]
Jiang, Wenhao [2]
Ma, Lin [2]
Liu, Wei [2]
Xu, Yong [1]
Affiliations
[1] South China University of Technology, Guangzhou, Guangdong, People's Republic of China
[2] Tencent AI Lab, Bellevue, WA, USA
DOI
10.1109/CVPR.2018.00751
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges on this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in the same captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to dynamically balance the contributions from the current event and its surrounding contexts. We empirically show that our attentively fused event representation is superior to the proposal hidden states or video contents alone. By coupling proposal and captioning modules into one unified framework, our model outperforms the state of the art on the ActivityNet Captions dataset with a relative gain of over 100% (METEOR score increases from 4.82 to 9.65).
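The context gating idea in the abstract can be illustrated with a small sketch. This is only an assumed, simplified form and not the authors' formulation: it supposes that the event and context features share one dimension, that the gate is a sigmoid over two linear projections plus a bias, and that the decoder input is the concatenation of the gated event and gated context. The names `context_gate`, `w_event`, `w_context`, and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def context_gate(event_feat, context_feat, w_event, w_context, bias):
    """Balance event and context features with an element-wise gate.

    Assumed form (hypothetical, for illustration only):
        g     = sigmoid(W_e @ event + W_c @ context + b)
        fused = [(1 - g) * event, g * context]
    """
    projected_event = matvec(w_event, event_feat)
    projected_context = matvec(w_context, context_feat)
    gate = [
        sigmoid(pe + pc + b)
        for pe, pc, b in zip(projected_event, projected_context, bias)
    ]
    gated_event = [(1.0 - g) * e for g, e in zip(gate, event_feat)]
    gated_context = [g * c for g, c in zip(gate, context_feat)]
    # Concatenate so a downstream decoder sees both parts.
    return gate, gated_event + gated_context


if __name__ == "__main__":
    d = 3
    zeros = [[0.0] * d for _ in range(d)]
    # With all-zero weights and bias the gate is 0.5 everywhere,
    # so event and context contribute equally.
    gate, fused = context_gate(
        [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], zeros, zeros, [0.0] * d
    )
    print(gate)
    print(fused)
```

In a trained model the weights would be learned, so the gate could lean toward the event for some feature dimensions and toward the surrounding context for others.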
Pages: 7190-7198
Number of pages: 9