Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Cited by: 110
Authors
Wang, Jingwen [1 ,2 ]
Jiang, Wenhao [2 ]
Ma, Lin [2 ]
Liu, Wei [2 ]
Xu, Yong [1 ]
Affiliations
[1] South China Univ Technol, Guangzhou, Guangdong, Peoples R China
[2] Tencent AI Lab, Bellevue, WA USA
DOI
10.1109/CVPR.2018.00751
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges on this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in the previous works, resulting in the same captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to dynamically balance the contributions from the current event and its surrounding contexts. We empirically show that our attentively fused event representation is superior to the proposal hidden states or video contents alone. By coupling proposal and captioning modules into one unified framework, our model outperforms the state of the art on the ActivityNet Captions dataset with a relative gain of over 100% (METEOR score increases from 4.82 to 9.65).
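The two mechanisms named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the names (`attentive_fusion`, `context_gate`), the feature dimensions, and the mean-pooled context stand-in are all assumptions made for the example; the paper attends over proposal-module hidden states and C3D features with learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attentive_fusion(decoder_state, clip_feats, W_att):
    # Attention weights over the clip features inside one event
    # (e.g., C3D features), conditioned on the decoder state.
    scores = clip_feats @ (W_att @ decoder_state)
    alpha = softmax(scores)
    return alpha @ clip_feats  # weighted sum: the fused event representation

def context_gate(event_repr, context_repr, W_g, b_g):
    # Per-dimension gate in (0, 1) balancing the current event
    # against its surrounding context, as the abstract describes.
    g = sigmoid(W_g @ np.concatenate([event_repr, context_repr]) + b_g)
    return g * event_repr + (1.0 - g) * context_repr

# Toy shapes and random parameters, purely illustrative.
rng = np.random.default_rng(0)
d = 8                                 # feature dimension (assumed)
clips = rng.normal(size=(5, d))       # 5 clips inside one proposal
h_dec = rng.normal(size=d)            # decoder hidden state
W_att = rng.normal(size=(d, d))
W_g = rng.normal(size=(d, 2 * d))
b_g = np.zeros(d)

event = attentive_fusion(h_dec, clips, W_att)
ctx = clips.mean(axis=0)              # crude stand-in for surrounding context
fused = context_gate(event, ctx, W_g, b_g)
print(fused.shape)
```

The gate interpolates element-wise between the event representation and its context, so the decoder input can lean on either source depending on the learned parameters.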
Pages: 7190-7198
Page count: 9
Related Papers
50 records
  • [21] Hierarchical Language Modeling for Dense Video Captioning
    Dave, Jaivik
    Padmavathi, S.
    INVENTIVE COMPUTATION AND INFORMATION TECHNOLOGIES, ICICIT 2021, 2022, 336 : 421 - 431
  • [22] Dense video captioning based on local attention
    Qian, Yong
    Mao, Yingchi
    Chen, Zhihao
    Li, Chang
    Bloh, Olano Teah
    Huang, Qian
    IET IMAGE PROCESSING, 2023, 17 (09) : 2673 - 2685
  • [23] Accelerated masked transformer for dense video captioning
    Yu, Zhou
    Han, Nanjia
    NEUROCOMPUTING, 2021, 445 : 72 - 80
  • [24] TopicDVC: Dense Video Captioning with Topic Guidance
    Chen, Wei
    2024 IEEE 10TH INTERNATIONAL CONFERENCE ON EDGE COMPUTING AND SCALABLE CLOUD, EDGECOM 2024, 2024, : 82 - 87
  • [25] Dense Captioning with Joint Inference and Visual Context
    Yang, Linjie
    Tang, Kevin
    Yang, Jianchao
    Li, Li-Jia
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 1978 - 1987
  • [26] Incorporating attentive multi-scale context information for image captioning
    Prudviraj, Jeripothula
    Sravani, Yenduri
    Mohan, C. Krishna
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (07) : 10017 - 10037
  • [28] Jointly Localizing and Describing Events for Dense Video Captioning
    Li, Yehao
    Yao, Ting
    Pan, Yingwei
    Chao, Hongyang
    Mei, Tao
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7492 - 7500
  • [29] Step by Step: A Gradual Approach for Dense Video Captioning
    Choi, Wangyu
    Chen, Jiasi
    Yoon, Jongwon
    IEEE ACCESS, 2023, 11 : 51949 - 51959
  • [30] Dense video captioning using unsupervised semantic information
    Estevam, Valter
    Laroca, Rayson
    Pedrini, Helio
    Menotti, David
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2025, 107