End-to-End Dense Video Captioning with Masked Transformer

Cited by: 333
Authors
Zhou, Luowei [1]
Zhou, Yingbo [2]
Corso, Jason J. [1]
Socher, Richard [2]
Xiong, Caiming [2]
Affiliations
[1] Univ Michigan, Ann Arbor, MI 48109 USA
[2] Salesforce Res, San Francisco, CA 94105 USA
DOI
10.1109/CVPR.2018.00911
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description to the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposal event over the encoding feature. This masking network converts the event proposal to a differentiable mask, which ensures the consistency between the proposal and captioning during training. In addition, our model employs a self-attention mechanism, which enables the use of efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on ActivityNet Captions and YouCookII datasets, where we achieved 10.12 and 6.58 METEOR score, respectively.
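Illustrative note (not part of the original record): the abstract does not say what form the differentiable mask takes, so the following is only a minimal sketch of the general idea, not the authors' masking network. It assumes a proposal given as a hypothetical (center, length) pair in frame units and builds a smooth window from two sigmoids, so that the mask varies continuously with the proposal boundaries. The function names and the `sharpness` parameter are assumptions made for this example.

```python
import math


def soft_proposal_mask(num_steps, center, length, sharpness=10.0):
    """Return a smooth mask in (0, 1) over time steps.

    The mask is close to 1 inside [center - length/2, center + length/2]
    and close to 0 outside. It is a smooth function of center and length,
    which is what would let gradients reach the proposal in an autograd
    framework (plain Python here computes values only).
    """
    start = center - length / 2.0
    end = center + length / 2.0
    mask = []
    for t in range(num_steps):
        rise = 1.0 / (1.0 + math.exp(-sharpness * (t - start)))
        fall = 1.0 / (1.0 + math.exp(-sharpness * (end - t)))
        mask.append(rise * fall)
    return mask


def masked_attention(scores, mask, eps=1e-8):
    """Softmax over scores, reweighted by the mask and renormalized."""
    top = max(scores)
    weights = [math.exp(s - top) * m for s, m in zip(scores, mask)]
    total = sum(weights) + eps
    return [w / total for w in weights]


# Hypothetical proposal covering frames 2..6 of a 10-frame clip.
mask = soft_proposal_mask(num_steps=10, center=4.0, length=4.0)
attn = masked_attention([0.0] * 10, mask)
```

With uniform scores, the attention weights concentrate on the frames inside the proposal, which is the sense in which a captioning decoder could be restricted to the proposed event.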
Pages: 8739-8748
Number of pages: 10