End-to-End Dense Video Captioning with Masked Transformer

Cited by: 333

Authors:

Zhou, Luowei [1]

Zhou, Yingbo [2]

Corso, Jason J. [1]

Socher, Richard [2]

Xiong, Caiming [2]

Affiliations:

[1] Univ Michigan, Ann Arbor, MI 48109 USA

[2] Salesforce Res, San Francisco, CA 94105 USA

Source:

2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2018

Keywords:

DOI:

10.1109/CVPR.2018.00911

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description to the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposal event over the encoding feature. This masking network converts the event proposal to a differentiable mask, which ensures the consistency between the proposal and captioning during training. In addition, our model employs a self-attention mechanism, which enables the use of efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on ActivityNet Captions and YouCookII datasets, where we achieved 10.12 and 6.58 METEOR score, respectively.
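
The abstract's idea of turning an event proposal into a differentiable mask over the encoded frames can be illustrated with a small sketch. This is not the paper's masking network, whose form the abstract does not specify. It is a simplified stand-in that assumes a proposal is given as a normalized center and length, and builds a smooth window from two sigmoids. The function names, the `sharpness` parameter, and the renormalization step are all hypothetical choices made for illustration.

```python
import math


def soft_proposal_mask(center, length, num_frames, sharpness=50.0):
    """Per-frame weights in (0, 1), close to 1 inside the proposal.

    Assumption: center and length are fractions of the video duration.
    The weights vary smoothly with center and length, so gradients from a
    captioning loss could in principle reach the proposal parameters.
    """
    start = center - length / 2.0
    end = center + length / 2.0

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    mask = []
    for t in range(num_frames):
        pos = (t + 0.5) / num_frames  # normalized midpoint of frame t
        rising = sigmoid(sharpness * (pos - start))   # ~1 after the start
        falling = sigmoid(sharpness * (end - pos))    # ~1 before the end
        mask.append(rising * falling)
    return mask


def masked_attention(scores, mask):
    """Softmax over frame scores, re-weighted by the soft mask."""
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) * w for s, w in zip(scores, mask)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    mask = soft_proposal_mask(center=0.5, length=0.4, num_frames=10)
    attention = masked_attention([0.0] * 10, mask)
    print([round(w, 3) for w in mask])
    print([round(a, 3) for a in attention])
```

With uniform scores, the attention concentrates on the frames inside the proposal window, which is the restriction the abstract describes for the captioning decoder.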

Pages: 8739-8748

Page count: 10

50 items in total

[1] End-to-End Dense Video Captioning with Parallel Decoding
Wang, Teng
Zhang, Ruimao
Lu, Zhichao
Zheng, Feng
Cheng, Ran
Luo, Ping
[J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021 : 6827 - 6837
[2] End-to-End Video Captioning
Olivastri, Silvio
Singh, Gurkirt
Cuzzolin, Fabio
[J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019 : 1474 - 1482
[3] Accelerated masked transformer for dense video captioning
Yu, Zhou
Han, Nanjia
[J]. NEUROCOMPUTING, 2021, 445 : 72 - 80
[4] End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning
Ran, Yuting
Fang, Bin
Chen, Lei
Wei, Xuekai
Xian, Weizhi
Zhou, Mingliang
[J]. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (04)
[5] DCMSTRD: End-to-end Dense Captioning via Multi-Scale Transformer Decoding
Shao, Zhuang
Han, Jungong
Debattista, Kurt
Pang, Yanwei
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 7581 - 7593
[6] Video Caption Based Searching Using End-to-End Dense Captioning and Sentence Embeddings
Aggarwal, Akshay
Chauhan, Aniruddha
Kumar, Deepika
Mittal, Mamta
Roy, Sudipta
Kim, Tai-hoon
[J]. SYMMETRY-BASEL, 2020, 12 (06)
[7] End-to-End Transformer Based Model for Image Captioning
Wang, Yiyu
Xu, Jungang
Sun, Yingfei
[J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022 : 2585 - 2594
[8] End-to-end Generative Pretraining for Multimodal Video Captioning
Seo, Paul Hongsuck
Nagrani, Arsha
Arnab, Anurag
Schmid, Cordelia
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022 : 17938 - 17947
[9] End-to-End Video Captioning with Multitask Reinforcement Learning
Li, Lijun
Gong, Boqing
[J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019 : 339 - 348
[10] End-to-End Video Text Spotting with Transformer
Wu, Weijia
Cai, Yuanqiang
Shen, Chunhua
Zhang, Debing
Fu, Ying
Zhou, Hong
Luo, Ping
[J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (09) : 4019 - 4035