End-to-end Generative Pretraining for Multimodal Video Captioning

Cited by: 62

Authors:

Seo, Paul Hongsuck [1]

Nagrani, Arsha [1]

Arnab, Anurag [1]

Schmid, Cordelia [1]

Affiliations:

[1] Google Res, Mountain View, CA 94043 USA

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022

Keywords:

DOI:

10.1109/CVPR52688.2022.01743

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective: we generate future utterances given the present multimodal context, and also the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption from raw pixels and transcribed speech directly. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.
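
The bidirectional generation objective described in the abstract can be illustrated with a minimal data-preparation sketch. This is not the authors' implementation: the class `GenerationExample`, the function `make_bidirectional_examples`, and the field names are hypothetical, and the sketch assumes that both directions are paired with the same clip frames, which the abstract does not specify. It only shows how one unlabelled clip with a present and a future utterance could yield two (input, target) pairs for an encoder-decoder model.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class GenerationExample:
    """One training pair for an encoder-decoder model (hypothetical container)."""
    direction: str            # "forward" or "backward"
    frames: Tuple[str, ...]   # stand-in for visual input (assumption: same frames in both directions)
    text_input: str           # utterance given to the encoder alongside the frames
    target: str               # utterance the decoder is trained to generate


def make_bidirectional_examples(
    frames: Sequence[str],
    present_utterance: str,
    future_utterance: str,
) -> List[GenerationExample]:
    """Build forward and backward generation pairs from one unlabelled clip."""
    clip = tuple(frames)
    # Forward: present context in, future utterance out.
    forward = GenerationExample("forward", clip, present_utterance, future_utterance)
    # Backward: future utterance in, present utterance out.
    backward = GenerationExample("backward", clip, future_utterance, present_utterance)
    return [forward, backward]


examples = make_bidirectional_examples(
    frames=["frame_0", "frame_1"],
    present_utterance="first, chop the onions",
    future_utterance="then add them to the pan",
)
for ex in examples:
    print(ex.direction, "|", ex.text_input, "->", ex.target)
```

In this sketch the two examples differ only in which utterance is the input and which is the target, so the same clip supplies supervision in both directions without any manually written caption.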


Pages: 17938-17947

Page count: 10

50 entries in total

[1] End-to-End Video Captioning
Olivastri, Silvio
Singh, Gurkirt
Cuzzolin, Fabio
[J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1474 - 1482
[2] End-to-End Dense Video Captioning with Masked Transformer
Zhou, Luowei
Zhou, Yingbo
Corso, Jason J.
Socher, Richard
Xiong, Caiming
[J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8739 - 8748
[3] End-to-End Dense Video Captioning with Parallel Decoding
Wang, Teng
Zhang, Ruimao
Lu, Zhichao
Zheng, Feng
Cheng, Ran
Luo, Ping
[J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 6827 - 6837
[4] End-to-End Video Captioning with Multitask Reinforcement Learning
Li, Lijun
Gong, Boqing
[J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 339 - 348
[5] Multimodal Pretraining for Dense Video Captioning
Huang, Gabriel
Pang, Bo
Zhu, Zhenhai
Rivera, Clara
Soricut, Radu
[J]. 1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 470 - 490
[6] SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Lin, Kevin
Li, Linjie
Lin, Chung-Ching
Ahmed, Faisal
Gan, Zhe
Liu, Zicheng
Lu, Yumao
Wang, Lijuan
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17928 - 17937
[7] An end-to-end generative framework for video segmentation and recognition
Kuehne, Hilde
Gall, Juergen
Serre, Thomas
[J]. 2016 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2016), 2016,
[8] A Generative Appearance Model for End-to-end Video Object Segmentation
Johnander, Joakim
Danelljan, Martin
Brissman, Emil
Khan, Fahad Shahbaz
Felsberg, Michael
[J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 8945 - 8954
[9] End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning
Ran, Yuting
Fang, Bin
Chen, Lei
Wei, Xuekai
Xian, Weizhi
Zhou, Mingliang
[J]. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (04)
[10] End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
Yu, Youngjae
Ko, Hyungjin
Choi, Jongwook
Kim, Gunhee
[J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3261 - 3269
