End-to-end Generative Pretraining for Multimodal Video Captioning

Cited by: 62
Authors
Seo, Paul Hongsuck [1]
Nagrani, Arsha [1]
Arnab, Anurag [1]
Schmid, Cordelia [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
DOI
10.1109/CVPR52688.2022.01743
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective: we generate future utterances given the present multimodal context, and also the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption directly from raw pixels and transcribed speech. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.
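As a rough illustration of the bidirectional generation objective described in the abstract, the sketch below builds forward (present to future) and backward (future to present) text pairs from a sequence of transcribed utterances. It is not the authors' implementation: the names `GenerationPair` and `build_bidirectional_pairs` are hypothetical, and the sketch covers only the text side, because the abstract does not specify how video frames are paired with each direction.

```python
from typing import List, NamedTuple


class GenerationPair(NamedTuple):
    """One text-only training pair (hypothetical structure, for illustration)."""
    direction: str    # "forward" or "backward"
    input_text: str   # utterance given to the model as context
    target_text: str  # utterance the decoder would be trained to generate


def build_bidirectional_pairs(utterances: List[str]) -> List[GenerationPair]:
    """Pair each utterance with its successor in both directions.

    Assumption: `utterances` is a time-ordered list of transcribed speech
    segments from one unlabelled video. Video frames are omitted here.
    """
    pairs: List[GenerationPair] = []
    for present, future in zip(utterances, utterances[1:]):
        # Forward generation: predict the future utterance from the present one.
        pairs.append(GenerationPair("forward", present, future))
        # Backward generation: predict the present utterance from the future one.
        pairs.append(GenerationPair("backward", future, present))
    return pairs


if __name__ == "__main__":
    demo = ["crack two eggs", "whisk them well", "pour into the pan"]
    for pair in build_bidirectional_pairs(demo):
        print(pair.direction, "|", pair.input_text, "->", pair.target_text)
```

In a full training setup, each pair would presumably be combined with the corresponding video frames and both directions would contribute to the loss; those details are left out because they are not stated in the abstract.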
Pages: 17938-17947
Number of pages: 10
Related papers
  • [1] End-to-End Video Captioning
    Olivastri, Silvio
    Singh, Gurkirt
    Cuzzolin, Fabio
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1474 - 1482
  • [2] End-to-End Dense Video Captioning with Masked Transformer
    Zhou, Luowei
    Zhou, Yingbo
    Corso, Jason J.
    Socher, Richard
    Xiong, Caiming
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8739 - 8748
  • [3] End-to-End Dense Video Captioning with Parallel Decoding
    Wang, Teng
    Zhang, Ruimao
    Lu, Zhichao
    Zheng, Feng
    Cheng, Ran
    Luo, Ping
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 6827 - 6837
  • [4] End-to-End Video Captioning with Multitask Reinforcement Learning
    Li, Lijun
    Gong, Boqing
    [J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 339 - 348
  • [5] Multimodal Pretraining for Dense Video Captioning
    Huang, Gabriel
    Pang, Bo
    Zhu, Zhenhai
    Rivera, Clara
    Soricut, Radu
    [J]. 1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 470 - 490
  • [6] SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning
    Lin, Kevin
    Li, Linjie
    Lin, Chung-Ching
    Ahmed, Faisal
    Gan, Zhe
    Liu, Zicheng
    Lu, Yumao
    Wang, Lijuan
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17928 - 17937
  • [7] An end-to-end generative framework for video segmentation and recognition
    Kuehne, Hilde
    Gall, Juergen
    Serre, Thomas
    [J]. 2016 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2016), 2016,
  • [8] A Generative Appearance Model for End-to-end Video Object Segmentation
    Johnander, Joakim
    Danelljan, Martin
    Brissman, Emil
    Khan, Fahad Shahbaz
    Felsberg, Michael
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 8945 - 8954
  • [9] End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning
    Ran, Yuting
    Fang, Bin
    Chen, Lei
    Wei, Xuekai
    Xian, Weizhi
    Zhou, Mingliang
    [J]. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (04)
  • [10] End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
    Yu, Youngjae
    Ko, Hyungjin
    Choi, Jongwook
    Kim, Gunhee
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3261 - 3269