Multimodal Pretraining for Dense Video Captioning

Cited by: 0
Authors:
Huang, Gabriel [1 ,2 ,3 ]
Pang, Bo [3 ]
Zhu, Zhenhai [3 ]
Rivera, Clara [3 ]
Soricut, Radu [3 ]
Affiliations:
[1] Mila, Montreal, PQ, Canada
[2] Univ Montreal, Montreal, PQ, Canada
[3] Google Res, Mountain View, CA USA
Keywords: (none listed)
DOI: not available
CLC classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract:
Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.
Pages: 470-490
Page count: 21
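The abstract describes multimodal sequence-to-sequence models that consume both video and caption-like text and generate segment-level captions. As a rough illustration only, the sketch below shows one way such a model could be wired up in PyTorch; the module names, dimensions, and the concatenation-based fusion are assumptions made for this sketch and do not reflect the architecture or pretraining objectives used in the paper.

# Illustrative sketch (PyTorch): a multimodal seq2seq captioner that jointly
# encodes video segment features and ASR-like text tokens, then decodes a
# caption. All names and hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn

class MultimodalCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, video_feat_dim=1024):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # project frame features
        self.text_embed = nn.Embedding(vocab_size, d_model)   # embed ASR/caption tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)          # next-token logits

    def forward(self, video_feats, asr_tokens, caption_tokens):
        # One simple fusion strategy: concatenate projected video features and
        # embedded ASR tokens into a single multimodal encoder sequence.
        src = torch.cat([self.video_proj(video_feats),
                         self.text_embed(asr_tokens)], dim=1)
        tgt = self.text_embed(caption_tokens)
        # Note: real training would also pass a causal mask for the decoder.
        hidden = self.transformer(src, tgt)
        return self.lm_head(hidden)                            # (batch, tgt_len, vocab)

# Toy usage: 2 segments, 16 frame features each, 20 ASR tokens, 12 caption tokens.
model = MultimodalCaptioner()
logits = model(torch.randn(2, 16, 1024),
               torch.randint(0, 10000, (2, 20)),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])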