Multimodal Pretraining for Dense Video Captioning

Cited by: 0

Authors:

Huang, Gabriel ^{[1,2,3]}

Pang, Bo ^{[3]}

Zhu, Zhenhai ^{[3]}

Rivera, Clara ^{[3]}

Soricut, Radu ^{[3]}

Affiliations:

[1] Mila, Montreal, PQ, Canada

[2] Univ Montreal, Montreal, PQ, Canada

[3] Google Res, Mountain View, CA USA

Source:

1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020) | 2020

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.

Pages: 470-490

Number of pages: 21

