Multimodal Pretraining for Dense Video Captioning

Cited by: 0

Authors:

Huang, Gabriel ^{[1,2,3]}

Pang, Bo ^{[3]}

Zhu, Zhenhai ^{[3]}

Rivera, Clara ^{[3]}

Soricut, Radu ^{[3]}

Affiliations:

[1] Mila, Montreal, PQ, Canada

[2] Univ Montreal, Montreal, PQ, Canada

[3] Google Res, Mountain View, CA, USA

Source:

1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020) | 2020

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.


Pages: 470-490

Page count: 21
