Multimodal Pretraining for Dense Video Captioning

Cited by: 0
Authors
Huang, Gabriel [1 ,2 ,3 ]
Pang, Bo [3 ]
Zhu, Zhenhai [3 ]
Rivera, Clara [3 ]
Soricut, Radu [3 ]
Affiliations
[1] Mila, Montreal, PQ, Canada
[2] Univ Montreal, Montreal, PQ, Canada
[3] Google Res, Mountain View, CA USA
Keywords: (none listed)
DOI: (none available)
CLC classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.
Pages: 470-490 (21 pages)