Unifying Event Detection and Captioning as Sequence Generation via Pre-training

Cited by: 6
Authors
Zhang, Qi [1]
Song, Yuqing [1]
Jin, Qin [1]
Affiliation
[1] Renmin Univ China, Sch Informat, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Dense video captioning; Pre-training; Sequence generation;
DOI
10.1007/978-3-031-20059-5_21
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense video captioning aims to generate text descriptions for a series of events in an untrimmed video. It can be divided into two sub-tasks: event detection and event captioning. Unlike previous works that tackle the two sub-tasks separately, recent works have focused on enhancing the inter-task association between them. However, designing inter-task interactions for event detection and captioning is not trivial due to the large differences in their task-specific solutions. Besides, previous event detection methods normally ignore temporal dependencies between events, leading to event redundancy or inconsistency problems. To tackle the above two defects, in this paper we define event detection as a sequence generation task and propose a unified pre-training and fine-tuning framework to naturally enhance the inter-task association between event detection and captioning. Since the model predicts each event with previous events as context, the inter-dependency between events is fully exploited, and thus our model can detect more diverse and consistent events in the video. Experiments on the ActivityNet dataset show that our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data. Code is available at https://github.com/QiQAng/UEDVC.
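To make the idea of "event detection as a sequence generation task" concrete, the following is a minimal sketch of one way event boundaries could be serialized into a token sequence that a decoder predicts step by step, each event conditioned on the earlier ones. It is an illustration only, not the authors' implementation: the quantization scheme, the constant `NUM_BINS`, the special tokens `BOS` and `EOS`, and all function names are assumptions introduced here.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_sec, end_sec)

# Hypothetical settings, chosen only for this illustration.
NUM_BINS = 100                       # timestamps map to integer bins 0..NUM_BINS
BOS, EOS = NUM_BINS + 1, NUM_BINS + 2  # special token ids outside the bin range


def quantize(t: float, duration: float) -> int:
    """Map a timestamp in [0, duration] to an integer bin in [0, NUM_BINS]."""
    ratio = min(max(t / duration, 0.0), 1.0)  # clamp to the video extent
    return round(ratio * NUM_BINS)


def events_to_sequence(events: List[Event], duration: float) -> List[int]:
    """Serialize events, ordered by start time, as BOS s1 e1 s2 e2 ... EOS."""
    tokens = [BOS]
    for start, end in sorted(events):
        tokens += [quantize(start, duration), quantize(end, duration)]
    tokens.append(EOS)
    return tokens


def sequence_to_events(tokens: List[int], duration: float) -> List[Event]:
    """Invert the serialization, up to quantization error."""
    body = [t for t in tokens if t not in (BOS, EOS)]
    return [
        (body[i] / NUM_BINS * duration, body[i + 1] / NUM_BINS * duration)
        for i in range(0, len(body) - 1, 2)
    ]


if __name__ == "__main__":
    seq = events_to_sequence([(30.0, 60.0), (0.0, 12.0)], duration=120.0)
    print(seq)
    print(sequence_to_events(seq, duration=120.0))
```

Because the events are ordered and emitted one after another, a generative model trained on such sequences sees all previously predicted events as context when predicting the next one, which is the property the abstract credits for reducing redundant or inconsistent detections.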
Pages: 363-379
Page count: 17