End-to-End Video Captioning

Cited by: 12
Authors
Olivastri, Silvio [1 ]
Singh, Gurkirt [2 ]
Cuzzolin, Fabio [2 ]
Affiliations
[1] AI Labs, Bologna, Italy
[2] Oxford Brookes Univ, Oxford, England
DOI
10.1109/ICCVW.2019.00185
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this task using an encoder-decoder strategy. The (video) encoder is traditionally a Convolutional Neural Network (CNN), while the decoding (for language generation) is done using a Recurrent Neural Network (RNN). Current state-of-the-art methods, however, train encoder and decoder separately. CNNs are pretrained on object and/or action recognition tasks and used to encode video-level features. The decoder is then optimised on such static features to generate the video's description. This disjoint setup is arguably sub-optimal for input (video) to output (description) mapping. In this work, we propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders; then, in a fine-tuning stage, the entire network is trained end-to-end to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process. We evaluate our End-to-End (EtENet) Networks on the Microsoft Research Video Description (MSVD) and the MSR Video to Text (MSR-VTT) benchmark datasets, showing how EtENet achieves state-of-the-art performance across the board.
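The abstract describes a CNN encoder feeding a soft-attention LSTM decoder, trained in two stages: a decoder-only warm-up on frozen pre-trained features, then end-to-end fine-tuning. Below is a minimal PyTorch sketch of that setup, assuming torchvision's GoogLeNet as the encoder; the class names, layer sizes, and training-loop details here are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the encoder-decoder pipeline described in the abstract.
# Assumes PyTorch + torchvision; dimensions and names are illustrative,
# not the paper's exact configuration.
import torch
import torch.nn as nn
import torchvision.models as models


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over per-frame features."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, T, feat_dim); hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)        # attention weights over frames
        context = (alpha * feats).sum(dim=1)   # (B, feat_dim) weighted summary
        return context


class EtENetSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        cnn = models.googlenet(weights="IMAGENET1K_V1")  # pre-trained encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.feat_dim = 1024  # GoogLeNet global-pool output size
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = SoftAttention(self.feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + self.feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, H, W); captions: (B, L) token ids
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1).view(B, T, -1)
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        logits = []
        for t in range(captions.size(1) - 1):  # teacher forcing
            context = self.attention(feats, h)
            step_in = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)      # (B, L-1, vocab_size)


model = EtENetSketch(vocab_size=10000)

# Stage 1: keep the pre-trained encoder frozen; train the decoder alone.
for p in model.encoder.parameters():
    p.requires_grad = False
# ... optimise with cross-entropy on (frames, captions) batches ...

# Stage 2: unfreeze the encoder and fine-tune the whole network end-to-end.
for p in model.encoder.parameters():
    p.requires_grad = True
```

Freezing the encoder first mirrors the two-stage recipe in the abstract: the decoder adapts to the fixed pre-trained features before gradients are allowed to reshape the encoder for caption generation.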
Pages: 1474 - 1482
Page count: 9
Related Papers
50 records in total
  • [1] End-to-End Dense Video Captioning with Masked Transformer
    Zhou, Luowei
    Zhou, Yingbo
    Corso, Jason J.
    Socher, Richard
    Xiong, Caiming
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8739 - 8748
  • [2] End-to-End Dense Video Captioning with Parallel Decoding
    Wang, Teng
    Zhang, Ruimao
    Lu, Zhichao
    Zheng, Feng
    Cheng, Ran
    Luo, Ping
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 6827 - 6837
  • [3] End-to-end Generative Pretraining for Multimodal Video Captioning
    Seo, Paul Hongsuck
    Nagrani, Arsha
    Arnab, Anurag
    Schmid, Cordelia
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17938 - 17947
  • [4] End-to-End Video Captioning with Multitask Reinforcement Learning
    Li, Lijun
    Gong, Boqing
    [J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 339 - 348
  • [5] SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning
    Lin, Kevin
    Li, Linjie
    Lin, Chung-Ching
    Ahmed, Faisal
    Gan, Zhe
    Liu, Zicheng
    Lu, Yumao
    Wang, Lijuan
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17928 - 17937
  • [6] End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning
    Ran, Yuting
    Fang, Bin
    Chen, Lei
    Wei, Xuekai
    Xian, Weizhi
    Zhou, Mingliang
    [J]. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (04)
  • [7] An End-to-End Deep Learning Approach for Video Captioning Through Mobile Devices
    Pezzuto Damaceno, Rafael J.
    Cesar, Roberto M., Jr.
    [J]. PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2023, PT I, 2024, 14469 : 715 - 729
  • [8] End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
    Yu, Youngjae
    Ko, Hyungjin
    Choi, Jongwook
    Kim, Gunhee
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3261 - 3269
  • [9] Video Caption Based Searching Using End-to-End Dense Captioning and Sentence Embeddings
    Aggarwal, Akshay
    Chauhan, Aniruddha
    Kumar, Deepika
    Mittal, Mamta
    Roy, Sudipta
    Kim, Tai-hoon
[J]. SYMMETRY-BASEL, 2020, 12 (06)
  • [10] End-to-End Video Captioning Based on Multiview Semantic Alignment for Human-Machine Fusion
    Wu, Shuai
    Gao, Yubing
    Yang, Weidong
    Li, Hongkai
    Zhu, Guangyu
    [J]. IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2024, : 1 - 9