End-to-End Video Captioning

Cited by: 12
Authors
Olivastri, Silvio [1 ]
Singh, Gurkirt [2 ]
Cuzzolin, Fabio [2 ]
Affiliations
[1] AI Labs, Bologna, Italy
[2] Oxford Brookes Univ, Oxford, England
DOI
10.1109/ICCVW.2019.00185
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this task using an encoder-decoder strategy. The (video) encoder is traditionally a Convolutional Neural Network (CNN), while the decoding (for language generation) is done using a Recurrent Neural Network (RNN). Current state-of-the-art methods, however, train encoder and decoder separately. CNNs are pretrained on object and/or action recognition tasks and used to encode video-level features. The decoder is then optimised on such static features to generate the video's description. This disjoint setup is arguably sub-optimal for input (video) to output (description) mapping. In this work, we propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders; then, in a fine-tuning stage, the entire network is trained end-to-end to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process. We evaluate our End-to-End (EtENet) Networks on the Microsoft Research Video Description (MSVD) and the MSR Video to Text (MSR-VTT) benchmark datasets, showing how EtENet achieves state-of-the-art performance across the board.
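The abstract describes a CNN encoder feeding a soft-attention LSTM decoder, trained in two stages: a decoder-only warm-up on frozen pre-trained features, then end-to-end fine-tuning. Below is a minimal PyTorch sketch of that setup, assuming torchvision's GoogLeNet as the encoder; the class names, layer sizes, and training-loop details here are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the encoder-decoder pipeline described in the abstract.
# Assumes PyTorch + torchvision; dimensions and names are illustrative,
# not the paper's exact configuration.
import torch
import torch.nn as nn
import torchvision.models as models


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over per-frame features."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, T, feat_dim); hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)        # attention weights over frames
        context = (alpha * feats).sum(dim=1)   # (B, feat_dim) weighted summary
        return context


class EtENetSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        cnn = models.googlenet(weights="IMAGENET1K_V1")  # pre-trained encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.feat_dim = 1024  # GoogLeNet global-pool output size
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = SoftAttention(self.feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + self.feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, H, W); captions: (B, L) token ids
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1).view(B, T, -1)
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        logits = []
        for t in range(captions.size(1) - 1):  # teacher forcing
            context = self.attention(feats, h)
            step_in = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)      # (B, L-1, vocab_size)


model = EtENetSketch(vocab_size=10000)

# Stage 1: keep the pre-trained encoder frozen; train the decoder alone.
for p in model.encoder.parameters():
    p.requires_grad = False
# ... optimise with cross-entropy on (frames, captions) batches ...

# Stage 2: unfreeze the encoder and fine-tune the whole network end-to-end.
for p in model.encoder.parameters():
    p.requires_grad = True
```

Freezing the encoder first mirrors the two-stage recipe in the abstract: the decoder adapts to the fixed pre-trained features before gradients are allowed to reshape the encoder for caption generation.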
Pages: 1474 - 1482
Page count: 9
Related Papers
50 records in total
  • [1] End-to-End Dense Video Captioning with Masked Transformer
    Zhou, Luowei
    Zhou, Yingbo
    Corso, Jason J.
    Socher, Richard
    Xiong, Caiming
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8739 - 8748
  • [2] End-to-End Dense Video Captioning with Parallel Decoding
    Wang, Teng
    Zhang, Ruimao
    Lu, Zhichao
    Zheng, Feng
    Cheng, Ran
    Luo, Ping
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 6827 - 6837
  • [3] End-to-end Generative Pretraining for Multimodal Video Captioning
    Seo, Paul Hongsuck
    Nagrani, Arsha
    Arnab, Anurag
    Schmid, Cordelia
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17938 - 17947
  • [4] End-to-End Video Captioning with Multitask Reinforcement Learning
    Li, Lijun
    Gong, Boqing
    [J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 339 - 348
  • [5] SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning
    Lin, Kevin
    Li, Linjie
    Lin, Chung-Ching
    Ahmed, Faisal
    Gan, Zhe
    Liu, Zicheng
    Lu, Yumao
    Wang, Lijuan
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17928 - 17937
  • [6] End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning
    Ran, Yuting
    Fang, Bin
    Chen, Lei
    Wei, Xuekai
    Xian, Weizhi
    Zhou, Mingliang
    [J]. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (04)
  • [7] An End-to-End Deep Learning Approach for Video Captioning Through Mobile Devices
    Pezzuto Damaceno, Rafael J.
    Cesar, Roberto M., Jr.
    [J]. PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2023, PT I, 2024, 14469 : 715 - 729
  • [8] End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
    Yu, Youngjae
    Ko, Hyungjin
    Choi, Jongwook
    Kim, Gunhee
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3261 - 3269
  • [9] Video Caption Based Searching Using End-to-End Dense Captioning and Sentence Embeddings
    Aggarwal, Akshay
    Chauhan, Aniruddha
    Kumar, Deepika
    Mittal, Mamta
    Roy, Sudipta
    Kim, Tai-hoon
[J]. SYMMETRY-BASEL, 2020, 12 (06)
  • [10] End-to-End Video Captioning Based on Multiview Semantic Alignment for Human-Machine Fusion
    Wu, Shuai
    Gao, Yubing
    Yang, Weidong
    Li, Hongkai
    Zhu, Guangyu
    [J]. IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2024, : 1 - 9