Sequence in sequence for video captioning

Cited by: 10
|
Authors
Wang, Huiyun [1 ]
Gao, Chongyang [1 ]
Han, Yahong [1 ]
Affiliation
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin 300350, Peoples R China
Keywords
Video captioning; Encoding; Decoding; Spatio-temporal representation;
DOI
10.1016/j.patrec.2018.07.024
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
For video captioning, the words in the caption are closely related to an overall understanding of the video. Thus, a suitable representation of the video is rather important for the description. To generate more precise words in the video captioning task, we aim to encode the video feature for the current word at each time-stamp of the generation process. This paper proposes a new 'Sequence in Sequence' framework that encodes the sequential frames into a spatio-temporal representation at each time-stamp to utter a word, and further distills the most related visual content with an extra semantic loss. First, we aggregate the sequential frames to extract related visual content guided by the last word, obtaining a representation with rich spatio-temporal information. Then, to decode the aggregated representation into a precise word, we leverage a two-layer GRU structure, where the first layer further distills useful visual content based on an extra semantic loss and the second layer selects the correct word according to the distilled features. Experiments on two benchmark datasets demonstrate that our method outperforms the current state-of-the-art methods on the Bleu@4, METEOR and CIDEr metrics. (C) 2018 Elsevier B.V. All rights reserved.
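The abstract outlines the decoding stage but gives no code. As a rough illustration only, the following PyTorch sketch shows how a two-layer GRU decoder with word-guided frame aggregation and an auxiliary semantic (concept) head could be wired up; all module names, dimensions, the attention form, and the concept head are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class SinSDecoder(nn.Module):
    # Minimal sketch of the decoding stage described in the abstract:
    # frame features are aggregated under guidance of the previous word,
    # a first GRU layer distills visual content (supervised by an extra
    # semantic loss), and a second GRU layer predicts the next word.
    # All names and dimensions below are assumptions for illustration.
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, num_concepts=300):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)             # word-guided frame attention
        self.gru1 = nn.GRUCell(feat_dim + hidden_dim, hidden_dim)   # distills visual content
        self.gru2 = nn.GRUCell(hidden_dim, hidden_dim)              # selects the word
        self.concept_head = nn.Linear(hidden_dim, num_concepts)     # head for the semantic loss
        self.word_head = nn.Linear(hidden_dim, vocab_size)          # head for the captioning loss

    def step(self, frame_feats, prev_word, h1, h2):
        # frame_feats: (num_frames, feat_dim); prev_word: scalar LongTensor
        w = self.word_embed(prev_word)                                        # (hidden_dim,)
        guided = torch.cat([frame_feats, w.expand(frame_feats.size(0), -1)], dim=1)
        alpha = torch.softmax(self.attn(guided), dim=0)                       # (num_frames, 1)
        ctx = (alpha * frame_feats).sum(dim=0)                                # aggregated spatio-temporal feature
        h1 = self.gru1(torch.cat([ctx, w]).unsqueeze(0), h1)                  # (1, hidden_dim)
        h2 = self.gru2(h1, h2)
        return self.word_head(h2), self.concept_head(h1), h1, h2

In such a setup, training would combine the usual word-level cross-entropy on the word logits with the extra semantic loss on the concept logits (for example, a multi-label loss against visual concept labels); the exact loss form used in the paper is not reproduced here.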
Pages: 327-334
Number of pages: 8
Related Papers
50 records in total
  • [1] Convolutional Reconstruction-to-Sequence for Video Captioning
    Wu, Aming
    Han, Yahong
    Yang, Yi
    Hu, Qinghua
    Wu, Fei
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (11) : 4299 - 4308
  • [2] MAPS: Joint Multimodal Attention and POS Sequence Generation for Video Captioning
    Zou, Cong
    Wang, Xuchen
    Hu, Yaosi
    Chen, Zhenzhong
    Liu, Shan
    [J]. 2021 INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2021,
  • [3] Multimodal Deep Neural Network with Image Sequence Features for Video Captioning
    Oura, Soichiro
    Matsukawa, Tetsu
    Suzuki, Einoshin
    [J]. 2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [4] Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network
    Wang, Bairui
    Ma, Lin
    Zhang, Wei
    Jiang, Wenhao
    Wang, Jingwen
    Liu, Wei
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 2641 - 2650
  • [5] Sequence to Sequence - Video to Text
    Venugopalan, Subhashini
    Rohrbach, Marcus
    Donahue, Jeff
    Mooney, Raymond
    Darrell, Trevor
    Saenko, Kate
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4534 - 4542
  • [6] Video sequence matching
    Mohan, R
    [J]. PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 3697 - 3700
  • [7] Self-critical Sequence Training for Image Captioning
    Rennie, Steven J.
    Marcheret, Etienne
    Mroueh, Youssef
    Ross, Jerret
    Goel, Vaibhava
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 1179 - 1195
  • [8] Sequence-to-Sequence Video Prediction by Learning Hierarchical Representations
    Fan, Kun
    Joung, Chungin
    Baek, Seungjun
    [J]. APPLIED SCIENCES-BASEL, 2020, 10 (22): : 1 - 14
  • [9] TRIPLE SEQUENCE GENERATIVE ADVERSARIAL NETS FOR UNSUPERVISED IMAGE CAPTIONING
    Zhou, Yucheng
    Tao, Wei
    Zhang, Wenqiang
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7598 - 7602
  • [10] Enhancing Video Sequence in Video Analytics Systems
    Golovin, O. M.
    [J]. CYBERNETICS AND SYSTEMS ANALYSIS, 2024, 60 (03) : 496 - 510