Sequence in sequence for video captioning

Cited by: 10
|
Authors
Wang, Huiyun [1 ]
Gao, Chongyang [1 ]
Han, Yahong [1 ]
Affiliation
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin 300350, Peoples R China
Keywords
Video captioning; Encoding; Decoding; Spatio-temporal representation;
DOI
10.1016/j.patrec.2018.07.024
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
For video captioning, the words in the caption are closely related to an overall understanding of the video. Thus, a suitable representation of the video is rather important for the description. To generate more precise words in the video captioning task, we aim to encode the video feature for the current word at each time-stamp of the generation process. This paper proposes a new 'Sequence in Sequence' framework that encodes the sequential frames into a spatio-temporal representation at each time-stamp to utter a word, and further distills the most related visual content with an extra semantic loss. First, we aggregate the sequential frames to extract related visual content guided by the last word, obtaining a representation with rich spatio-temporal information. Then, to decode the aggregated representation into a precise word, we leverage a two-layer GRU structure, where the first layer further distills useful visual content based on an extra semantic loss and the second layer selects the correct word according to the distilled features. Experiments on two benchmark datasets demonstrate that our method outperforms the current state-of-the-art methods on the Bleu@4, METEOR and CIDEr metrics. (C) 2018 Elsevier B.V. All rights reserved.
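The abstract outlines the decoding stage but gives no code. As a rough illustration only, the following PyTorch sketch shows how a two-layer GRU decoder with word-guided frame aggregation and an auxiliary semantic (concept) head could be wired up; all module names, dimensions, the attention form, and the concept head are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class SinSDecoder(nn.Module):
    # Minimal sketch of the decoding stage described in the abstract:
    # frame features are aggregated under guidance of the previous word,
    # a first GRU layer distills visual content (supervised by an extra
    # semantic loss), and a second GRU layer predicts the next word.
    # All names and dimensions below are assumptions for illustration.
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, num_concepts=300):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)             # word-guided frame attention
        self.gru1 = nn.GRUCell(feat_dim + hidden_dim, hidden_dim)   # distills visual content
        self.gru2 = nn.GRUCell(hidden_dim, hidden_dim)              # selects the word
        self.concept_head = nn.Linear(hidden_dim, num_concepts)     # head for the semantic loss
        self.word_head = nn.Linear(hidden_dim, vocab_size)          # head for the captioning loss

    def step(self, frame_feats, prev_word, h1, h2):
        # frame_feats: (num_frames, feat_dim); prev_word: scalar LongTensor
        w = self.word_embed(prev_word)                                        # (hidden_dim,)
        guided = torch.cat([frame_feats, w.expand(frame_feats.size(0), -1)], dim=1)
        alpha = torch.softmax(self.attn(guided), dim=0)                       # (num_frames, 1)
        ctx = (alpha * frame_feats).sum(dim=0)                                # aggregated spatio-temporal feature
        h1 = self.gru1(torch.cat([ctx, w]).unsqueeze(0), h1)                  # (1, hidden_dim)
        h2 = self.gru2(h1, h2)
        return self.word_head(h2), self.concept_head(h1), h1, h2

In such a setup, training would combine the usual word-level cross-entropy on the word logits with the extra semantic loss on the concept logits (for example, a multi-label loss against visual concept labels); the exact loss form used in the paper is not reproduced here.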
Pages: 327-334
Number of pages: 8
Related Papers
50 records in total
  • [1] Convolutional Reconstruction-to-Sequence for Video Captioning
    Wu, Aming
    Han, Yahong
    Yang, Yi
    Hu, Qinghua
    Wu, Fei
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (11) : 4299 - 4308
  • [2] MAPS: Joint Multimodal Attention and POS Sequence Generation for Video Captioning
    Zou, Cong
    Wang, Xuchen
    Hu, Yaosi
    Chen, Zhenzhong
    Liu, Shan
    [J]. 2021 INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2021,
  • [3] Multimodal Deep Neural Network with Image Sequence Features for Video Captioning
    Oura, Soichiro
    Matsukawa, Tetsu
    Suzuki, Einoshin
    [J]. 2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [4] Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network
    Wang, Bairui
    Ma, Lin
    Zhang, Wei
    Jiang, Wenhao
    Wang, Jingwen
    Liu, Wei
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 2641 - 2650
  • [5] Sequence to Sequence - Video to Text
    Venugopalan, Subhashini
    Rohrbach, Marcus
    Donahue, Jeff
    Mooney, Raymond
    Darrell, Trevor
    Saenko, Kate
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4534 - 4542
  • [6] Video sequence matching
    Mohan, R
    [J]. PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 3697 - 3700
  • [7] Self-critical Sequence Training for Image Captioning
    Rennie, Steven J.
    Marcheret, Etienne
    Mroueh, Youssef
    Ross, Jerret
    Goel, Vaibhava
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 1179 - 1195
  • [8] Sequence-to-Sequence Video Prediction by Learning Hierarchical Representations
    Fan, Kun
    Joung, Chungin
    Baek, Seungjun
    [J]. APPLIED SCIENCES-BASEL, 2020, 10 (22): : 1 - 14
  • [9] TRIPLE SEQUENCE GENERATIVE ADVERSARIAL NETS FOR UNSUPERVISED IMAGE CAPTIONING
    Zhou, Yucheng
    Tao, Wei
    Zhang, Wenqiang
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7598 - 7602
  • [10] Enhancing Video Sequence in Video Analytics Systems
    Golovin, O. M.
    [J]. CYBERNETICS AND SYSTEMS ANALYSIS, 2024, 60 (03) : 496 - 510