Watch It Twice: Video Captioning with a Refocused Video Encoder

Cited by: 18
Authors
Shi, Xiangxi [1]
Cai, Jianfei [1,2]
Joty, Shafiq [1]
Gu, Jiuxiang [1]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Monash Univ, Clayton, Vic, Australia
Funding
National Research Foundation, Singapore;
Keywords
video captioning; recurrent video encoding; reinforcement learning; key frame;
DOI
10.1145/3343031.3351060
CLC Number
TP39 [Computer Applications];
Subject Classification Codes
081203; 0835;
Abstract
With the rapid growth of video data and the increasing demand for cross-modal applications such as intelligent video search and assistance for visually impaired people, the video captioning task has recently received considerable attention in the computer vision and natural language processing fields. State-of-the-art video captioning methods focus on encoding temporal information, but they lack effective ways to remove irrelevant temporal information and also neglect spatial details. In particular, the current unidirectional video encoder can be negatively affected by irrelevant temporal information, especially at the beginning and end of a video. In addition, disregarding detailed spatial features may lead to incorrect word choices during decoding. In this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice, guided by a predicted key frame, to avoid the irrelevant temporal information that often occurs at the beginning and end of a video. The novel spatial features represent spatial information from different regions of a video and provide the decoder with more detailed information. Experiments on two benchmark datasets show the superior performance of the proposed method.
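The abstract's core mechanism, encoding the video twice and initializing the second pass from a predicted key frame, can be illustrated with a minimal PyTorch sketch. This is an illustration of the general idea only, not the paper's exact architecture: the module names, the GRU backbone, the dimensions, and the linear key-frame scoring head are assumptions made for this example. In the paper the key-frame predictor is trained with reinforcement learning (a hard argmax over frames is non-differentiable), which is omitted here.

import torch
import torch.nn as nn

class RefocusedVideoEncoder(nn.Module):
    """Sketch of a two-pass ("watch it twice") video encoder."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.first_pass = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Scores every frame; the highest-scoring frame is treated as the key frame.
        self.key_frame_head = nn.Linear(hidden_dim, 1)
        self.second_pass = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim) per-frame CNN features.
        states, _ = self.first_pass(frames)               # (B, T, H)
        scores = self.key_frame_head(states).squeeze(-1)  # (B, T)
        key_idx = scores.argmax(dim=1)                    # predicted key frame per video

        # Second pass: re-encode the whole clip, but start from the hidden
        # state at the key frame, so the encoder is "refocused" on the
        # relevant segment rather than on irrelevant leading/trailing frames.
        batch = torch.arange(frames.size(0))
        h0 = states[batch, key_idx].unsqueeze(0)          # (1, B, H)
        refocused, _ = self.second_pass(frames, h0)
        return refocused, key_idx

# Usage: two videos, 26 frames each, 2048-d features per frame.
encoder = RefocusedVideoEncoder()
out, key = encoder(torch.randn(2, 26, 2048))
print(out.shape, key.shape)  # torch.Size([2, 26, 512]) torch.Size([2])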
Pages: 818-826
Number of pages: 9