Watch It Twice: Video Captioning with a Refocused Video Encoder

Cited by: 18
|
Authors
Shi, Xiangxi [1 ]
Cai, Jianfei [1 ,2 ]
Joty, Shafiq [1 ]
Gu, Jiuxiang [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Monash Univ, Clayton, Vic, Australia
Funding
National Research Foundation of Singapore;
Keywords
video captioning; recurrent video encoding; reinforcement learning; key frame;
DOI
10.1145/3343031.3351060
CLC (Chinese Library Classification) number
TP39 [Applications of Computers];
Discipline classification code
081203 ; 0835 ;
Abstract
With the rapid growth of video data and the increasing demand for cross-modal applications such as intelligent video search and assistance for visually impaired people, the video captioning task has recently received a lot of attention in the computer vision and natural language processing fields. State-of-the-art video captioning methods focus on encoding temporal information but lack effective ways to remove irrelevant temporal information, and they also neglect spatial details. In particular, the current unidirectional video encoder can be negatively affected by irrelevant temporal information, especially at the beginning and at the end of a video. In addition, disregarding detailed spatial features may lead to incorrect word choices during decoding. In this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice, guided by a predicted key frame, to avoid the irrelevant temporal information that often occurs at the beginning and at the end of a video. The novel spatial features represent spatial information from different regions of a video and provide the decoder with more detailed information. Experiments on two benchmark datasets show the superior performance of the proposed method.
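The abstract's "watch it twice" idea can be illustrated with a minimal, self-contained sketch: a first pass over the frame features selects a key frame, and a second pass re-encodes the video around that key frame so that irrelevant content at the beginning and end is discarded. The function names, the toy salience score, and the averaging "encoder" below are hypothetical stand-ins for the paper's learned recurrent modules, not the authors' actual implementation.

```python
# Hypothetical sketch of two-pass key-frame-guided video encoding.
# The paper uses recurrent networks; here a toy magnitude score and a
# mean-pool stand in for the learned key-frame predictor and encoder.

def first_pass_key_frame(frame_feats):
    """First viewing: score each frame and return the most salient index.
    Salience here is just the feature sum (a toy stand-in)."""
    scores = [sum(f) for f in frame_feats]
    return max(range(len(frame_feats)), key=lambda i: scores[i])

def second_pass_encode(frame_feats, key_idx, window=1):
    """Second viewing: re-encode only frames near the key frame,
    dropping irrelevant frames at the start and end of the video."""
    lo = max(0, key_idx - window)
    hi = min(len(frame_feats), key_idx + window + 1)
    kept = frame_feats[lo:hi]
    dim = len(kept[0])
    # Mean-pool the kept frames as a stand-in for recurrent encoding.
    return [sum(f[d] for f in kept) / len(kept) for d in range(dim)]

# Five frames of 2-d features; the middle frame is the most salient.
feats = [[0.1, 0.0], [0.2, 0.1], [0.9, 0.8], [0.3, 0.2], [0.0, 0.1]]
k = first_pass_key_frame(feats)          # -> 2
video_code = second_pass_encode(feats, k)
print(k, [round(v, 3) for v in video_code])
```

The split into two passes mirrors the abstract's motivation: the key-frame prediction happens only after the whole video has been seen once, so the second encoding can safely ignore the video's boundaries.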
Pages: 818-826
Page count: 9
Related Papers
50 records in total
  • [1] SibNet: Sibling Convolutional Encoder for Video Captioning
    Liu, Sheng
    Ren, Zhou
    Yuan, Junsong
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (09) : 3259 - 3272
  • [2] SibNet: Sibling Convolutional Encoder for Video Captioning
    Liu, Sheng
    Ren, Zhou
    Yuan, Junsong
    [J]. PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 1425 - 1434
  • [3] Refocused Attention: Long Short-Term Rewards Guided Video Captioning
    Dong, Jiarong
    Gao, Ke
    Chen, Xiaokai
    Cao, Juan
    [J]. NEURAL PROCESSING LETTERS, 2020, 52 (02) : 935 - 948
  • [4] Hierarchical Boundary-Aware Neural Encoder for Video Captioning
    Baraldi, Lorenzo
    Grana, Costantino
    Cucchiara, Rita
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3185 - 3194
  • [5] Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
    Pan, Pingbo
    Xu, Zhongwen
    Yang, Yi
    Wu, Fei
    Zhuang, Yueting
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 1029 - 1038
  • [6] Multi-Task Video Captioning with a Stepwise Multimodal Encoder
    Liu, Zihao
    Wu, Xiaoyu
    Yu, Ying
    [J]. ELECTRONICS, 2022, 11 (17)
  • [7] Boundary Detector Encoder and Decoder with Soft Attention for Video Captioning
    Chen, Tangming
    Zhao, Qike
    Song, Jingkuan
    [J]. WEB AND BIG DATA, APWEB-WAIM 2019, 2019, 11809 : 105 - 115
  • [8] Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning
    Gui, Yuling
    Guo, Dan
    Zhao, Ye
    [J]. PROCEEDINGS OF THE 2ND WORKSHOP ON MULTIMEDIA FOR ACCESSIBLE HUMAN COMPUTER INTERFACES (MAHCI '19), 2019, : 25 - 32
  • [9] Retrieval Augmented Convolutional Encoder-decoder Networks for Video Captioning
    Chen, Jingwen
    Pan, Yingwei
    Li, Yehao
    Yao, Ting
    Chao, Hongyang
    Mei, Tao
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (01)