Video Captioning with Tube Features

Cited by: 0
Authors
Zhao, Bin [1 ,2 ]
Li, Xuelong [3 ]
Lu, Xiaoqiang [3 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Northwestern Polytech Univ, Ctr OPT IMagery Anal & Learning OPTIMAL, Xian 710072, Peoples R China
[3] Chinese Acad Sci, Xian Inst Opt & Precis Mech, Xian 710119, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual features play an important role in video captioning. Since video content consists mainly of the activities of salient objects, current approaches that focus only on global frame features, while paying little attention to the salient objects, produce captions of limited quality. To tackle this problem, we design an object-aware feature for video captioning, denoted the tube feature. First, Faster R-CNN is employed to extract object regions in each frame, and a tube generation method is developed to connect regions from different frames that belong to the same object. An encoder-decoder architecture is then constructed for caption generation. Specifically, the encoder is a bi-directional LSTM, which captures the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets, MSVD and Charades, and the experimental results demonstrate the effectiveness of tube features for video captioning.
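The tube generation step described above can be illustrated with a minimal sketch: per-frame detection boxes (e.g. from Faster R-CNN) are linked across consecutive frames by greedy IoU matching, so that regions belonging to the same object form one tube. The box format, IoU threshold, and greedy linking rule here are assumptions for illustration; the paper's exact linking method may differ.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_tubes(detections, iou_thresh=0.5):
    """Link per-frame boxes into tubes (hypothetical greedy scheme).

    detections: list over frames; each frame is a list of boxes.
    Returns a list of tubes; each tube is a list of (frame_idx, box).
    """
    tubes = []
    active = []  # tubes still extendable from the previous frame
    for t, boxes in enumerate(detections):
        unmatched = list(range(len(boxes)))
        next_active = []
        for tube in active:
            _, last_box = tube[-1]
            # extend with the unmatched box of highest IoU, if above threshold
            best, best_iou = None, iou_thresh
            for j in unmatched:
                s = iou(last_box, boxes[j])
                if s >= best_iou:
                    best, best_iou = j, s
            if best is not None:
                tube.append((t, boxes[best]))
                unmatched.remove(best)
                next_active.append(tube)
        for j in unmatched:  # unmatched detections start new tubes
            tube = [(t, boxes[j])]
            tubes.append(tube)
            next_active.append(tube)
        active = next_active
    return tubes
```

Each resulting tube is a per-object sequence of regions; in the paper's pipeline, the region features of each tube would then be fed to the bi-directional LSTM encoder, and the attention-equipped decoder would weight the tubes at each word-generation step.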
Pages: 1177-1183
Page count: 7
Related papers
50 records total
  • [21] Bilingual video captioning model for enhanced video retrieval
    Alrebdi, Norah
    Al-Shargabi, Amal A.
    [J]. JOURNAL OF BIG DATA, 2024, 11 (01)
  • [22] Watch It Twice: Video Captioning with a Refocused Video Encoder
    Shi, Xiangxi
    Cai, Jianfei
    Joty, Shafiq
    Gu, Jiuxiang
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 818 - 826
  • [23] Incorporating the Graph Representation of Video and Text into Video Captioning
    Lu, Min
    Li, Yuan
    [J]. 2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 396 - 401
  • [24] Video Interactive Captioning with Human Prompts
    Wu, Aming
    Han, Yahong
    Yang, Yi
    [J]. PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 961 - 967
  • [25] Deep multimodal embedding for video captioning
    Jin Young Lee
    [J]. Multimedia Tools and Applications, 2019, 78 : 31793 - 31805
  • [26] Accurate and Fast Compressed Video Captioning
    Shen, Yaojie
    Gu, Xin
    Xu, Kai
    Fan, Heng
    Wen, Longyin
    Zhang, Libo
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15512 - 15521
  • [27] A Deep Structured Model for Video Captioning
    Vinodhini, V.
    Sathiyabhama, B.
    Sankar, S.
    Somula, Ramasubbareddy
    [J]. INTERNATIONAL JOURNAL OF GAMING AND COMPUTER-MEDIATED SIMULATIONS, 2020, 12 (02) : 44 - 56
  • [28] Delving Deeper into the Decoder for Video Captioning
    Chen, Haoran
    Li, Jianmin
    Hu, Xiaolin
    [J]. ECAI 2020: 24TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, 325 : 1079 - 1086
  • [29] Hierarchical Modular Network for Video Captioning
    Ye, Hanhua
    Li, Guorong
    Qi, Yuankai
    Wang, Shuhui
    Huang, Qingming
    Yang, Ming-Hsuan
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17918 - 17927
  • [30] Evaluation metrics for video captioning: A survey
    Inacio, Andrei de Souza
    Lopes, Heitor Silverio
    [J]. MACHINE LEARNING WITH APPLICATIONS, 2023, 13