Video Captioning with Tube Features

Cited by: 0
Authors
Zhao, Bin [1 ,2 ]
Li, Xuelong [3 ]
Lu, Xiaoqiang [3 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Northwestern Polytech Univ, Ctr OPT IMagery Anal & Learning OPTIMAL, Xian 710072, Peoples R China
[3] Chinese Acad Sci, Xian Inst Opt & Precis Mech, Xian 710119, Peoples R China
Keywords: (none)
DOI: Not available
CLC number: TP18 [Theory of Artificial Intelligence]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Visual features play an important role in video captioning. Since video content is largely composed of the activities of salient objects, the caption quality of current approaches is restricted because they focus only on global frame features while paying little attention to salient objects. To tackle this problem, we design an object-aware feature for video captioning, denoted the tube feature. First, Faster R-CNN is employed to extract object regions in frames, and a tube generation method is developed to connect regions from different frames that belong to the same object. After that, an encoder-decoder architecture is constructed for caption generation. Specifically, the encoder is a bi-directional LSTM that captures the dynamic information of each tube, and the decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most relevant tubes when generating the caption. We evaluate our approach on two benchmark datasets, MSVD and Charades, and the experimental results demonstrate the effectiveness of tube features for video captioning.
Pages: 1177-1183 (7 pages)
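
The abstract outlines two technical steps: linking per-frame Faster R-CNN detections into object tubes, and attending over tube features during decoding. The sketch below illustrates only the first step with a simple greedy IoU-based linker. The function name link_tubes, the IoU threshold of 0.5, and the rule of never linking across frame gaps are illustrative assumptions, not the paper's published tube generation method.

```python
# Illustrative sketch of IoU-based tube generation (not the paper's exact method).
import numpy as np


def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)


def link_tubes(detections_per_frame, iou_thresh=0.5):
    """Greedily link per-frame detections into object tubes.

    detections_per_frame: one list of [x1, y1, x2, y2] boxes per frame.
    Returns a list of tubes; each tube is a list of (frame_idx, box).
    """
    tubes = []
    for t, boxes in enumerate(detections_per_frame):
        unmatched = list(range(len(boxes)))
        for tube in tubes:
            last_t, last_box = tube[-1]
            if last_t != t - 1:
                continue  # tube ended earlier; assumed rule: no linking across gaps
            # extend the tube with the detection of highest IoU against its tail
            best, best_iou = None, iou_thresh
            for i in unmatched:
                score = iou(last_box, boxes[i])
                if score > best_iou:
                    best, best_iou = i, score
            if best is not None:
                tube.append((t, boxes[best]))
                unmatched.remove(best)
        # detections not linked to any existing tube start new tubes
        for i in unmatched:
            tubes.append([(t, boxes[i])])
    return tubes


if __name__ == "__main__":
    # Toy example: one object drifting slowly, another appearing intermittently.
    frames = [
        [[10, 10, 50, 50], [100, 100, 160, 160]],
        [[12, 11, 52, 51]],
        [[14, 12, 54, 52], [101, 99, 161, 159]],
    ]
    for tube in link_tubes(frames):
        print(tube)
```

In the full model, the per-region CNN features along each tube would then be fed to the bi-directional LSTM encoder, and the decoder's attention weights would be computed over the resulting tube representations when generating each word.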