Video Captioning with Tube Features

Cited by: 0
Authors
Zhao, Bin [1 ,2 ]
Li, Xuelong [3 ]
Lu, Xiaoqiang [3 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Northwestern Polytech Univ, Ctr OPT IMagery Anal & Learning OPTIMAL, Xian 710072, Peoples R China
[3] Chinese Acad Sci, Xian Inst Opt & Precis Mech, Xian 710119, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Visual features play an important role in video captioning. Since video content mainly consists of the activities of salient objects, the caption quality of current approaches is restricted by their focus on global frame features, with little attention paid to those salient objects. To tackle this problem, we design an object-aware feature for video captioning, denoted the tube feature. First, Faster-RCNN is employed to extract object regions in frames, and a tube generation method is developed to connect regions from different frames that belong to the same object. An encoder-decoder architecture is then constructed for caption generation. Specifically, the encoder is a bi-directional LSTM that captures the dynamic information of each tube, and the decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most relevant tubes when generating the caption. We evaluate our approach on two benchmark datasets, MSVD and Charades, and the experimental results demonstrate the effectiveness of tube features for video captioning.
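The decoding step described in the abstract, in which the caption decoder adaptively weights the per-tube encodings at each word, can be sketched as follows. This is a minimal NumPy illustration of Bahdanau-style additive attention over tube features, not the authors' implementation; the dimensions, parameter names, and the specific scoring function are assumptions.

```python
import numpy as np

def attend_to_tubes(tube_encodings, decoder_state, W_t, W_s, v):
    """Additive attention over per-tube encodings (hypothetical sketch).

    tube_encodings: (num_tubes, d) -- one vector per object tube, e.g. the
        final state of the bi-directional LSTM run over that tube's regions.
    decoder_state:  (d,) -- current hidden state of the caption decoder LSTM.
    W_t, W_s: (d, d) projection matrices; v: (d,) scoring vector.
    Returns the attention weights and the context vector fed to the decoder.
    """
    # Score each tube against the current decoder state.
    scores = np.tanh(tube_encodings @ W_t + decoder_state @ W_s) @ v  # (num_tubes,)
    # Softmax normalization: a distribution over tubes.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: the attention-weighted sum of tube encodings.
    context = weights @ tube_encodings  # (d,)
    return weights, context

# Toy example with random tube encodings and parameters.
rng = np.random.default_rng(0)
d, num_tubes = 8, 3
weights, context = attend_to_tubes(
    rng.normal(size=(num_tubes, d)), rng.normal(size=d),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d))
```

At each decoding step the context vector would be concatenated with the previous word embedding as input to the decoder LSTM, so tubes most correlated with the next word dominate the input.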
Pages: 1177-1183
Page count: 7