Video Captioning with Tube Features

Cited by: 0
Authors
Zhao, Bin [1 ,2 ]
Li, Xuelong [3 ]
Lu, Xiaoqiang [3 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Northwestern Polytech Univ, Ctr OPT IMagery Anal & Learning OPTIMAL, Xian 710072, Peoples R China
[3] Chinese Acad Sci, Xian Inst Opt & Precis Mech, Xian 710119, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual features play an important role in video captioning. Since video content consists mainly of the activities of salient objects, current approaches that focus only on global frame features, while paying little attention to the salient objects, produce captions of limited quality. To tackle this problem, we design an object-aware feature for video captioning, denoted the tube feature. First, Faster R-CNN is employed to extract object regions in each frame, and a tube generation method is developed to connect regions from different frames that belong to the same object. An encoder-decoder architecture is then constructed for caption generation. Specifically, the encoder is a bi-directional LSTM, which captures the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets, MSVD and Charades, and the experimental results demonstrate the effectiveness of tube features for video captioning.
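The tube generation step described above can be illustrated with a minimal sketch: per-frame detection boxes (e.g. from Faster R-CNN) are linked across consecutive frames by greedy IoU matching, so that regions belonging to the same object form one tube. The box format, IoU threshold, and greedy linking rule here are assumptions for illustration; the paper's exact linking method may differ.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_tubes(detections, iou_thresh=0.5):
    """Link per-frame boxes into tubes (hypothetical greedy scheme).

    detections: list over frames; each frame is a list of boxes.
    Returns a list of tubes; each tube is a list of (frame_idx, box).
    """
    tubes = []
    active = []  # tubes still extendable from the previous frame
    for t, boxes in enumerate(detections):
        unmatched = list(range(len(boxes)))
        next_active = []
        for tube in active:
            _, last_box = tube[-1]
            # extend with the unmatched box of highest IoU, if above threshold
            best, best_iou = None, iou_thresh
            for j in unmatched:
                s = iou(last_box, boxes[j])
                if s >= best_iou:
                    best, best_iou = j, s
            if best is not None:
                tube.append((t, boxes[best]))
                unmatched.remove(best)
                next_active.append(tube)
        for j in unmatched:  # unmatched detections start new tubes
            tube = [(t, boxes[j])]
            tubes.append(tube)
            next_active.append(tube)
        active = next_active
    return tubes
```

Each resulting tube is a per-object sequence of regions; in the paper's pipeline, the region features of each tube would then be fed to the bi-directional LSTM encoder, and the attention-equipped decoder would weight the tubes at each word-generation step.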
Pages: 1177-1183
Page count: 7
Related papers
50 records total
  • [21] Bilingual video captioning model for enhanced video retrieval
    Alrebdi, Norah
    Al-Shargabi, Amal A.
    [J]. JOURNAL OF BIG DATA, 2024, 11 (01)
  • [22] Watch It Twice: Video Captioning with a Refocused Video Encoder
    Shi, Xiangxi
    Cai, Jianfei
    Joty, Shafiq
    Gu, Jiuxiang
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 818 - 826
  • [23] Incorporating the Graph Representation of Video and Text into Video Captioning
    Lu, Min
    Li, Yuan
    [J]. 2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 396 - 401
  • [24] Video Interactive Captioning with Human Prompts
    Wu, Aming
    Han, Yahong
    Yang, Yi
    [J]. PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 961 - 967
  • [25] Deep multimodal embedding for video captioning
    Jin Young Lee
    [J]. Multimedia Tools and Applications, 2019, 78 : 31793 - 31805
  • [26] Accurate and Fast Compressed Video Captioning
    Shen, Yaojie
    Gu, Xin
    Xu, Kai
    Fan, Heng
    Wen, Longyin
    Zhang, Libo
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15512 - 15521
  • [27] A Deep Structured Model for Video Captioning
    Vinodhini, V.
    Sathiyabhama, B.
    Sankar, S.
    Somula, Ramasubbareddy
    [J]. INTERNATIONAL JOURNAL OF GAMING AND COMPUTER-MEDIATED SIMULATIONS, 2020, 12 (02) : 44 - 56
  • [28] Delving Deeper into the Decoder for Video Captioning
    Chen, Haoran
    Li, Jianmin
    Hu, Xiaolin
    [J]. ECAI 2020: 24TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, 325 : 1079 - 1086
  • [29] Hierarchical Modular Network for Video Captioning
    Ye, Hanhua
    Li, Guorong
    Qi, Yuankai
    Wang, Shuhui
    Huang, Qingming
    Yang, Ming-Hsuan
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17918 - 17927
  • [30] Evaluation metrics for video captioning: A survey
    Inacio, Andrei de Souza
    Lopes, Heitor Silverio
    [J]. MACHINE LEARNING WITH APPLICATIONS, 2023, 13