Early Embedding and Late Reranking for Video Captioning

Cited by: 48

Authors:

Dong, Jianfeng [1]

Li, Xirong [2]

Lan, Weiyu [2]

Huo, Yujia [2]

Snoek, Cees G. M. [3]

Affiliations:

[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China

[2] Renmin Univ China, Key Lab Data Engn & Knowledge Engn, Beijing, Peoples R China

[3] Univ Amsterdam, Intelligent Syst Lab Amsterdam, Amsterdam, Netherlands

Source:

MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE | 2016

Keywords:

Video captioning; MSR Video to Language Challenge; Tag embedding; Sentence reranking

DOI:

10.1145/2964284.2984064

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to the LSTM with tag embeddings. The other is late reranking, which re-scores generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules adds a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the blind test run by the organizers. Our system is ranked 4th in terms of overall performance, while scoring the best CIDEr-D, which measures the human-likeness of generated captions.

Pages: 1082-1086

Page count: 5
