Early Embedding and Late Reranking for Video Captioning

Cited by: 48

Authors:

Dong, Jianfeng [1]

Li, Xirong [2]

Lan, Weiyu [2]

Huo, Yujia [2]

Snoek, Cees G. M. [3]

Affiliations:

[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China

[2] Renmin Univ China, Key Lab Data Engn & Knowledge Engn, Beijing, Peoples R China

[3] Univ Amsterdam, Intelligent Syst Lab Amsterdam, Amsterdam, Netherlands

Source:

MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE | 2016

Keywords:

Video captioning; MSR Video to Language Challenge; Tag embedding; Sentence reranking

DOI:

10.1145/2964284.2984064

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to the LSTM with tag embeddings. The other is late reranking, which re-scores generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules adds a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the organizers' blind test. Our system is ranked 4th in terms of overall performance, while scoring the best CIDEr-D, which measures the human-likeness of generated captions.
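
As a rough illustration of the late reranking idea mentioned in the abstract, the sketch below reorders candidate captions by a relevance score with respect to one video. The abstract does not say how relevance is computed, so everything here is an assumption made for illustration: the relevance measure (cosine similarity between a video vector and a sentence vector in some shared space), the function names `cosine` and `rerank`, and the toy vectors. It is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank(
    video_vec: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Sort candidate captions by assumed relevance to the video, best first.

    `candidates` holds (sentence, sentence_vector) pairs. Both vectors are
    assumed (hypothetically) to live in the same embedding space.
    """
    scored = [(sentence, cosine(video_vec, vec)) for sentence, vec in candidates]
    # Higher similarity means the sentence is treated as more relevant.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy data: hypothetical 2-D embeddings, not taken from the paper.
video = [1.0, 0.0]
captions = [
    ("a dog is running", [0.0, 1.0]),
    ("a man is playing guitar", [1.0, 0.0]),
]
ranked = rerank(video, captions)
```

In this toy setup the caption whose vector points in the same direction as the video vector ends up first. In practice the generator's own sentence score could be combined with such a relevance score; how to weight the two is left open here.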


Pages: 1082-1086

Page count: 5
