Early Embedding and Late Reranking for Video Captioning

Cited by: 48
Authors
Dong, Jianfeng [1 ]
Li, Xirong [2 ]
Lan, Weiyu [2 ]
Huo, Yujia [2 ]
Snoek, Cees G. M. [3 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China
[2] Renmin Univ China, Key Lab Data Engn & Knowledge Engn, Beijing, Peoples R China
[3] Univ Amsterdam, Intelligent Syst Lab Amsterdam, Amsterdam, Netherlands
Keywords
Video captioning; MSR Video to Language Challenge; Tag embedding; Sentence reranking
DOI
10.1145/2964284.2984064
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to the LSTM with tag embeddings. The other is late reranking, which re-scores generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules adds a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the organizers' blind test. Our system is ranked 4th in terms of overall performance, while scoring the best CIDEr-D, which measures the human-likeness of generated captions.
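The two modules named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the concatenation step, the fixed tag vocabulary, and the additive scoring rule with a `weight` parameter are all assumptions made for illustration. The sketch only shows the general idea of appending tag information to the visual input (early embedding) and re-scoring candidate sentences by how well they match the tags predicted for a video (late reranking).

```python
from typing import Dict, List, Sequence, Tuple


def early_embedding(visual_feature: Sequence[float],
                    tag_scores: Dict[str, float],
                    tag_vocabulary: Sequence[str]) -> List[float]:
    """Append a tag vector to the visual feature (assumed: plain concatenation)."""
    # One entry per vocabulary tag; tags that were not predicted get 0.0.
    tag_vector = [tag_scores.get(tag, 0.0) for tag in tag_vocabulary]
    return list(visual_feature) + tag_vector


def late_rerank(candidates: Sequence[Tuple[str, float]],
                tag_scores: Dict[str, float],
                weight: float = 1.0) -> List[Tuple[str, float]]:
    """Re-score (sentence, model_score) pairs by tag relevance (assumed rule)."""
    rescored = []
    for sentence, model_score in candidates:
        words = set(sentence.lower().split())
        # Relevance: summed confidence of the predicted tags found in the sentence.
        relevance = sum(score for tag, score in tag_scores.items() if tag in words)
        rescored.append((sentence, model_score + weight * relevance))
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical predicted tags and candidate captions for one video.
tags = {"dog": 0.9, "grass": 0.6, "man": 0.1}
lstm_input = early_embedding([0.2, 0.7], tags, ["dog", "man", "cat"])
ranked = late_rerank(
    [("a man is talking", -1.0), ("a dog is running on grass", -1.5)],
    tags,
)
print(lstm_input)
print(ranked[0][0])
```

In this toy example the second caption starts with the lower model score but mentions two confidently predicted tags, so it moves to the top after reranking.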
Pages: 1082-1086
Number of pages: 5
Related papers
50 in total
  • [1] Deep multimodal embedding for video captioning
    Jin Young Lee
    [J]. Multimedia Tools and Applications, 2019, 78 : 31793 - 31805
  • [2] Deep multimodal embedding for video captioning
    Lee, Jin Young
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (22) : 31793 - 31805
  • [3] Position embedding fusion on transformer for dense video captioning
    Yang, Sixuan
    Tang, Pengjie
    Wang, Hanli
    Li, Qinyu
    [J]. DEVELOPMENTS OF ARTIFICIAL INTELLIGENCE TECHNOLOGIES IN COMPUTATION AND ROBOTICS, 2020, 12 : 792 - 799
  • [4] Early and Late Combinations of Criteria for Reranking Distributional Thesauri
    Ferret, Olivier
    [J]. PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL) AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (IJCNLP), VOL 2, 2015, : 470 - 476
  • [5] Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding
    Perez-Martin, Jesus
    Bustos, Benjamin
    Perez, Jorge
    [J]. 2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021, 2021, : 3038 - 3048
  • [6] VIDEO SEARCH RERANKING VIA ONLINE ORDINAL RERANKING
    Yang, Yi-Hsuan
    Hsu, Winston H.
    [J]. 2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, 2008, : 285 - 288
  • [7] Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM
    Shuqin Chen
    Xian Zhong
    Lin Li
    Wenxuan Liu
    Cheng Gu
    Luo Zhong
    [J]. Neural Processing Letters, 2020, 52 : 2353 - 2369
  • [8] Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM
    Chen, Shuqin
    Zhong, Xian
    Li, Lin
    Liu, Wenxuan
    Gu, Cheng
    Zhong, Luo
    [J]. NEURAL PROCESSING LETTERS, 2020, 52 (03) : 2353 - 2369
  • [9] Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning
    Dong, Shanshan
    Niu, Tianzi
    Luo, Xin
    Liu, Wu
    Xu, Xinshun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [10] Dense Video Captioning With Early Linguistic Information Fusion
    Aafaq, Nayyer
    Mian, Ajmal
    Akhtar, Naveed
    Liu, Wei
    Shah, Mubarak
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2309 - 2322