Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding

Cited by: 27
Authors
Perez-Martin, Jesus [1]
Bustos, Benjamin [1]
Perez, Jorge [1]
Affiliations
[1] Univ Chile, Dept Comp Sci, Santiago, Chile
Keywords
TEXT
DOI
10.1109/WACV48630.2021.00308
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video captioning is the task of predicting a semantically and syntactically correct sequence of words given a context video. The most successful methods for video captioning depend strongly on the effectiveness of semantic representations learned from visual models, but they often produce syntactically incorrect sentences, which harms their performance on standard datasets. In this paper, we address this limitation by considering syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation that depends only on the video and a syntactic representation that depends only on the Part-of-Speech (POS) tagging structures of the video description. We integrate this joint representation into an encoder-decoder architecture that we call Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. We tested the proposed architecture and obtained state-of-the-art results on two widely used video captioning datasets: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSRVTT) dataset.
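The idea of mapping a visual representation and a POS-based syntactic representation into a common vector space can be illustrated with a small sketch. This is not the authors' implementation: the linear projections, the feature dimensions, the random inputs, and the max-margin ranking loss are all assumptions chosen for illustration, and the names `embed`, `ranking_loss`, `W_visual`, and `W_syntactic` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def embed(features, weight):
    # Hypothetical linear projection into the joint space, then normalization.
    return l2_normalize(features @ weight)

def ranking_loss(video_emb, pos_emb, margin=0.2):
    # Assumed max-margin loss: each video should be closer to its own
    # POS-structure embedding than to those of other videos in the batch.
    sim = video_emb @ pos_emb.T            # pairwise similarities, shape (B, B)
    matched = np.diag(sim)[:, None]        # similarity of the matching pairs
    cost = np.maximum(0.0, margin + sim - matched)
    np.fill_diagonal(cost, 0.0)            # matching pairs incur no cost
    return cost.mean()

# Illustrative sizes only; random data stands in for real features.
batch, visual_dim, syntactic_dim, joint_dim = 4, 16, 8, 6
visual_feats = rng.normal(size=(batch, visual_dim))        # video-only features
syntactic_feats = rng.normal(size=(batch, syntactic_dim))  # POS-sequence encodings
W_visual = rng.normal(size=(visual_dim, joint_dim))
W_syntactic = rng.normal(size=(syntactic_dim, joint_dim))

video_emb = embed(visual_feats, W_visual)
pos_emb = embed(syntactic_feats, W_syntactic)
loss = ranking_loss(video_emb, pos_emb)
```

In a trained model the two projection matrices would be learned so that the loss decreases; here they are random, so the value of `loss` only shows that the pieces fit together.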
Pages: 3038-3048
Page count: 11
Related papers
50 in total
  • [42] Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM
    Chen, Shuqin
    Zhong, Xian
    Li, Lin
    Liu, Wenxuan
    Gu, Cheng
    Zhong, Luo
    [J]. NEURAL PROCESSING LETTERS, 2020, 52 (03) : 2353 - 2369
  • [43] Rich Visual and Language Representation with Complementary Semantics for Video Captioning
    Tang, Pengjie
    Wang, Hanli
    Li, Qinyu
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (02)
  • [44] Learning to enhance areal video captioning with visual question answering
    Al Mehmadi, Shima M.
    Bazi, Yakoub
    Al Rahhal, Mohamad M.
    Zuair, Mansour
    [J]. INTERNATIONAL JOURNAL OF REMOTE SENSING, 2024, 45 (18) : 6395 - 6407
  • [45] Improving Image Captioning through Visual and Semantic Mutual Promotion
    Zhang, Jing
    Xie, Yingshuai
    Liu, Xiaoqiang
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4716 - 4724
  • [46] Visual versus Textual Embedding for Video Retrieval
    Francis, Danny
    Pidou, Paul
    Merialdo, Bernard
    Huet, Benoit
    [J]. ADVANCED CONCEPTS FOR INTELLIGENT VISION SYSTEMS (ACIVS 2017), 2017, 10617 : 386 - 395
  • [47] Video Captioning via Sentence Augmentation and Spatio-Temporal Attention
    Chen, Tseng-Hung
    Zeng, Kuo-Hao
    Hsu, Wan-Ting
    Sun, Min
    [J]. COMPUTER VISION - ACCV 2016 WORKSHOPS, PT I, 2017, 10116 : 269 - 286
  • [48] Multi-scale features with temporal information guidance for video captioning
    Zhao, Hong
    Chen, Zhiwen
    Yang, Yi
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 137
  • [49] Keyphrase Extraction by Improving TextRank with an Integration of Word Embedding and Syntactic Information
    Zhang, Sheng
    Luo, Qi
    Feng, Yukun
    Ding, Ke
    Gifu, Daniela
    Zhang, Silan
    Ma, Xiaohang
    Xia, Jingbo
    [J]. Recent Advances in Computer Science and Communications, 2021, 14 (09) : 2969 - 2975
  • [50] Spatio-Temporal Ranked-Attention Networks for Video Captioning
    Cherian, Anoop
    Wang, Jue
    Hori, Chiori
    Marks, Tim K.
    [J]. 2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1606 - 1615