Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding

Cited by: 27
Authors
Perez-Martin, Jesus [1 ]
Bustos, Benjamin [1 ]
Perez, Jorge [1 ]
Affiliations
[1] Univ Chile, Dept Comp Sci, Santiago, Chile
Keywords
TEXT;
DOI
10.1109/WACV48630.2021.00308
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video captioning is the task of predicting a semantically and syntactically correct sequence of words given some context video. The most successful methods for video captioning depend strongly on the effectiveness of semantic representations learned from visual models, but they often produce syntactically incorrect sentences, which harms their performance on standard datasets. In this paper, we address this limitation by considering syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation, which depends only on the video, and a syntactic representation, which depends only on the Part-of-Speech (POS) tagging structure of the video description. We integrate this joint representation into an encoder-decoder architecture that we call the Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text-generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. We tested our proposed architecture, obtaining state-of-the-art results on two widely used video captioning datasets: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSR-VTT) dataset.
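To make the joint-embedding idea in the abstract concrete, here is a minimal PyTorch sketch of mapping a video representation and a POS-tag sequence into a common vector space and aligning matched pairs with an in-batch margin ranking loss. The encoder choices (mean-pooled frame features, a GRU over POS tags), all dimensions, and the specific hinge loss are illustrative assumptions for this sketch, not the authors' exact SemSynAN configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSyntacticEmbedding(nn.Module):
    """Projects a video and a caption's POS-tag sequence into one shared space.
    Illustrative sketch only; sizes and encoders are assumptions."""
    def __init__(self, visual_dim=2048, pos_vocab=50, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)   # visual branch
        self.pos_embed = nn.Embedding(pos_vocab, 128)         # syntactic branch
        self.pos_rnn = nn.GRU(128, embed_dim, batch_first=True)

    def forward(self, frame_feats, pos_tags):
        # frame_feats: (batch, n_frames, visual_dim) per-frame CNN features
        # pos_tags:    (batch, seq_len) integer POS-tag ids of the caption
        v = self.visual_proj(frame_feats.mean(dim=1))  # mean-pool over time
        _, h = self.pos_rnn(self.pos_embed(pos_tags))  # h: (1, batch, embed_dim)
        s = h.squeeze(0)
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(v, dim=-1), F.normalize(s, dim=-1)

def ranking_loss(v, s, margin=0.2):
    """Hinge triplet loss over in-batch negatives: each video is pulled toward
    the syntactic embedding of its own caption and pushed away from the
    embeddings of the other captions in the batch."""
    sim = v @ s.t()                           # (batch, batch) similarities
    pos = sim.diag().unsqueeze(1)             # similarity of matched pairs
    cost = (margin + sim - pos).clamp(min=0)  # margin violations by negatives
    cost.fill_diagonal_(0)                    # ignore the positive pairs
    return cost.mean()

# Usage: embed a toy batch and compute the alignment loss.
model = VisualSyntacticEmbedding()
frames = torch.randn(8, 20, 2048)             # 8 videos, 20 frames each
tags = torch.randint(0, 50, (8, 12))          # 8 POS-tag sequences, length 12
v, s = model(frames, tags)
loss = ranking_loss(v, s)
```

Once trained this way, the visual branch alone can produce a syntax-aware vector for an unseen video, which is the kind of signal the abstract describes using to guide the decoder.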
Pages: 3038 - 3048
Number of pages: 11
Related papers
50 records in total
  • [31] STAT: Spatial-Temporal Attention Mechanism for Video Captioning
    Yan, Chenggang
    Tu, Yunbin
    Wang, Xingzheng
    Zhang, Yongbing
    Hao, Xinhong
    Zhang, Yongdong
    Dai, Qionghai
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (01) : 229 - 241
  • [32] Hierarchical Global-Local Temporal Modeling for Video Captioning
    Hu, Yaosi
    Chen, Zhenzhong
    Zha, Zheng-Jun
    Wu, Feng
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 774 - 783
  • [33] Diverse Video Captioning by Adaptive Spatio-temporal Attention
    Ghaderi, Zohreh
    Salewski, Leonard
    Lensch, Hendrik P. A.
    [J]. PATTERN RECOGNITION, DAGM GCPR 2022, 2022, 13485 : 409 - 425
  • [34] Video Captioning Based on the Spatial-Temporal Saliency Tracing
    Zhou, Yuanen
    Hu, Zhenzhen
    Liu, Xueliang
    Wang, Meng
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 59 - 70
  • [36] Exploiting long-term temporal dynamics for video captioning
    Guo, Yuyu
    Zhang, Jingqiu
    Gao, Lianli
[J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2019, 22 (02) : 735 - 749
  • [37] Exploring the Spatio-Temporal Aware Graph for video captioning
    Xue, Ping
    Zhou, Bing
    [J]. IET COMPUTER VISION, 2022, 16 (05) : 456 - 467
  • [38] Spatio-Temporal Attention Models for Grounded Video Captioning
    Zanfir, Mihai
    Marinoiu, Elisabeta
    Sminchisescu, Cristian
    [J]. COMPUTER VISION - ACCV 2016, PT IV, 2017, 10114 : 104 - 119
  • [39] Fused GRU with semantic-temporal attention for video captioning
    Gao, Lianli
    Wang, Xuanhan
    Song, Jingkuan
    Liu, Yang
[J]. NEUROCOMPUTING, 2020, 395 : 222 - 228
  • [40] Visual Commonsense-Aware Representation Network for Video Captioning
    Zeng, Pengpeng
    Zhang, Haonan
    Gao, Lianli
    Li, Xiangpeng
    Qian, Jin
    Shen, Heng Tao
[J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023 : 1 - 12