Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding

Cited by: 27
Authors
Perez-Martin, Jesus [1 ]
Bustos, Benjamin [1 ]
Perez, Jorge [1 ]
Affiliations
[1] Univ Chile, Dept Comp Sci, Santiago, Chile
Keywords
TEXT
DOI
10.1109/WACV48630.2021.00308
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Video captioning is the task of predicting a semantically and syntactically correct sequence of words given a context video. The most successful methods for video captioning depend heavily on the effectiveness of semantic representations learned from visual models, yet they often produce syntactically incorrect sentences, which harms their performance on standard datasets. In this paper, we address this limitation by treating syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation, which depends only on the video, and a syntactic representation, which depends only on the Part-of-Speech (POS) tagging structure of the video description. We integrate this joint representation into an encoder-decoder architecture that we call the Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (the text-generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. Our proposed architecture obtains state-of-the-art results on two widely used video captioning datasets: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSRVTT) dataset.
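The abstract sketches the core mechanism: a visual representation and a POS-based syntactic representation are projected into a common vector space and aligned. The paper's actual SemSynAN code is not reproduced here; the PyTorch sketch below only illustrates the general idea of such a joint embedding, and every module name, dimension, and the in-batch triplet ranking loss are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSyntacticEmbedding(nn.Module):
    # Hypothetical joint embedding; all dimensions are illustrative defaults.
    def __init__(self, visual_dim=2048, pos_vocab=50, hidden=512, joint_dim=300):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, joint_dim)  # video feature -> joint space
        self.pos_embed = nn.Embedding(pos_vocab, hidden)     # POS tag ids -> vectors
        self.pos_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.pos_proj = nn.Linear(hidden, joint_dim)         # POS summary -> joint space

    def forward(self, video_feats, pos_tags):
        # video_feats: (B, visual_dim); pos_tags: (B, T) integer POS ids
        v = F.normalize(self.visual_proj(video_feats), dim=-1)
        _, h = self.pos_rnn(self.pos_embed(pos_tags))        # h: (1, B, hidden)
        s = F.normalize(self.pos_proj(h.squeeze(0)), dim=-1)
        return v, s                                          # both (B, joint_dim)

def ranking_loss(v, s, margin=0.2):
    # Hinge-based triplet ranking loss with in-batch negatives, a common
    # choice for cross-modal embeddings (an assumption here, not the paper's loss).
    scores = v @ s.t()                        # cosine similarity matrix (B, B)
    pos = scores.diag().unsqueeze(1)          # similarities of matched pairs
    cost_s = (margin + scores - pos).clamp(min=0)      # video vs. wrong syntax
    cost_v = (margin + scores - pos.t()).clamp(min=0)  # syntax vs. wrong video
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()

# Toy usage: 8 videos paired with 12-tag POS sequences of their captions.
model = VisualSyntacticEmbedding()
v, s = model(torch.randn(8, 2048), torch.randint(0, 50, (8, 12)))
loss = ranking_loss(v, s)
```

At caption-generation time, such an embedding would let the decoder attend to a syntax-aware video representation rather than to visual features alone, which is the alignment role the abstract attributes to SemSynAN.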
Pages: 3038 - 3048
Page count: 11
Related Papers (50 total)
  • [1] Visual-Syntactic Text Format: Improving Adolescent Literacy
    Tate, Tamara P.
    Collins, Penelope
    Xu, Ying
    Yau, Joanna C.
    Krishnan, Jenell
    Prado, Yenda
    Farkas, George
    Warschauer, Mark
    [J]. SCIENTIFIC STUDIES OF READING, 2019, 23 (04) : 287 - 304
  • [2] Scaffolding learning of language structures with visual-syntactic text formatting
    Park, Youngmin
    Xu, Ying
    Collins, Penelope
    Farkas, George
    Warschauer, Mark
    [J]. BRITISH JOURNAL OF EDUCATIONAL TECHNOLOGY, 2019, 50 (04) : 1896 - 1912
  • [3] Deep multimodal embedding for video captioning
    Lee, Jin Young
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (22) : 31793 - 31805
  • [4] Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning
    Dong, Shanshan
    Niu, Tianzi
    Luo, Xin
    Liu, Wu
    Xu, Xinshun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [5] Early Embedding and Late Reranking for Video Captioning
    Dong, Jianfeng
    Li, Xirong
    Lan, Weiyu
    Huo, Yujia
    Snoek, Cees G. M.
    [J]. MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE, 2016, : 1082 - 1086
  • [6] Visual-Syntactic Text Formatting: Developing EFL Learners' Reading Fluency Components
    Gao, Wei
    Namaziandost, Ehsan
    Abdulaal, Mohammad Awad Al-Dawoody
    [J]. JOURNAL OF PSYCHOLINGUISTIC RESEARCH, 2022, 51 (04) : 707 - 727
  • [7] Understanding temporal structure for video captioning
    Sah, Shagan
    Nguyen, Thang
    Ptucha, Ray
    [J]. PATTERN ANALYSIS AND APPLICATIONS, 2020, 23 (01) : 147 - 159