Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding

Cited by: 27

Authors:

Perez-Martin, Jesus [1]

Bustos, Benjamin [1]

Perez, Jorge [1]

Affiliations:

[1] Univ Chile, Dept Comp Sci, Santiago, Chile

Source:

2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021 | 2021

Keywords:

TEXT;

DOI:

10.1109/WACV48630.2021.00308

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video captioning is the task of predicting a semantically and syntactically correct sequence of words given a context video. The most successful methods for video captioning depend strongly on the effectiveness of semantic representations learned from visual models, but often produce syntactically incorrect sentences, which harms their performance on standard datasets. In this paper, we address this limitation by considering syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping a visual representation, which depends only on the video, and a syntactic representation, which depends only on the Part-of-Speech (POS) tagging structure of the video description, into a common vector space. We integrate this joint representation into an encoder-decoder architecture that we call Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. We tested the proposed architecture and obtained state-of-the-art results on two widely used video captioning datasets: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSRVTT) dataset.
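The idea of a joint visual-syntactic embedding can be illustrated with a minimal sketch. It assumes two linear projections into a shared space and a hinge-based ranking loss over cosine similarities; the abstract does not specify the projection type, the loss, or any dimensions, so all names and values below (`W_visual`, `W_syntactic`, `JOINT_DIM`, `margin`, the random stand-in features) are hypothetical and are not the authors' implementation.

```python
import math
import random


def project(matrix, vector):
    """Apply a linear map: (rows x cols) matrix times a cols-length vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def ranking_loss(visual_embs, syntactic_embs, margin=0.2):
    """Assumed hinge ranking loss: a matched (video, POS) pair should score
    at least `margin` higher than every mismatched pair in the batch."""
    total = 0.0
    n = len(visual_embs)
    for i in range(n):
        matched = cosine(visual_embs[i], syntactic_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Video i against the wrong POS structure j, and vice versa.
            total += max(0.0, margin - matched + cosine(visual_embs[i], syntactic_embs[j]))
            total += max(0.0, margin - matched + cosine(visual_embs[j], syntactic_embs[i]))
    return total / n


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
VISUAL_DIM, SYNTACTIC_DIM, JOINT_DIM = 8, 5, 4  # hypothetical sizes

# Untrained projection weights; in practice these would be learned.
W_visual = random_matrix(JOINT_DIM, VISUAL_DIM, rng)
W_syntactic = random_matrix(JOINT_DIM, SYNTACTIC_DIM, rng)

# Random stand-ins for per-video features and per-caption POS encodings.
video_features = [[rng.random() for _ in range(VISUAL_DIM)] for _ in range(3)]
pos_features = [[rng.random() for _ in range(SYNTACTIC_DIM)] for _ in range(3)]

visual_embs = [project(W_visual, v) for v in video_features]
syntactic_embs = [project(W_syntactic, s) for s in pos_features]

loss = ranking_loss(visual_embs, syntactic_embs)
print(f"ranking loss on untrained projections: {loss:.4f}")
```

Minimizing a loss of this kind would pull each video's embedding toward the embedding of its own description's POS structure and away from the others, which is one common way to realize a "common vector space" for two modalities.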

Citation

Pages: 3038-3048

Number of pages: 11
