Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding

Cited by: 27

Authors:

Perez-Martin, Jesus [1]

Bustos, Benjamin [1]

Perez, Jorge [1]

Affiliations:

[1] Univ Chile, Dept Comp Sci, Santiago, Chile

Source:

2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021 | 2021

Keywords:

TEXT;

DOI:

10.1109/WACV48630.2021.00308

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video captioning is the task of predicting a semantically and syntactically correct sequence of words given some context video. The most successful methods for video captioning depend strongly on the effectiveness of semantic representations learned from visual models, but they often produce syntactically incorrect sentences, which harms their performance on standard datasets. In this paper, we address this limitation by considering syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation that depends only on the video and a syntactic representation that depends only on the Part-of-Speech (POS) tagging structures of the video description. We integrate this joint representation into an encoder-decoder architecture that we call Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. We tested our proposed architecture and obtained state-of-the-art results on two widely used video captioning datasets: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSRVTT) dataset.
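
The idea of mapping a visual representation and a syntactic representation into a common vector space can be illustrated with a small NumPy sketch. This is not the authors' SemSynAN implementation: the linear projections, the feature sizes, the margin value, and the function names `l2_normalize`, `joint_embed`, and `ranking_loss` are all assumptions chosen for illustration, and a hinge-based ranking loss is used here only as one common way to train such a joint space.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def joint_embed(visual, syntactic, w_visual, w_syntactic):
    """Project both modalities into one space; return cosine similarities.

    sim[i, j] compares video i with POS-structure j.
    The linear maps are a hypothetical stand-in for learned encoders.
    """
    v = l2_normalize(visual @ w_visual)
    s = l2_normalize(syntactic @ w_syntactic)
    return v @ s.T


def ranking_loss(sim, margin=0.2):
    """Hinge loss: each matching pair (diagonal) should beat every
    mismatched pair in its row by at least `margin` (assumed value)."""
    positives = np.diag(sim)[:, None]
    cost = np.maximum(0.0, margin + sim - positives)
    np.fill_diagonal(cost, 0.0)  # a pair is not penalized against itself
    return cost.sum() / sim.shape[0]


# Toy data: 4 videos with 16-d visual features and 8-d POS features.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 16))
syntactic = rng.normal(size=(4, 8))
w_visual = rng.normal(size=(16, 5))
w_syntactic = rng.normal(size=(8, 5))

sim = joint_embed(visual, syntactic, w_visual, w_syntactic)
loss = ranking_loss(sim)
```

In a trained model the two projection matrices would be learned so that each video lies close to the POS structure of its own description; here they are random, so the loss value itself carries no meaning.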

Pages: 3038-3048

Number of pages: 11
