Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

Cited by: 158

Authors:

Aafaq, Nayyer [1]

Akhtar, Naveed [1]

Liu, Wei [1]

Gilani, Syed Zulqarnain [1]

Mian, Ajmal [1]

Affiliations:

[1] Univ Western Australia, Comp Sci & Software Engn, Nedlands, WA, Australia

Source:

2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) | 2019

DOI:

10.1109/CVPR.2019.01277

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state of the art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE(L) metrics.

Pages: 12479-12488

Page count: 10
