Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

Cited by: 158
Authors:
Aafaq, Nayyer [1]
Akhtar, Naveed [1]
Liu, Wei [1]
Gilani, Syed Zulqarnain [1]
Mian, Ajmal [1]
Affiliations:
[1] Univ Western Australia, Comp Sci & Software Engn, Nedlands, WA, Australia
Keywords:
DOI:
10.1109/CVPR.2019.01277
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state-of-the-art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE-L metrics.
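As a rough illustration of the hierarchical Short Fourier Transform encoding outlined in the abstract, the following is a minimal sketch, not the authors' implementation: the function name hierarchical_fourier_encoding, the number of hierarchy levels, and the number of retained coefficients are assumptions chosen for illustration only.

```python
# Minimal sketch of a hierarchical short Fourier encoding of per-frame CNN
# features: an FFT is taken over the whole clip and over progressively shorter
# temporal segments, keeping a few low-frequency magnitudes per segment.
# Levels and coefficient count are illustrative assumptions, not the paper's
# exact configuration.
import numpy as np

def hierarchical_fourier_encoding(frame_feats, levels=3, keep=4):
    """frame_feats: (T, D) array of CNN features for T frames.
    Returns a fixed-length temporal encoding regardless of T."""
    T, D = frame_feats.shape
    chunks = []
    for level in range(levels):
        num_segments = 2 ** level              # 1, 2, 4, ... segments per level
        for seg in np.array_split(frame_feats, num_segments, axis=0):
            spectrum = np.fft.rfft(seg, axis=0)  # FFT along the temporal axis
            coeffs = np.abs(spectrum[:keep])     # low-frequency magnitudes
            if coeffs.shape[0] < keep:           # pad very short segments
                pad = np.zeros((keep - coeffs.shape[0], D))
                coeffs = np.vstack([coeffs, pad])
            chunks.append(coeffs.flatten())
    return np.concatenate(chunks)

# Example: 60 frames of 2048-d CNN features -> fixed-length clip encoding.
feats = np.random.randn(60, 2048).astype(np.float32)
enc = hierarchical_fourier_encoding(feats)
print(enc.shape)  # (1 + 2 + 4) segments * 4 coefficients * 2048 dims = 57344
```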
Pages: 12479-12488
Page count: 10
Related papers
50 records in total
  • [31] Temporal Attention Feature Encoding for Video Captioning
    Kim, Nayoung
    Ha, Seong Jong
    Kang, Je-Won
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 1279 - 1282
  • [32] Spatio-temporal Sampling for Video
    Shankar, Mohan
    Pitsianis, Nikos P.
    Brady, David
    IMAGE RECONSTRUCTION FROM INCOMPLETE DATA V, 2008, 7076
  • [33] Video2Vec: Learning Semantic Spatio-Temporal Embeddings for Video Representation
    Hu, Sheng-Hung
    Li, Yikang
    Li, Baoxin
    2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 811 - 816
  • [34] Structured Encoding Based on Semantic Disambiguation for Video Captioning
    Sun, Bo
    Tian, Jinyu
    Wu, Yong
    Yu, Lunjun
    Tang, Yuanyan
    COGNITIVE COMPUTATION, 2024, 16 (03) : 1032 - 1048
  • [35] SPATIO-TEMPORAL BINARY VIDEO INPAINTING VIA THRESHOLD DYNAMICS
    Oliver, M.
    Palomares, R. P.
    Ballester, C.
    Haro, G.
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 1822 - 1826
  • [36] Qualitative semantic spatio-temporal reasoning based on description logics for modeling dynamics of spatio-temporal objects in satellite images
    Ghazouani, Fethi
    Farah, Imed Riadh
    Solaiman, Basel
    2018 4TH INTERNATIONAL CONFERENCE ON ADVANCED TECHNOLOGIES FOR SIGNAL AND IMAGE PROCESSING (ATSIP), 2018,
  • [37] Pedestrian Attribute Recognition via Spatio-temporal Relationship Learning for Visual Surveillance
    Liu, Zhenyu
    Li, Da
    Zhang, Xinyu
    Zhang, Zhang
    Zhang, Peng
    Shan, Caifeng
    Han, Jungong
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (06)
  • [38] The Evolution of Meaning: Spatio-temporal Dynamics of Visual Object Recognition
    Clarke, Alex
    Taylor, Kirsten I.
    Tyler, Lorraine K.
    JOURNAL OF COGNITIVE NEUROSCIENCE, 2011, 23 (08) : 1887 - 1899
  • [39] Reduced complexity spatio-temporal scalable motion compensated wavelet video encoding
    Turaga, DS
    van der Schaar, M
    2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL II, PROCEEDINGS, 2003, : 561 - 564
  • [40] Spatio-temporal dynamics of the visual system revealed in binocular rivalry
    Taya, F
    Mogi, K
    NEUROSCIENCE LETTERS, 2005, 381 (1-2) : 63 - 68