Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

Cited by: 158
Authors
Aafaq, Nayyer [1 ]
Akhtar, Naveed [1 ]
Liu, Wei [1 ]
Gilani, Syed Zulqarnain [1 ]
Mian, Ajmal [1 ]
Affiliations
[1] University of Western Australia, Computer Science & Software Engineering, Nedlands, WA, Australia
DOI: 10.1109/CVPR.2019.01277
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish new state-of-the-art on MSVD and MSR-VTT datasets for METEOR and ROUGE-L metrics.
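The hierarchical Fourier encoding described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the segment hierarchy (whole clip, halves, quarters), the truncation to the first `k` frequency coefficients, and the use of coefficient magnitudes are all illustrative assumptions made here.

```python
import numpy as np

def hierarchical_fourier_encoding(features, levels=3, k=4):
    """Encode a (T, D) sequence of per-frame CNN features into a
    fixed-length vector via a hierarchy of short Fourier transforms.

    Illustrative sketch: at level L the sequence is split into 2**L
    segments; each segment is transformed along time with an FFT and
    described by the magnitudes of its first k frequency coefficients.
    """
    encoding = []
    for level in range(levels):
        for seg in np.array_split(features, 2 ** level, axis=0):
            coeffs = np.fft.rfft(seg, axis=0)   # FFT along the time axis
            mags = np.abs(coeffs[:k])           # keep first k magnitudes
            if mags.shape[0] < k:               # pad very short segments
                mags = np.pad(mags, ((0, k - mags.shape[0]), (0, 0)))
            encoding.append(mags.reshape(-1))
    # Length depends only on levels, k, D -- not on the frame count T,
    # so clips of different durations map to same-sized representations.
    return np.concatenate(encoding)
```

Note the key property motivating such an encoding: the output dimensionality is independent of the number of frames, giving a fixed-size temporal descriptor for the language model.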
Pages: 12479-12488 (10 pages)
Related Papers (50 total)
  • [1] Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning
    Li, Shun
    Zhang, Ze-Fan
    Ji, Yi
    Li, Ying
    Liu, Chun-Ping
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [2] Diverse Video Captioning by Adaptive Spatio-temporal Attention
    Ghaderi, Zohreh
    Salewski, Leonard
    Lensch, Hendrik P. A.
    PATTERN RECOGNITION, DAGM GCPR 2022, 2022, 13485 : 409 - 425
  • [3] Spatio-Temporal Attention Models for Grounded Video Captioning
    Zanfir, Mihai
    Marinoiu, Elisabeta
    Sminchisescu, Cristian
    COMPUTER VISION - ACCV 2016, PT IV, 2017, 10114 : 104 - 119
  • [4] Exploring the Spatio-Temporal Aware Graph for video captioning
    Xue, Ping
    Zhou, Bing
    IET COMPUTER VISION, 2022, 16 (05) : 456 - 467
  • [5] Spatio-temporal Super-resolution Network: Enhance Visual Representations for Video Captioning
    Cao, Quanhui
    Tang, Pengjie
    Wang, Hanli
    2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22), 2022, : 3125 - 3129
  • [6] STSI: Efficiently Mine Spatio-Temporal Semantic Information between Different Multimodal for Video Captioning
    Xiong, Huiyu
    Wang, Lanxiao
    2022 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2022,
  • [7] Video Captioning via Sentence Augmentation and Spatio-Temporal Attention
    Chen, Tseng-Hung
    Zeng, Kuo-Hao
    Hsu, Wan-Ting
    Sun, Min
    COMPUTER VISION - ACCV 2016 WORKSHOPS, PT I, 2017, 10116 : 269 - 286
  • [8] Spatio-Temporal Ranked-Attention Networks for Video Captioning
    Cherian, Anoop
    Wang, Jue
    Hori, Chiori
    Marks, Tim K.
    2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1606 - 1615
  • [9] Video Captioning With Object-Aware Spatio-Temporal Correlation and Aggregation
    Zhang, Junchao
    Peng, Yuxin
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 (29) : 6209 - 6222
  • [10] Semantic spatio-temporal segmentation for extracting video objects
    Mao, JH
    Ma, KK
    IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA COMPUTING AND SYSTEMS, PROCEEDINGS VOL 1, 1999, : 738 - 743