Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

Cited by: 158
Authors
Aafaq, Nayyer [1]
Akhtar, Naveed [1]
Liu, Wei [1]
Gilani, Syed Zulqarnain [1]
Mian, Ajmal [1]
Affiliation
[1] Univ Western Australia, Comp Sci & Software Engn, Nedlands, WA, Australia
DOI
10.1109/CVPR.2019.01277
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state of the art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE(L) metrics.
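As an illustration of the temporal encoding step mentioned in the abstract, the following is a minimal sketch of applying a Fourier transform hierarchically to per-frame CNN features. It is not the authors' implementation. The function name `hierarchical_fourier_encoding`, the number of hierarchy levels, the number of retained coefficients, and the use of coefficient magnitudes are all assumptions made for this example. The object-detector semantics and the GRU language model are not covered.

```python
import numpy as np


def hierarchical_fourier_encoding(features, levels=3, num_coeffs=4):
    """Encode temporal dynamics of per-frame features (hypothetical helper).

    features: array of shape (num_frames, feature_dim), one row per frame.
    levels, num_coeffs: illustrative defaults, assumed for this sketch.
    Returns a 1-D vector of length feature_dim * num_coeffs * (2**levels - 1).
    """
    features = np.asarray(features, dtype=np.float64)
    num_frames = features.shape[0]
    if num_frames < 2 ** (levels - 1):
        raise ValueError("too few frames for the requested number of levels")

    encoded = []
    for level in range(levels):
        # Level 0 is the whole video, level 1 its two halves, level 2 its quarters.
        for segment in np.array_split(features, 2 ** level, axis=0):
            # Zero-pad short segments so at least num_coeffs frequency bins exist.
            n = max(segment.shape[0], num_coeffs)
            spectrum = np.fft.fft(segment, n=n, axis=0)
            # Keep the magnitudes of the first num_coeffs coefficients per feature.
            encoded.append(np.abs(spectrum[:num_coeffs]).ravel())
    return np.concatenate(encoded)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_features = rng.normal(size=(40, 8))  # stand-in for CNN activations
    print(hierarchical_fourier_encoding(frame_features).shape)
```

The output length is independent of the number of frames, which is what allows videos of different durations to map to a fixed-size vector before projection to a compact space.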
Pages: 12479-12488
Number of pages: 10
Related papers
50 in total
  • [21] A spatio-temporal network for video semantic segmentation in surgical videos
    Maria Grammatikopoulou
    Ricardo Sanchez-Matilla
    Felix Bragman
    David Owen
    Lucy Culshaw
    Karen Kerr
    Danail Stoyanov
    Imanol Luengo
    International Journal of Computer Assisted Radiology and Surgery, 2024, 19 : 375 - 382
  • [22] A spatio-temporal network for video semantic segmentation in surgical videos
    Grammatikopoulou, Maria
    Sanchez-Matilla, Ricardo
    Bragman, Felix
    Owen, David
    Culshaw, Lucy
    Kerr, Karen
    Stoyanov, Danail
    Luengo, Imanol
    INTERNATIONAL JOURNAL OF COMPUTER ASSISTED RADIOLOGY AND SURGERY, 2023, 19 (2) : 375 - 382
  • [23] Video Captioning with Visual and Semantic Features
    Lee, Sujin
    Kim, Incheol
    JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2018, 14 (06) : 1318 - 1330
  • [24] A spatio-temporal network for video semantic segmentation in surgical videos
    Grammatikopoulou, Maria
    Sanchez-Matilla, Ricardo
    Bragman, Felix
    Owen, David
    Culshaw, Lucy
    Kerr, Karen
    Stoyanov, Danail
    Luengo, Imanol
    arXiv, 2023
  • [25] Learning Deep Spatio-Temporal Dependence for Semantic Video Segmentation
    Qiu, Zhaofan
    Yao, Ting
    Mei, Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (04) : 939 - 949
  • [26] Spatio-Temporal Memory Attention for Image Captioning
    Ji, Junzhong
    Xu, Cheng
    Zhang, Xiaodan
    Wang, Boyue
    Song, Xinhang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 7615 - 7628
  • [27] ST-CLIP: Spatio-Temporal Enhanced CLIP Towards Dense Video Captioning
    Chen, Huimin
    Duan, Pengfei
    Huang, Mingru
    Guo, Jingyi
    Xiong, Shengwu
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XI, ICIC 2024, 2024, 14872 : 396 - 407
  • [28] Visual mining and spatio-temporal querying in molecular dynamics
    Sourina, O
    Korolev, N
    JOURNAL OF COMPUTATIONAL AND THEORETICAL NANOSCIENCE, 2005, 2 (04) : 492 - 498
  • [29] VIMOS: A video mosaic for spatio-temporal representation of visual information
    Candan, KS
    Golshani, F
    Panchanathan, S
    Park, YC
    1998 IEEE SOUTHWEST SYMPOSIUM ON IMAGE ANALYSIS AND INTERPRETATION, 1998 : 6 - 11
  • [30] Survey on visual rhythms: A spatio-temporal representation for video sequences
    Roberto e Souza, Marcos
    Maia, Helena de Almeida
    Vieira, Marcelo Bernardes
    Pedrini, Helio
    NEUROCOMPUTING, 2020, 402 : 409 - 422