Understanding temporal structure for video captioning

Cited by: 9
Authors
Sah, Shagan [1 ]
Nguyen, Thang [2 ]
Ptucha, Ray [2 ]
Affiliations
[1] Rochester Inst Technol, Chester F Carlson Ctr Imaging Sci, 54 Lomb Mem Dr, Rochester, NY 14623 USA
[2] Rochester Inst Technol, Comp Engn Dept, Machine Intelligence Lab, 09-3441,83 Lomb Mem Dr, Rochester, NY 14623 USA
Keywords
Video captioning; Deep learning; Attention models; Hierarchical neural networks; Language
DOI
10.1007/s10044-018-00770-3
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent research in convolutional and recurrent neural networks has fueled remarkable advances in video understanding. We propose a video captioning framework that achieves the performance and quality necessary for deployment in distributed surveillance systems. Our method combines an efficient hierarchical architecture with novel attention mechanisms at both the local and global levels. By shifting focus to different spatiotemporal locations, attention mechanisms correlate sequential outputs with activation maps, offering an adaptive way to combine multiple frames and locations of a video. Because soft attention mixing weights are learned via back-propagation, the number of weights, and hence of input frames, must be known in advance. To remove this restriction, our video understanding framework employs continuous attention mechanisms over a family of Gaussian distributions. Our efficient multistream hierarchical model combines a recurrent architecture with a soft hierarchy layer using both equally spaced and dynamically localized boundary cuts. In contrast to costly volumetric attention approaches, we use video attributes to steer temporal attention. Our fully learnable end-to-end approach predicts salient temporal regions of actions and objects in the video. We demonstrate state-of-the-art captioning results on the popular MSVD, MSR-VTT and M-VAD video datasets and compare several variants of the algorithm suitable for real-time applications. By adjusting the frame rate, we show that a single computer can generate effective video captions for 100 simultaneous cameras. We additionally present studies showing how bit-rate compression affects captioning results.
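The continuous attention the abstract describes, with mixing weights read off a family of Gaussian distributions over time rather than one learned weight per frame, can be sketched as follows. This is a minimal PyTorch illustration under assumptions of our own, not the authors' implementation: the class name GaussianTemporalAttention, the use of three mixture components, and the sigmoid/exp/softmax parameterizations are all choices made for the example.

import torch
import torch.nn as nn

class GaussianTemporalAttention(nn.Module):
    # The decoder state predicts the parameters of a small mixture of
    # Gaussians over normalized time [0, 1]; per-frame attention weights
    # are read off that mixture. Unlike one back-propagated weight per
    # frame, this accepts any number of frames T, which is the
    # restriction the abstract describes removing.
    def __init__(self, state_dim, num_gaussians=3):
        super().__init__()
        self.num_gaussians = num_gaussians
        # Three values per component: center, log-width, mixture logit.
        self.to_params = nn.Linear(state_dim, 3 * num_gaussians)

    def forward(self, decoder_state, frame_feats):
        # decoder_state: (batch, state_dim); frame_feats: (batch, T, feat_dim)
        batch, T, _ = frame_feats.shape
        p = self.to_params(decoder_state).view(batch, self.num_gaussians, 3)
        mu = torch.sigmoid(p[..., 0:1])        # component centers in [0, 1]
        sigma = torch.exp(p[..., 1:2]) + 1e-3  # positive component widths
        pi = torch.softmax(p[..., 2], dim=-1).unsqueeze(-1)  # mixture weights

        # Normalized frame positions, shape (1, 1, T) for broadcasting.
        t = torch.linspace(0.0, 1.0, T, device=frame_feats.device).view(1, 1, T)

        # Evaluate the mixture at each frame, then normalize over time.
        density = (pi * torch.exp(-0.5 * ((t - mu) / sigma) ** 2)).sum(dim=1)
        alpha = density / density.sum(dim=-1, keepdim=True)  # (batch, T)

        # Attention-weighted average of the frame features.
        return (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)

# Usage: the same module handles clips of any length, so the frame rate
# (and hence T) can be varied per camera stream without changing shapes.
attn = GaussianTemporalAttention(state_dim=512)
context = attn(torch.randn(2, 512), torch.randn(2, 40, 1024))  # (2, 1024)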
Pages: 147-159
Page count: 13
Related Papers
50 items in total
  • [31] MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniques. Chen, Sihan; Zhu, Xinxin; Hao, Dongze; Liu, Wei; Liu, Jiawei; Zhao, Zijia; Guo, Longteng; Liu, Jing. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 4853-4857.
  • [32] Understanding Objects in Video: Object-Oriented Video Captioning via Structured Trajectory and Adversarial Learning. Zhu, Fangyi; Hwang, Jenq-Neng; Ma, Zhanyu; Chen, Guang; Guo, Jun. IEEE Access, 2020, 8: 169146-169159.
  • [33] Video Captioning based on Image Captioning as Subsidiary Content. Vaishnavi, J.; Narmatha, V. 2022 Second International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), 2022.
  • [34] RESTHT: Relation-Enhanced Spatial-Temporal Hierarchical Transformer for Video Captioning. Zheng, Lihuan; Xu, Wanru; Miao, Zhenjiang; Qiu, Xinxiu; Gong, Shanshan. The Visual Computer, 2024.
  • [35] Video Captioning of Future Frames. Hosseinzadeh, Mehrdad; Wang, Yang. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 2021: 979-988.
  • [36] A Review of Video Captioning Methods. Mahajan, Dewarthi; Bhosale, Sakshi; Nighot, Yash; Tayal, Madhuri. International Journal of Next-Generation Computing, 2021, 12(05): 708-715.
  • [37] Temporal Attention Neural Network for Video Understanding. Son, Jegyung; Jang, Gil-Jin; Lee, Minho. Neural Information Processing (ICONIP 2017), Part II, 2017, 10635: 422-430.
  • [38] Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. Aafaq, Nayyer; Akhtar, Naveed; Liu, Wei; Gilani, Syed Zulqarnain; Mian, Ajmal. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 12479-12488.
  • [39] Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning. Li, Shun; Zhang, Ze-Fan; Ji, Yi; Li, Ying; Liu, Chun-Ping. 2022 International Joint Conference on Neural Networks (IJCNN), 2022.
  • [40] Multirate Multimodal Video Captioning. Yang, Ziwei; Xu, Youjiang; Wang, Huiyun; Wang, Bo; Han, Yahong. Proceedings of the 2017 ACM Multimedia Conference (MM'17), 2017: 1877-1882.