Understanding temporal structure for video captioning

Cited by: 9
Authors
Sah, Shagan [1 ]
Nguyen, Thang [2 ]
Ptucha, Ray [2 ]
Affiliations
[1] Rochester Inst Technol, Chester F Carlson Ctr Imaging Sci, 54 Lomb Mem Dr, Rochester, NY 14623 USA
[2] Rochester Inst Technol, Comp Engn Dept, Machine Intelligence Lab, 09-3441,83 Lomb Mem Dr, Rochester, NY 14623 USA
Keywords
Video captioning; Deep learning; Attention models; Hierarchical neural networks; Language
DOI
10.1007/s10044-018-00770-3
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recent research in convolutional and recurrent neural networks has fueled incredible advances in video understanding. We propose a video captioning framework that achieves the performance and quality necessary to be deployed in distributed surveillance systems. Our method combines an efficient hierarchical architecture with novel attention mechanisms at both the local and global levels. By shifting focus to different spatiotemporal locations, attention mechanisms correlate sequential outputs with activation maps, offering a clever way to adaptively combine multiple frames and locations of video. Because soft attention mixing weights are learned via back-propagation, the number of weights, and hence the number of input frames, must be known in advance. To remove this restriction, our video understanding framework combines continuous attention mechanisms over a family of Gaussian distributions. Our efficient multistream hierarchical model combines a recurrent architecture with a soft hierarchy layer using both equally spaced and dynamically localized boundary cuts. As opposed to costly volumetric attention approaches, we use video attributes to steer temporal attention. Our fully learnable end-to-end approach helps predict salient temporal regions of action/objects in the video. We demonstrate state-of-the-art captioning results on the popular MSVD, MSR-VTT and M-VAD video datasets and compare several variants of the algorithm suitable for real-time applications. By adjusting the frame rate, we show a single computer can generate effective video captions for 100 simultaneous cameras. We additionally perform studies to show how bit rate compression modifies captioning results.
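The abstract's key mechanism, continuous attention over a family of Gaussian distributions, can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' published code: the class name GaussianTemporalAttention, the num_components parameter, and all tensor shapes are assumptions made for exposition. The idea is that the decoder state predicts the mean, variance, and mixing weight of a few Gaussians over normalized time [0, 1]; evaluating that mixture at each frame's timestamp yields attention weights for any number of frames, so the frame count need not be fixed in advance.

```python
import torch
import torch.nn as nn

class GaussianTemporalAttention(nn.Module):
    """Hypothetical sketch of continuous temporal attention: the decoder
    state parameterizes a small Gaussian mixture over normalized time,
    which is evaluated at each frame's timestamp to produce attention
    weights for any number of frames T. Names and shapes are illustrative
    assumptions, not the paper's implementation."""

    def __init__(self, hidden_dim: int, num_components: int = 4):
        super().__init__()
        self.num_components = num_components
        # Predict (mean, log-variance, mixing logit) for each component.
        self.to_params = nn.Linear(hidden_dim, 3 * num_components)

    def forward(self, decoder_state: torch.Tensor, frame_feats: torch.Tensor):
        # decoder_state: (B, hidden_dim); frame_feats: (B, T, D)
        B, T, _ = frame_feats.shape
        params = self.to_params(decoder_state).view(B, self.num_components, 3)
        mu = torch.sigmoid(params[..., 0])               # means in [0, 1]
        var = torch.exp(params[..., 1]).clamp(min=1e-4)  # positive variances
        mix = torch.softmax(params[..., 2], dim=-1)      # mixture weights

        # Normalized timestamps of the T frames, shape (1, 1, T).
        t = torch.linspace(0.0, 1.0, T, device=frame_feats.device).view(1, 1, T)

        # Unnormalized Gaussian density of each component at each timestamp.
        dens = torch.exp(-0.5 * (t - mu.unsqueeze(-1)) ** 2 / var.unsqueeze(-1))
        attn = (mix.unsqueeze(-1) * dens).sum(dim=1)     # (B, T)
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)

        # Attention-weighted context vector over frame features.
        context = torch.bmm(attn.unsqueeze(1), frame_feats).squeeze(1)  # (B, D)
        return context, attn

# Example: the same module handles clips of different lengths.
attn = GaussianTemporalAttention(hidden_dim=512)
ctx20, _ = attn(torch.randn(2, 512), torch.randn(2, 20, 1024))
ctx60, _ = attn(torch.randn(2, 512), torch.randn(2, 60, 1024))
```

Because the mixture is a continuous function of time, the same module attends over 20 or 200 frames without retraining, which is precisely the restriction the abstract attributes to fixed per-frame soft-attention weights.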
Pages: 147-159
Page count: 13
Related papers
50 papers in total
  • [1] Understanding temporal structure for video captioning
    Shagan Sah
    Thang Nguyen
    Ray Ptucha
    [J]. Pattern Analysis and Applications, 2020, 23: 147-159
  • [2] A Neural ODE and Transformer-based Model for Temporal Understanding and Dense Video Captioning
    Artham, Sainithin
    Shaikh, Soharab Hossain
    [J]. Multimedia Tools and Applications, 2024, 83(23): 64037-64056
  • [3] Exploiting the local temporal information for video captioning
    Wei, Ran
    Mi, Li
    Hu, Yaosi
    Chen, Zhenzhong
    [J]. Journal of Visual Communication and Image Representation, 2020, 67
  • [4] Temporal Attention Feature Encoding for Video Captioning
    Kim, Nayoung
    Ha, Seong Jong
    Kang, Je-Won
    [J]. 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020: 1279-1282
  • [5] Traffic Scenario Understanding and Video Captioning via Guidance Attention Captioning Network
    Liu, Chunsheng
    Zhang, Xiao
    Chang, Faliang
    Li, Shuang
    Hao, Penghui
    Lu, Yansha
    Wang, Yinhai
    [J]. IEEE Transactions on Intelligent Transportation Systems, 2024, 25(05): 3615-3627
  • [6] Context Gating with Short Temporal Information for Video Captioning
    Xu, Jinlei
    Xu, Ting
    Tian, Xin
    Liu, Chunping
    Ji, Yi
    [J]. 2019 International Joint Conference on Neural Networks (IJCNN), 2019
  • [7] Catching the Temporal Regions-of-Interest for Video Captioning
    Yang, Ziwei
    Han, Yahong
    Wang, Zheng
    [J]. Proceedings of the 2017 ACM Multimedia Conference (MM'17), 2017: 146-153
  • [8] Video Captioning with Temporal and Region Graph Convolution Network
    Xiao, Xinlong
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    Gao, Shang
    Fan, Weiguo
    [J]. 2020 IEEE International Conference on Multimedia and Expo (ICME), 2020
  • [9] Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
    Song, Jingkuan
    Gao, Lianli
    Guo, Zhao
    Liu, Wu
    Zhang, Dongxiang
    Shen, Heng Tao
    [J]. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017: 2737-2743
  • [10] A core region captioning framework for automatic video understanding in story video contents
    Suh, Hyesun
    Kim, Jiyeon
    So, Jinsoo
    Jung, Jongjin
    [J]. International Journal of Engineering Business Management, 2022, 14