Understanding temporal structure for video captioning

Cited by: 9
Authors
Sah, Shagan [1 ]
Nguyen, Thang [2 ]
Ptucha, Ray [2 ]
Affiliations
[1] Rochester Inst Technol, Chester F Carlson Ctr Imaging Sci, 54 Lomb Mem Dr, Rochester, NY 14623 USA
[2] Rochester Inst Technol, Comp Engn Dept, Machine Intelligence Lab, 09-3441,83 Lomb Mem Dr, Rochester, NY 14623 USA
Keywords
Video captioning; Deep learning; Attention models; Hierarchical neural networks; Language
DOI
10.1007/s10044-018-00770-3
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent research in convolutional and recurrent neural networks has fueled remarkable advances in video understanding. We propose a video captioning framework that achieves the performance and quality necessary for deployment in distributed surveillance systems. Our method combines an efficient hierarchical architecture with novel attention mechanisms at both the local and global levels. By shifting focus to different spatiotemporal locations, attention mechanisms correlate sequential outputs with activation maps, offering an adaptive way to combine multiple frames and locations of a video. Because soft attention mixing weights are learned via back-propagation, the number of weights, and hence of input frames, must be known in advance. To remove this restriction, our video understanding framework employs continuous attention mechanisms over a family of Gaussian distributions. Our efficient multistream hierarchical model combines a recurrent architecture with a soft hierarchy layer using both equally spaced and dynamically localized boundary cuts. In contrast to costly volumetric attention approaches, we use video attributes to steer temporal attention. Our fully learnable end-to-end approach predicts salient temporal regions of actions and objects in the video. We demonstrate state-of-the-art captioning results on the popular MSVD, MSR-VTT and M-VAD video datasets and compare several variants of the algorithm suitable for real-time applications. By adjusting the frame rate, we show that a single computer can generate effective video captions for 100 simultaneous cameras. We additionally present studies showing how bit-rate compression affects captioning results.
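The continuous attention the abstract describes, with mixing weights read off a family of Gaussian distributions over time rather than one learned weight per frame, can be sketched as follows. This is a minimal PyTorch illustration under assumptions of our own, not the authors' implementation: the class name GaussianTemporalAttention, the use of three mixture components, and the sigmoid/exp/softmax parameterizations are all choices made for the example.

import torch
import torch.nn as nn

class GaussianTemporalAttention(nn.Module):
    # The decoder state predicts the parameters of a small mixture of
    # Gaussians over normalized time [0, 1]; per-frame attention weights
    # are read off that mixture. Unlike one back-propagated weight per
    # frame, this accepts any number of frames T, which is the
    # restriction the abstract describes removing.
    def __init__(self, state_dim, num_gaussians=3):
        super().__init__()
        self.num_gaussians = num_gaussians
        # Three values per component: center, log-width, mixture logit.
        self.to_params = nn.Linear(state_dim, 3 * num_gaussians)

    def forward(self, decoder_state, frame_feats):
        # decoder_state: (batch, state_dim); frame_feats: (batch, T, feat_dim)
        batch, T, _ = frame_feats.shape
        p = self.to_params(decoder_state).view(batch, self.num_gaussians, 3)
        mu = torch.sigmoid(p[..., 0:1])        # component centers in [0, 1]
        sigma = torch.exp(p[..., 1:2]) + 1e-3  # positive component widths
        pi = torch.softmax(p[..., 2], dim=-1).unsqueeze(-1)  # mixture weights

        # Normalized frame positions, shape (1, 1, T) for broadcasting.
        t = torch.linspace(0.0, 1.0, T, device=frame_feats.device).view(1, 1, T)

        # Evaluate the mixture at each frame, then normalize over time.
        density = (pi * torch.exp(-0.5 * ((t - mu) / sigma) ** 2)).sum(dim=1)
        alpha = density / density.sum(dim=-1, keepdim=True)  # (batch, T)

        # Attention-weighted average of the frame features.
        return (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)

# Usage: the same module handles clips of any length, so the frame rate
# (and hence T) can be varied per camera stream without changing shapes.
attn = GaussianTemporalAttention(state_dim=512)
context = attn(torch.randn(2, 512), torch.randn(2, 40, 1024))  # (2, 1024)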
Pages: 147-159
Page count: 13
Related Papers
50 items in total
  • [31] MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniques. Chen, Sihan; Zhu, Xinxin; Hao, Dongze; Liu, Wei; Liu, Jiawei; Zhao, Zijia; Guo, Longteng; Liu, Jing. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 4853-4857.
  • [32] Understanding Objects in Video: Object-Oriented Video Captioning via Structured Trajectory and Adversarial Learning. Zhu, Fangyi; Hwang, Jenq-Neng; Ma, Zhanyu; Chen, Guang; Guo, Jun. IEEE Access, 2020, 8: 169146-169159.
  • [33] Video Captioning based on Image Captioning as Subsidiary Content. Vaishnavi, J.; Narmatha, V. 2022 Second International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), 2022.
  • [34] RESTHT: Relation-Enhanced Spatial-Temporal Hierarchical Transformer for Video Captioning. Zheng, Lihuan; Xu, Wanru; Miao, Zhenjiang; Qiu, Xinxiu; Gong, Shanshan. The Visual Computer, 2024.
  • [35] Video Captioning of Future Frames. Hosseinzadeh, Mehrdad; Wang, Yang. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 2021: 979-988.
  • [36] A Review of Video Captioning Methods. Mahajan, Dewarthi; Bhosale, Sakshi; Nighot, Yash; Tayal, Madhuri. International Journal of Next-Generation Computing, 2021, 12(05): 708-715.
  • [37] Temporal Attention Neural Network for Video Understanding. Son, Jegyung; Jang, Gil-Jin; Lee, Minho. Neural Information Processing (ICONIP 2017), Part II, 2017, 10635: 422-430.
  • [38] Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. Aafaq, Nayyer; Akhtar, Naveed; Liu, Wei; Gilani, Syed Zulqarnain; Mian, Ajmal. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 12479-12488.
  • [39] Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning. Li, Shun; Zhang, Ze-Fan; Ji, Yi; Li, Ying; Liu, Chun-Ping. 2022 International Joint Conference on Neural Networks (IJCNN), 2022.
  • [40] Multirate Multimodal Video Captioning. Yang, Ziwei; Xu, Youjiang; Wang, Huiyun; Wang, Bo; Han, Yahong. Proceedings of the 2017 ACM Multimedia Conference (MM'17), 2017: 1877-1882.