Video Captioning with Visual and Semantic Features

Cited by: 5

Authors:

Lee, Sujin [1]

Kim, Incheol [2]

Affiliations:

[1] Kyonggi Univ, Dept Comp Sci, Grad Sch, Suwon, South Korea

[2] Kyonggi Univ, Dept Comp Sci, Suwon, South Korea

Source:

JOURNAL OF INFORMATION PROCESSING SYSTEMS | 2018 / Vol. 14 / No. 06

Keywords:

Attention-Based Caption Generation; Deep Neural Networks; Semantic Feature; Video Captioning;

DOI:

10.3745/JIPS.02.0098

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Video captioning refers to the process of extracting features from a video and generating captions for the video using the extracted features. This paper introduces a deep neural network model and its learning method for effective video captioning. The model uses not only visual features but also semantic features that effectively express the video. The visual features are extracted using convolutional neural networks, such as C3D and ResNet, while the semantic features are extracted using a semantic feature extraction network proposed in this paper. Further, an attention-based caption generation network is proposed for effective generation of video captions using the extracted features. The performance and effectiveness of the proposed model are verified through various experiments using two large-scale video benchmarks, the Microsoft Video Description (MSVD) and the Microsoft Research Video-To-Text (MSR-VTT).

Citation

Pages: 1318 - 1330

Number of pages: 13

50 items in total

[31] Video captioning with stacked attention and semantic hard pull
Rahman, Md Mushfiqur
Abedin, Thasin
Prottoy, Khondokar S. S.
Moshruba, Ayana
Siddiqui, Fazlul Hasan
PEERJ COMPUTER SCIENCE, 2021, 7 : 1 - 18
[32] Semantic Tag Augmented XlanV Model for Video Captioning
Huang, Yiqing
Xue, Hongwei
Chen, Jiansheng
Ma, Huimin
Ma, Hongbing
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021 : 4818 - 4822
[33] Visual to Text: Survey of Image and Video Captioning
Li, Sheng
Tao, Zhiqiang
Li, Kang
Fu, Yun
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2019, 3 (04) : 297 - 312
[34] Learning deep spatiotemporal features for video captioning
Daskalakis, Eleftherios
Tzelepi, Maria
Tefas, Anastasios
PATTERN RECOGNITION LETTERS, 2018, 116 : 143 - 149
[35] Integrated mining of visual features, speech features, and frequent patterns for semantic video annotation
Tseng, Vincent S.
Su, Ja-Hwung
Huang, Jhih-Hong
Chen, Chih-Jen
IEEE TRANSACTIONS ON MULTIMEDIA, 2008, 10 (02) : 260 - 267
[36] Image Captioning With Visual-Semantic Double Attention
He, Chen
Hu, Haifeng
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (01)
[37] Aligned visual semantic scene graph for image captioning
Zhao, Shanshan
Li, Lixiang
Peng, Haipeng
DISPLAYS, 2022, 74
[38] Semantic analysis based on fusion of audio/visual features for soccer video
Wang, Zengkai
PROCEEDINGS OF THE 10TH INTERNATIONAL CONFERENCE OF INFORMATION AND COMMUNICATION TECHNOLOGY, 2021, 183 : 563 - 571
[39] Combining caption and visual features for semantic event classification of baseball video
Lie, WN
Shia, SH
2005 IEEE International Conference on Multimedia and Expo (ICME), Vols 1 and 2, 2005 : 1255 - 1258
[40] When Visual Object-Context Features Meet Generic and Specific Semantic Priors in Image Captioning
Liu, Heng
Tian, Chunna
Jiang, Mengmeng
TENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING (ICGIP 2018), 2019, 11069