Video Captioning with Visual and Semantic Features

Cited: 5
Authors
Lee, Sujin [1]
Kim, Incheol [2]
Affiliations
[1] Kyonggi Univ, Dept Comp Sci, Grad Sch, Suwon, South Korea
[2] Kyonggi Univ, Dept Comp Sci, Suwon, South Korea
Keywords
Attention-Based Caption Generation; Deep Neural Networks; Semantic Feature; Video Captioning
DOI
10.3745/JIPS.02.0098
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Video captioning is the task of extracting features from a video and generating captions from those features. This paper introduces a deep neural network model and a learning method for effective video captioning. The model uses not only visual features but also semantic features that effectively express the video content. Visual features are extracted with convolutional neural networks such as C3D and ResNet, while semantic features are extracted with a semantic feature extraction network proposed in this paper. In addition, an attention-based caption generation network is proposed to generate video captions effectively from the extracted features. The performance and effectiveness of the proposed model are verified through experiments on two large-scale video benchmarks, the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSR-VTT) dataset.
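The abstract describes attention-based caption generation over extracted visual features. As a rough illustration of the general idea (not the authors' implementation; the additive-attention form and all names such as `attention_step`, `W_f`, `W_h`, and `w` are assumptions, with toy dimensions in place of real feature sizes), one decoder step that attends over per-frame features might look like:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(h, frame_feats, W_f, W_h, w):
    """One additive-attention step over per-frame visual features.

    h           : decoder hidden state, shape (d_h,)
    frame_feats : per-frame visual features, shape (T, d_f)
                  (e.g. CNN outputs such as ResNet/C3D features)
    Returns attention weights (T,) and a context vector (d_f,).
    """
    scores = np.tanh(frame_feats @ W_f + h @ W_h) @ w  # (T,) relevance scores
    alpha = softmax(scores)                            # normalize to a distribution
    context = alpha @ frame_feats                      # weighted sum of frame features
    return alpha, context

# Toy dimensions and random parameters for illustration only.
rng = np.random.default_rng(0)
T, d_f, d_h, d_a = 5, 8, 6, 4
frame_feats = rng.standard_normal((T, d_f))
h = rng.standard_normal(d_h)
W_f = rng.standard_normal((d_f, d_a))
W_h = rng.standard_normal((d_h, d_a))
w = rng.standard_normal(d_a)

alpha, context = attention_step(h, frame_feats, W_f, W_h, w)
print(np.isclose(alpha.sum(), 1.0))  # attention weights form a distribution
```

In a full model, the context vector would be recomputed at every decoding step and fed, together with the semantic features and the previous word embedding, into the caption-generating recurrent decoder.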
Pages: 1318-1330 (13 pages)