Learning deep spatiotemporal features for video captioning

Cited by: 9
Authors
Daskalakis, Eleftherios [1 ]
Tzelepi, Maria [1 ]
Tefas, Anastasios [1 ]
Affiliations
[1] Aristotle Univ Thessaloniki, Dept Informat, Thessaloniki, Greece
DOI
10.1016/j.patrec.2018.09.022
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose a novel automatic video captioning system that translates videos into sentences using a deep neural network composed of three building blocks of convolutional and recurrent structure. The first subnetwork operates as a feature extractor on single frames. The second subnetwork is a three-stream network that captures spatial semantic information in the first stream, temporal semantic information in the second stream, and global video-concept information in the third stream. The third subnetwork generates relevant textual captions, taking the spatiotemporal features of the second subnetwork as input. The experimental validation indicates the effectiveness of the proposed model, which achieves superior performance over competitive methods. (C) 2018 Elsevier B.V. All rights reserved.
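The three-part pipeline described in the abstract can be sketched as a data flow over per-frame features. Everything below is an illustrative assumption, not the authors' configuration: the feature dimension, the frame-difference temporal stream, and the mean-pooling fusion are stand-ins for the paper's actual CNN backbone, three-stream design, and recurrent decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Subnetwork 1: per-frame feature extractor (a CNN backbone in the paper).
# We stand in 16 frames of 512-d features (dimensions are assumptions).
frame_feats = rng.standard_normal((16, 512))

# Subnetwork 2: three streams over the frame features.
spatial = frame_feats                      # spatial semantics, one vector per frame
temporal = np.diff(frame_feats, axis=0)    # temporal semantics (frame differences, an assumption)
global_concept = frame_feats.mean(axis=0)  # global video concept (mean pooling, an assumption)

# Fuse the stream summaries into one spatiotemporal descriptor.
video_feat = np.concatenate(
    [spatial.mean(axis=0), temporal.mean(axis=0), global_concept]
)

# Subnetwork 3: a recurrent decoder would consume video_feat to emit the caption words.
print(video_feat.shape)  # (1536,)
```

The point of the sketch is only the shape of the architecture: three parallel summaries of the same frame features are fused before caption generation.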
Pages: 143-149
Page count: 7
Related Papers (50 in total)
  • [1] Deep Learning for Video Captioning: A Review
    Chen, Shaoxiang
    Yao, Ting
    Jiang, Yu-Gang
    [J]. PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 6283 - 6290
  • [2] Image and Video Captioning for Apparels Using Deep Learning
    Agarwal, Govind
    Jindal, Kritika
    Chowdhury, Abishi
    Singh, Vishal K.
    Pal, Amrit
    [J]. IEEE ACCESS, 2024, 12 : 113138 - 113150
  • [3] Deep Learning based, a New Model for Video Captioning
    Ozer, Elif Gusta
    Karapinar, Ilteber Nur
    Busbug, Sena
    Turan, Sumeyye
    Utku, Anil
    Akcayol, M. Ali
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (03) : 514 - 519
  • [4] An efficient deep learning-based video captioning framework using multi-modal features
    Varma, Soumya
    James, Dinesh Peter
    [J]. EXPERT SYSTEMS, 2021,
  • [5] Video Captioning with Tube Features
    Zhao, Bin
    Li, Xuelong
    Lu, Xiaoqiang
    [J]. PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 1177 - 1183
  • [6] Multimodal Deep Neural Network with Image Sequence Features for Video Captioning
    Oura, Soichiro
    Matsukawa, Tetsu
    Suzuki, Einoshin
    [J]. 2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [7] SPATIOTEMPORAL UTILIZATION OF DEEP FEATURES FOR VIDEO SALIENCY DETECTION
    Le, Trung-Nghia
    Sugimoto, Akihiro
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2017,
  • [8] Towards Unified Deep Learning Model for NSFW Image and Video Captioning
    Ko, Jong-Won
    Hwang, Dong-Hyun
    [J]. ADVANCED MULTIMEDIA AND UBIQUITOUS ENGINEERING, MUE/FUTURETECH 2018, 2019, 518 : 57 - 63
  • [9] Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods
    Islam, S.
    Dash, A.
    Seum, A.
    Raj, A. H.
    Hossain, T.
    Shah, F. M.
    [J]. SN Computer Science, 2021, 2 (2)