Video description: A comprehensive survey of deep learning approaches

Cited by: 10
Authors
Rafiq, Ghazala [1]
Rafiq, Muhammad [2]
Choi, Gyu Sang [1]
Affiliations
[1] Yeungnam Univ, Dept Informat & Commun Engn, Gyongsan 38541, South Korea
[2] Keimyung Univ, Dept Game & Mobile Engn, 1095 Dalgubeol Daero, Daegu 42601, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Deep learning; Encoder-Decoder architecture; Text description; Video captioning techniques; Video description approaches; Video captioning; Vision to text; NETWORKS;
DOI
10.1007/s10462-023-10414-6
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video description refers to understanding visual content and transforming that acquired understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing in conjunction with real-time and practical applications. Deep learning-based approaches employed for video description have demonstrated enhanced results compared to conventional approaches. The current literature lacks a thorough interpretation of the recently developed and employed sequence-to-sequence techniques for video description. This paper fills that gap by focusing mainly on deep learning-enabled approaches to automatic caption generation. Sequence-to-sequence models follow an Encoder-Decoder architecture employing a specific composition of CNN, RNN, or the variants LSTM or GRU as an encoder and decoder block. This standard architecture can be fused with an attention mechanism to focus on specific distinctive features, achieving high-quality results. Reinforcement learning employed within the Encoder-Decoder structure can progressively deliver state-of-the-art captions by following exploration and exploitation strategies. The transformer mechanism is a modern and efficient transductive architecture for robust output. Free from recurrence and based solely on self-attention, it allows parallelization along with training on a massive amount of data, and it can fully utilize the available GPUs for most NLP tasks. Recently, with the emergence of several versions of transformers, long-term dependency handling is no longer an issue for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes. They can get promising directions from this research.
Pages: 13293-13372
Number of pages: 80
Related papers
  • [1] Video description: A comprehensive survey of deep learning approaches
    Ghazala Rafiq
    Muhammad Rafiq
    Gyu Sang Choi
    Artificial Intelligence Review, 2023, 56 : 13293 - 13372
  • [2] Video restoration based on deep learning: a comprehensive survey
    Rota, Claudio
    Buzzelli, Marco
    Bianco, Simone
    Schettini, Raimondo
    Artificial Intelligence Review, 2023, 56 (06) : 5317 - 5364
  • [3] Video restoration based on deep learning: a comprehensive survey
    Claudio Rota
    Marco Buzzelli
    Simone Bianco
    Raimondo Schettini
    Artificial Intelligence Review, 2023, 56 : 5317 - 5364
  • [4] Deep Learning Approaches for Autonomous Driving a Comprehensive Survey
    Vasanthamma
    Dubey, Manoj
    Kantharaju, Kanaparthi
    Kollipara, Naga Venkateshwara Rao
    Sumalatha, M.
    Metallurgical & Materials Engineering, 2025, 31 (01) : 346 - 354
  • [5] A Comprehensive Survey of Deep Learning Approaches in Image Processing
    Trigka, Maria
    Dritsas, Elias
    Sensors, 2025, 25 (02)
  • [6] Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey
    Xu, Yuecong
    Cao, Haozhi
    Xie, Lihua
    Li, Xiao-Li
    Chen, Zhenghua
    Yang, Jianfei
    ACM Computing Surveys, 2024, 56 (12)
  • [7] A Comprehensive Survey on Deep Learning Techniques for Digital Video Forensics
    Vigneshwaran, T.
    Velammal, B. L.
    Journal of Information & Knowledge Management, 2024, 23 (03)
  • [8] A Comprehensive Review of Deep Learning Approaches for Animal Detection on Video Data
    Kumar, Prashanth
    Luo, Suhuai
    Shaukat, Kamran
    International Journal of Advanced Computer Science and Applications, 2023, 14 (11) : 1420 - 1437
  • [9] Video super-resolution based on deep learning: a comprehensive survey
    Liu, Hongying
    Ruan, Zhubo
    Zhao, Peng
    Dong, Chao
    Shang, Fanhua
    Liu, Yuanyuan
    Yang, Linlin
    Timofte, Radu
    Artificial Intelligence Review, 2022, 55 (08) : 5981 - 6035
  • [10] Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods
    Islam S.
    Dash A.
    Seum A.
    Raj A.H.
    Hossain T.
    Shah F.M.
    SN Computer Science, 2021, 2 (2)