An efficient deep learning-based video captioning framework using multi-modal features

Cited by: 2
Authors:
Varma, Soumya [1 ]
James, Dinesh Peter [1 ]
Affiliations:
[1] Karunya Inst Technol & Sci, Dept Comp Sci & Engn, Coimbatore 641114, Tamil Nadu, India
Keywords:
attention context; encoder-decoder framework; language model; quantum machine learning; video captioning;
DOI:
10.1111/exsy.12920
CLC number: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Visual understanding has become increasingly significant for gathering information in many real-life applications. For a human, understanding the content of a visual scene is trivial; for a machine, it remains challenging. Generating captions for images and videos is gaining importance, with wide applications in assistive technologies, automatic video captioning, video summarization, subtitling, blind navigation, and so on. A visual understanding framework analyses the content of a video to generate a semantically accurate caption. Beyond visual understanding, the extracted semantics must be expressed in a natural language such as English, which requires a language model; ensuring the semantics and grammar of the generated sentences is therefore a further challenge. The generated description should capture not only the objects in the scene but also how those objects relate to one another through the activity depicted, making the entire process a complex task for a machine. This work surveys deep learning methods for video captioning, the datasets widely used for the task, and the evaluation metrics used for performance comparison. Building on insights from our earlier work and an extensive literature review, we propose a practical, efficient deep learning-based video captioning architecture that utilizes audio cues, external knowledge and attention context to improve the captioning process. Quantum deep learning architectures can also yield remarkable results in object recognition tasks and convolutional feature extraction.
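The abstract describes an encoder-decoder captioning pipeline that combines attention context over visual features with audio cues. As a minimal, dependency-free sketch of that idea (not the authors' implementation: the function names, toy feature vectors, and concatenation-based fusion are hypothetical illustrations), the decoder's hidden state can attend over per-frame features to form a context vector, which is then fused with an audio feature:

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of attention scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, frame_feats):
    # dot-product attention: score each frame against the decoder query,
    # then form a weighted sum of frame features (the "attention context")
    scores = [sum(q * f for q, f in zip(query, feats)) for feats in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    context = [sum(w * feats[d] for w, feats in zip(weights, frame_feats))
               for d in range(dim)]
    return context, weights

def fuse(visual_ctx, audio_feat):
    # simplest multi-modal fusion: concatenate visual context and audio feature
    return visual_ctx + audio_feat

# toy example: three frame feature vectors and a decoder hidden state (hypothetical)
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]
ctx, weights = attend(query, frames)
fused = fuse(ctx, [0.3, 0.7])  # fused vector would feed the caption decoder
```

In a real system the frame features would come from a CNN backbone and the audio features from an audio encoder, and fusion is often learned rather than plain concatenation; the sketch only shows how attention weights select the frames most relevant to the word being generated.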
Pages: 16