An efficient deep learning-based video captioning framework using multi-modal features

Cited by: 2

Authors:

Varma, Soumya [1]

James, Dinesh Peter [1]

Affiliations:

[1] Karunya Inst Technol & Sci, Dept Comp Sci & Engn, Coimbatore 641114, Tamil Nadu, India

Source:

EXPERT SYSTEMS | 2021

Keywords:

attention context; encoder-decoder framework; language model; quantum machine learning; video captioning

DOI:

10.1111/exsy.12920

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Visual understanding has become increasingly important for gathering information in many real-life applications. For a human, understanding the content of a visual is trivial; for a machine, it is a challenging task. Generating captions for images and videos to aid understanding of a situation is gaining importance because it has wide application in assistive technologies, automatic video captioning, video summarization, subtitling, blind navigation, and so on. A visual understanding framework analyses the content of the video to generate a semantically accurate caption for it. Beyond the visual understanding of the situation, the extracted semantics must be expressed in a natural language such as English, which requires a language model. Hence, the semantics and grammar of the generated English sentences are yet another challenge. The description of a video should capture not just the objects contained in the scene but also how these objects relate to each other through the activity in the scene, which makes the entire process a complex task for a machine. This work surveys the deep learning methods used for video captioning, the datasets widely used for these tasks, and the evaluation metrics used for performance comparison. The insights gained from our earlier work and the extensive literature review enabled us to propose a practical, efficient deep learning architecture for video captioning that utilizes audio cues, external knowledge and attention context to improve the captioning process. Quantum deep learning architectures can bring about extraordinary results in object recognition tasks and feature extraction using convolutions.


Pages: 16
