A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models

Cited: 10
Authors
Thangavel, Kumaravel [1 ]
Palanisamy, Natesan [2 ]
Muthusamy, Suresh [3 ]
Mishra, Om Prava [4 ]
Sundararajan, Suma Christal Mary [5 ]
Panchal, Hitesh [6 ]
Loganathan, Ashok Kumar [7 ]
Ramamoorthi, Ponarun [8 ]
Affiliations
[1] Kongu Engn Coll Autonomous, Dept Comp Sci & Engn, Erode, Tamil Nadu, India
[2] Kongu Engn Coll Autonomous, Dept Comp Sci & Engn, Erode, Tamil Nadu, India
[3] Kongu Engn Coll Autonomous, Dept Elect & Commun Engn, Erode, Tamil Nadu, India
[4] Vel Tech Rangarajan Dr Sagunthala R&D Inst Sci & T, Dept Elect & Commun Engn, Chennai, Tamil Nadu, India
[5] Panimalar Engn Coll Autonomous, Dept Informat Technol, Chennai, Tamil Nadu, India
[6] Govt Engn Coll, Dept Mech Engn, Patan, Gujarat, India
[7] PSG Coll Technol, Dept Elect & Elect Engn, Coimbatore, Tamil Nadu, India
[8] Theni Kammavar Sangam Coll Technol, Dept Elect & Elect Engn, Theni, Tamil Nadu, India
Keywords
Image captioning; Mask RCNN; LSTM; Multimodal feature fusion; Semantic feature analysis;
DOI
10.1007/s00500-023-08448-7
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image captioning is a technique that allows computers to interpret the information in photographs and generate written text. Since its emergence, the use of deep learning to interpret image content and produce descriptive text has become a widely researched topic. Nevertheless, these strategies do not identify all samples that depict conceptual ideas; in practice, the vast majority of instances are irrelevant to the matching task, and the degree of similarity is determined by only a few relevant semantic occurrences. Such duplicate instances can be regarded as noise, since they obstruct the matching of the few meaningful instances and add to the model's computational cost. Existing schemes rely on traditional convolutional neural networks (CNNs), whose structure makes captioning less effective. Furthermore, present approaches frequently require additional target-recognition algorithms or costly human labeling when information must be extracted. For image captioning, this research presents a deep learning model based on multimodal feature fusion. The encoding layer uses a mask region-based convolutional neural network (Mask R-CNN, an extension of Faster R-CNN), and a long short-term memory (LSTM) network decodes the fused features and constructs the descriptive text. The model parameters are optimized through gradient-based optimization. In the decoding layer, a dense attention mechanism helps minimize interference from non-salient data and preferentially feeds the relevant data into the decoding stage. The model is trained on input images so that it produces captions that closely and accurately describe them. Several datasets are used to evaluate the model's precision and the fluency of the language it acquires by analyzing picture descriptions. Results from these tests demonstrate that the model consistently provides correct descriptions of input images.
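The encoder-decoder pipeline described above (region features fused by attention, then decoded step by step by an LSTM) can be sketched in miniature. This is an illustrative sketch only, not the paper's implementation: random vectors stand in for Mask R-CNN region features, and all dimensions, weights, and the greedy decoding loop are hypothetical, untrained placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend(regions, h, W_att):
    # dense attention: score every region feature against the decoder state,
    # then fuse the regions into one context vector by a softmax-weighted sum
    scores = regions @ (W_att @ h)            # (num_regions,)
    weights = softmax(scores)
    return weights @ regions                  # (feat_dim,)

def lstm_step(x, h, c, W, U, b):
    # one LSTM cell update; gate pre-activations stacked as [i, f, o, g]
    z = W @ x + U @ h + b
    H = h.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

feat_dim, hidden, vocab = 16, 8, 20
regions = rng.normal(size=(5, feat_dim))      # stand-in for Mask R-CNN region features
W_att = 0.1 * rng.normal(size=(feat_dim, hidden))
W = 0.1 * rng.normal(size=(4 * hidden, feat_dim + hidden))
U = 0.1 * rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
E = 0.1 * rng.normal(size=(vocab, hidden))    # word embeddings
W_out = 0.1 * rng.normal(size=(vocab, hidden))

h, c = np.zeros(hidden), np.zeros(hidden)
token = 0                                     # <start> token id
caption = []
for _ in range(4):                            # greedy decode a few steps
    context = attend(regions, h, W_att)
    x = np.concatenate([context, E[token]])   # fuse visual context + word embedding
    h, c = lstm_step(x, h, c, W, U, b)
    probs = softmax(W_out @ h)
    token = int(probs.argmax())
    caption.append(token)

print(caption)
```

With trained weights, the same loop would emit real word indices; here it only demonstrates how the attention-fused context enters each LSTM step.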
The model is trained to provide captions, i.e., words describing an input picture, and classification scores are used to measure its effectiveness. With a batch size of 512 and 100 training epochs, the proposed system shows a 95% improvement in performance. The experimental results on generic images validate the model's capacity to comprehend images and generate text. The work is implemented using Python frameworks and evaluated with performance metrics such as PSNR, RMSE, SSIM, accuracy, recall, F1-score, and precision.
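Several of the metrics named above can be computed from first principles (SSIM is omitted here because its windowed computation is considerably longer). The arrays below are hypothetical stand-ins for a reference image patch, its reconstruction, and binary classification labels, purely to show the formulas in code.

```python
import numpy as np

def rmse(a, b):
    # root-mean-square error between two arrays of equal shape
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def psnr(a, b, max_val=255.0):
    # peak signal-to-noise ratio in dB; infinite for identical images
    e = rmse(a, b)
    return float("inf") if e == 0.0 else float(20.0 * np.log10(max_val / e))

def precision_recall_f1(y_true, y_pred):
    # binary precision, recall, and F1 from 0/1 label arrays
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

ref = np.full((4, 4), 255.0)   # hypothetical reference image patch
rec = np.full((4, 4), 254.0)   # reconstruction off by one grey level
p, r, f = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(round(rmse(ref, rec), 2), round(psnr(ref, rec), 2), round(f, 3))
```

An accuracy score would follow the same pattern: the fraction of positions where `y_true` and `y_pred` agree.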
Pages: 14205-14218
Page count: 14
Related Papers
50 items total
  • [31] A novel image fusion method using WBCT and PCA
    Miao Qiguang
    Wang Baoshu
    Chinese Optics Letters, 2008, (02) : 104 - 107
  • [32] A novel image fusion method using WBCT and PCA
    Miao, Qiguang
    Wang, Baoshu
    CHINESE OPTICS LETTERS, 2008, 6 (02) : 104 - 107
  • [33] Novel image fusion method using contourlet transform
    Miao Qiguang
    Wang Baoshu
    2006 INTERNATIONAL CONFERENCE ON COMMUNICATIONS, CIRCUITS AND SYSTEMS PROCEEDINGS, VOLS 1-4: VOL 1: SIGNAL PROCESSING, 2006, : 548 - +
  • [34] A novel method for the 3-dimensional simulation of orthognathic surgery by using a multimodal image-fusion technique
    Uechi, Jun
    Okayama, Miki
    Shibata, Takanori
    Muguruma, Takeshi
    Hayashi, Kazuo
    Endo, Kazuhiko
    Mizoguchi, Itaru
    AMERICAN JOURNAL OF ORTHODONTICS AND DENTOFACIAL ORTHOPEDICS, 2006, 130 (06) : 786 - 798
  • [35] Multimodal Sequence Classification of force-based instrumented hand manipulation motions using LSTM-RNN deep learning models
    Bhattacharjee, Abhinaba
    Anwar, Sohel
    Whitinger, Lexi
    Loghmani, M. Terry
    2023 IEEE EMBS INTERNATIONAL CONFERENCE ON BIOMEDICAL AND HEALTH INFORMATICS, BHI, 2023,
  • [36] Multimodal feature assessment using multibranch 3D CNN to BI-LSTM for feature level multi-polarization SAR image data fusion and vehicle identification
    Arnous, Ferris I.
    Narayanan, Ram M.
    RADAR SENSOR TECHNOLOGY XXVII, 2023, 12535
  • [37] Feature fusion method using BoVW framework for enhancing image retrieval
    Vimina, E. Ravindran
    Jacob, K. Poulose
    IET IMAGE PROCESSING, 2019, 13 (11) : 1979 - 1985
  • [38] A Noninvasive Body Setup Method for Radiotherapy by Using a Multimodal Image Fusion Technique
    Zhang, Jie
    Chen, Ying
    Chen, Yunxia
    Wang, Chenchen
    Cai, Jing
    Chu, Kaiyue
    Jin, Jianhua
    Ge, Yun
    Huang, Xiaolin
    Guan, Yue
    Li, Weifeng
    TECHNOLOGY IN CANCER RESEARCH & TREATMENT, 2017, 16 (06) : 1187 - 1193
  • [39] Novel Three-Stage Feature Fusion Method of Multimodal Data for Bearing Fault Diagnosis
    Wang, Daichao
    Li, Yibin
    Jia, Lei
    Song, Yan
    Liu, Yanjun
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2021, 70
  • [40] A Novel Method of Multimodal Medical Image Fusion Based on Hybrid Approach of NSCT and DTCWT
    Alseelawi, Nawar
    Hazim, Hussein Tuama
    ALRikabi, Haider Th Salim
    INTERNATIONAL JOURNAL OF ONLINE AND BIOMEDICAL ENGINEERING, 2022, 18 (03) : 114 - 133