A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models

Cited by: 10
Authors
Thangavel, Kumaravel [1]
Palanisamy, Natesan [2]
Muthusamy, Suresh [3]
Mishra, Om Prava [4]
Sundararajan, Suma Christal Mary [5]
Panchal, Hitesh [6]
Loganathan, Ashok Kumar [7]
Ramamoorthi, Ponarun [8]
Affiliations
[1] Kongu Engn Coll Autonomous, Dept Comp Sci & Engn, Erode, Tamil Nadu, India
[2] Kongu Engn Coll Autonomous, Dept Comp Sci & Engn, Erode, Tamil Nadu, India
[3] Kongu Engn Coll Autonomous, Dept Elect & Commun Engn, Erode, Tamil Nadu, India
[4] Vel Tech Rangarajan Dr Sagunthala R&D Inst Sci & T, Dept Elect & Commun Engn, Chennai, Tamil Nadu, India
[5] Panimalar Engn Coll Autonomous, Dept Informat Technol, Chennai, Tamil Nadu, India
[6] Govt Engn Coll, Dept Mech Engn, Patan, Gujarat, India
[7] PSG Coll Technol, Dept Elect & Elect Engn, Coimbatore, Tamil Nadu, India
[8] Theni Kammavar Sangam Coll Technol, Dept Elect & Elect Engn, Theni, Tamil Nadu, India
Keywords
Image captioning; Mask RCNN; LSTM; Multimodal feature fusion; Semantic feature analysis;
DOI
10.1007/s00500-023-08448-7
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image captioning is the task of having a computer interpret the content of a photograph and produce written text that describes it. Using deep learning to interpret image content and generate descriptive text has been widely researched since the approach was introduced. Nevertheless, existing strategies do not identify all the samples that depict conceptual ideas. In practice, the vast majority of instances are irrelevant to the matching task, and the degree of similarity is determined by only a few relevant semantic instances. The redundant instances can be regarded as noise: they obstruct the matching of the few meaningful instances and add to the model's computational cost. Existing schemes rely on traditional convolutional neural networks (CNN), whose structure limits captioning effectiveness. Furthermore, present approaches frequently require additional target recognition algorithms or costly human labeling when information has to be extracted. This research presents a multimodal feature fusion-based deep learning model for image captioning. The encoding layer uses Mask RCNN (which extends Faster RCNN), long short-term memory (LSTM) is used for decoding, and the descriptive text is then constructed. The model parameters are optimized with gradient-based optimization. In the decoding layer, dense attention mechanisms help reduce interference from non-salient data and preferentially feed the relevant data to the decoding stage. The model is trained on input images to produce captions that closely describe them. Various datasets are used to evaluate the model's precision and the fluency of the language it acquires by analyzing picture descriptions. Results from these tests indicate that the model consistently provides correct descriptions of input images. To measure the effectiveness of the model, the system is evaluated with classification scores. With a batch size of 512 and 100 training epochs, the proposed system shows a 95% increase in performance. The experimental data in the domain of generic images validate the model's capacity to comprehend images and generate text. The work is implemented using Python frameworks and evaluated using performance metrics such as PSNR, RMSE, SSIM, accuracy, recall, F1-score, and precision.
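For readers unfamiliar with the metrics named in the abstract, the following minimal sketch shows the textbook definitions of RMSE, PSNR, accuracy, precision, recall, and F1-score in plain Python. It is not the authors' evaluation code: the function names, the flat-list pixel representation, and the `max_value=255.0` default are assumptions made for illustration. The abstract does not state how these metrics are applied to captions, and SSIM is omitted because it needs windowed image statistics.

```python
import math


def rmse(reference, estimate):
    """Root-mean-square error between two equal-length sequences of pixel values."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.sqrt(mse)


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB (assumes 8-bit pixels by default)."""
    err = rmse(reference, estimate)
    if err == 0:
        return math.inf  # identical inputs: no noise
    return 20.0 * math.log10(max_value / err)


def classification_scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy values chosen for illustration only; they are not results from the paper.
toy_reference = [0, 0, 0, 0]
toy_estimate = [3, 4, 0, 0]
toy_rmse = rmse(toy_reference, toy_estimate)
toy_psnr = psnr(toy_reference, toy_estimate)
toy_scores = classification_scores(tp=8, fp=2, fn=2, tn=8)
```

The zero-denominator guards return 0.0 so that degenerate inputs (for example, no positive predictions) do not raise; other conventions exist, and an evaluation pipeline may prefer to report such cases as undefined.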
Pages: 14205-14218
Page count: 14
Related papers
50 in total
  • [21] New Method using Feature Level Image Fusion and Entropy Component Analysis for Multimodal Human Face Recognition
    Wu, Tao
    Wu, Xiao-Jun
    Liu, Xing
    Luo, Xiao-Qing
    2012 INTERNATIONAL WORKSHOP ON INFORMATION AND ELECTRONICS ENGINEERING, 2012, 29 : 3991 - 3995
  • [22] A Novel Seizure Detection Method Based on the Feature Fusion of Multimodal Physiological Signals
    Wu, Duanpo
    Wei, Jun
    Vidal, Pierre-Paul
    Wang, Danping
    Yuan, Yixuan
    Cao, Jiuwen
    Jiang, Tiejia
    IEEE INTERNET OF THINGS JOURNAL, 2024, 11 (16): : 27545 - 27556
  • [23] ENSEMBLE MODELS FOR MULTIMODAL SENTIMENT ANALYSIS USING TEXTUAL AND IMAGE FUSION
    Bolcas, Radu-Daniel
    Ciuc, Mihai
    Popovici, Eduard-Cristian
    UNIVERSITY POLITEHNICA OF BUCHAREST SCIENTIFIC BULLETIN SERIES C-ELECTRICAL ENGINEERING AND COMPUTER SCIENCE, 2024, 86 (04): : 279 - 290
  • [24] A Multimodal Image Registration Method for UAV Visual Navigation Based on Feature Fusion and Transformers
    He, Ruofei
    Long, Shuangxing
    Sun, Wei
    Liu, Hongjuan
    DRONES, 2024, 8 (11)
  • [25] A Novel Detection of Cerebrovascular Disease using Multimodal Medical Image Fusion
    Paul, Sudip
    Jain, Shruti
    RECENT ADVANCES IN INFLAMMATION & ALLERGY DRUG DISCOVERY, 2024, 18 (02): : 140 - 155
  • [26] A novel multimodal image feature fusion mechanism: Application to rabbit liveweight estimation in commercial farms
    Song, Daoyi
    Lai, Zhenhao
    Yang, Shuqi
    Liu, Dongyu
    Yao, Jinxia
    Wang, Hongying
    Wang, Liangju
    SMART AGRICULTURAL TECHNOLOGY, 2024, 9
  • [27] A novel method using LSTM-RNN to generate smart contracts code templates for improved usability
    Zhihao Hao
    Bob Zhang
    Dianhui Mao
    Jerome Yen
    Zhihua Zhao
    Min Zuo
    Haisheng Li
    Cheng-Zhong Xu
    Multimedia Tools and Applications, 2023, 82 : 41669 - 41699
  • [28] A novel method using LSTM-RNN to generate smart contracts code templates for improved usability
    Hao, Zhihao
    Zhang, Bob
    Mao, Dianhui
    Yen, Jerome
    Zhao, Zhihua
    Zuo, Min
    Li, Haisheng
    Xu, Cheng-Zhong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (27) : 41669 - 41699
  • [29] Multimodal image fusion for ICH detection and classification using parallel DL models
    Nagaraju, Sri Sangepu
    Mary, S. Prince
    Chandra, V. Pavani
    Gayatri, Nandam
    COMPUTER METHODS IN BIOMECHANICS AND BIOMEDICAL ENGINEERING-IMAGING AND VISUALIZATION, 2025, 13 (01):
  • [30] A novel interpretable fault diagnosis method using multi-image feature extraction and attention fusion
    Wang, Jie
    Shao, Haidong
    He, Jing
    Liu, Le
    Ma, Jingqiang
    Liu, Bin
    PATTERN RECOGNITION LETTERS, 2025, 189