A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models

Cited by: 10
Authors
Thangavel, Kumaravel [1]
Palanisamy, Natesan [2]
Muthusamy, Suresh [3]
Mishra, Om Prava [4]
Sundararajan, Suma Christal Mary [5]
Panchal, Hitesh [6]
Loganathan, Ashok Kumar [7]
Ramamoorthi, Ponarun [8]
Affiliations
[1] Kongu Engn Coll Autonomous, Dept Comp Sci & Engn, Erode, Tamil Nadu, India
[2] Kongu Engn Coll Autonomous, Dept Comp Sci & Engn, Erode, Tamil Nadu, India
[3] Kongu Engn Coll Autonomous, Dept Elect & Commun Engn, Erode, Tamil Nadu, India
[4] Vel Tech Rangarajan Dr Sagunthala R&D Inst Sci & T, Dept Elect & Commun Engn, Chennai, Tamil Nadu, India
[5] Panimalar Engn Coll Autonomous, Dept Informat Technol, Chennai, Tamil Nadu, India
[6] Govt Engn Coll, Dept Mech Engn, Patan, Gujarat, India
[7] PSG Coll Technol, Dept Elect & Elect Engn, Coimbatore, Tamil Nadu, India
[8] Theni Kammavar Sangam Coll Technol, Dept Elect & Elect Engn, Theni, Tamil Nadu, India
Keywords
Image captioning; Mask RCNN; LSTM; Multimodal feature fusion; Semantic feature analysis;
DOI
10.1007/s00500-023-08448-7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image captioning uses computers to interpret the content of photographs and produce written text. Using deep learning to interpret image information and generate descriptive text has become a widely researched topic. Nevertheless, existing strategies do not identify all instances that depict conceptual ideas. In practice, the vast majority of instances appear irrelevant to the matching task, and the degree of similarity is determined by only a few relevant semantic instances. The redundant instances act as noise: they obstruct the matching of the few meaningful instances and add to the model's computational cost. Existing schemes rely on traditional convolutional neural networks (CNNs), whose structure limits captioning effectiveness. Furthermore, current approaches frequently require additional target recognition algorithms or costly human labeling whenever information must be extracted. This research presents a multimodal feature fusion-based deep learning model for image captioning. The encoding layer uses a mask region-based convolutional neural network (Mask RCNN, an extension of Faster RCNN), and a long short-term memory (LSTM) network decodes the encoded features to construct the descriptive text. The model parameters are optimized by gradient-based optimization. In the decoding layer, dense attention mechanisms help minimize interference from non-salient data and preferentially feed the relevant data to the decoding stage. The model is trained on input images to produce captions that closely describe those images. Various datasets are used to evaluate the model's precision and the fluency of the language it learns from image descriptions. The results of these tests show that the model consistently provides correct descriptions of input images.
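The attention step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain dot-product soft attention rather than the paper's dense attention, and the names `attend`, `region_features`, and `hidden_state` are hypothetical. It only shows how a decoder state can weight region features so that salient regions dominate the input to the next decoding step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, hidden_state):
    """Dot-product soft attention (an assumed stand-in for dense attention).

    region_features: list of equal-length feature vectors, one per image region
    hidden_state:    current decoder state, same length as each feature vector
    Returns the attention weights and the weighted context vector.
    """
    # Score each region by its similarity to the decoder state.
    scores = [
        sum(f * h for f, h in zip(feature, hidden_state))
        for feature in region_features
    ]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of region features.
    dim = len(hidden_state)
    context = [
        sum(w * feature[i] for w, feature in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # toy region features
    state = [2.0, 0.0]                              # toy decoder state
    weights, context = attend(regions, state)
    print(weights, context)
```

In a full captioning model the context vector would be concatenated with the previous word embedding and passed to the LSTM cell; that part is omitted here.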
The model is trained to produce captions, or individual words, describing an input image. Its effectiveness is measured with classification scores. With a batch size of 512 and 100 training epochs, the proposed system shows a 95% increase in performance. Experimental results on generic images validate the model's capacity to comprehend images and generate text. The method is implemented using Python frameworks and evaluated using performance metrics such as PSNR, RMSE, SSIM, accuracy, recall, F1-score, and precision.
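Several of the metrics named in the abstract have standard textbook definitions, sketched below in plain Python. The function names and the flat-list inputs are assumptions made for illustration, not the paper's evaluation code. SSIM is left out because it needs windowed image statistics.

```python
import math


def rmse(reference, estimate):
    """Root mean squared error between two equal-length sequences."""
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 20 * log10(MAX / RMSE)."""
    error = rmse(reference, estimate)
    if error == 0:
        return math.inf  # identical signals
    return 20 * math.log10(max_value / error)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(rmse([0, 0, 0, 0], [3, 3, 3, 3]))
    print(psnr([0, 0, 0, 0], [3, 3, 3, 3]))
    print(precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

PSNR and RMSE compare pixel values, while precision, recall, and F1-score compare predicted labels against ground truth, so the two groups answer different questions about a model.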
Pages: 14205-14218
Number of pages: 14