Automatic Generation of Image Caption Based on Semantic Relation using Deep Visual Attention Prediction

Cited by: 0
Authors
El-gayar, M. M. [1, 2]
Affiliations
[1] Mansoura Univ, Fac Comp & Informat, Dept Informat Technol, Mansoura 35516, Egypt
[2] New Mansoura Univ, Fac Comp Sci & Engn, New Mansoura, Egypt
Keywords
Semantic image captioning; deep visual attention model; long short-term memory; wavelet-driven convolutional neural network
DOI
10.14569/IJACSA.2023.0140912
CLC Number
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
Modern systems for managing, retrieving, and analyzing images rely heavily on semantic captions to categorize their content, yet producing such captions remains challenging because manual annotation demands extensive effort, particularly for large image collections. Despite significant advances in automatic image caption generation and human attention prediction with convolutional neural networks, the attention models in these networks still need more efficient use of multi-scale features. To address this need, our study presents a novel image decoding model that integrates a wavelet-driven convolutional neural network with a dual-stage discrete wavelet transform, enabling the extraction of salient features within images. We use the wavelet-driven convolutional neural network as the encoder, coupled with a deep visual prediction model and a Long Short-Term Memory network as the decoder. The deep visual prediction model computes channel and location attention over the visual features, while local features are assessed by considering the spatial-contextual relationships among objects. Our primary contribution is an encoder-decoder model that automatically generates a semantic caption for an image based on the semantic contextual information and spatial features present in that image. We further improve the model's performance, as demonstrated through experiments on three widely used datasets: Flickr8K, Flickr30K, and MSCOCO. The proposed approach outperforms current methods, achieving superior BLEU, METEOR, and GLEU scores. This research offers a significant advancement in image captioning and attention prediction models and presents a promising direction for future work in this field.
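The encoder-decoder pipeline described in the abstract (dual-stage discrete wavelet transform feeding a wavelet-driven CNN encoder, channel and location attention, and an LSTM caption decoder) can be outlined in code. The following is a minimal PyTorch sketch of that pipeline, not the authors' published implementation: the Haar-wavelet convolution, layer sizes, squeeze-and-excitation-style channel attention, and the mean-pooled LSTM conditioning are all illustrative assumptions.

```python
# Minimal sketch of a wavelet-driven CNN encoder with channel/location attention
# and an LSTM decoder, assuming Haar wavelets and illustrative layer sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HaarDWT(nn.Module):
    """Single-stage 2-D Haar discrete wavelet transform via fixed strided convolutions."""
    def __init__(self):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        self.register_buffer("kernels", torch.stack([ll, lh, hl, hh]).unsqueeze(1))

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = x.reshape(b * c, 1, h, w)
        out = F.conv2d(x, self.kernels, stride=2)      # (B*C, 4, H/2, W/2)
        return out.reshape(b, c * 4, h // 2, w // 2)   # 4 sub-bands per channel


class WaveletEncoder(nn.Module):
    """Dual-stage DWT interleaved with convolutions, plus channel and location attention."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.dwt = HaarDWT()
        self.conv1 = nn.Conv2d(3 * 4, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64 * 4, out_dim, 3, padding=1)
        # Channel attention (squeeze-and-excitation style) and location attention.
        self.channel_fc = nn.Sequential(nn.Linear(out_dim, out_dim // 8), nn.ReLU(),
                                        nn.Linear(out_dim // 8, out_dim), nn.Sigmoid())
        self.location_conv = nn.Conv2d(out_dim, 1, 1)

    def forward(self, img):                            # img: (B, 3, H, W)
        f = F.relu(self.conv1(self.dwt(img)))          # stage-1 DWT + conv
        f = F.relu(self.conv2(self.dwt(f)))            # stage-2 DWT + conv
        ca = self.channel_fc(f.mean(dim=(2, 3)))       # (B, C) channel weights
        f = f * ca.unsqueeze(-1).unsqueeze(-1)
        la = torch.sigmoid(self.location_conv(f))      # (B, 1, h, w) location weights
        f = f * la
        return f.flatten(2).transpose(1, 2)            # (B, h*w, C) region features


class LSTMDecoder(nn.Module):
    """LSTM caption decoder conditioned on mean-pooled attended image features."""
    def __init__(self, vocab_size, feat_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):                # captions: (B, T) token ids
        ctx = feats.mean(dim=1)                        # pooled visual context
        h0, c0 = self.init_h(ctx).unsqueeze(0), self.init_c(ctx).unsqueeze(0)
        out, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(out)                           # (B, T, vocab) word logits


# Usage with random data, purely to show the tensor flow from image to word logits:
if __name__ == "__main__":
    enc, dec = WaveletEncoder(), LSTMDecoder(vocab_size=10000)
    logits = dec(enc(torch.randn(2, 3, 224, 224)), torch.randint(0, 10000, (2, 12)))
    print(logits.shape)                                # torch.Size([2, 12, 10000])
```

In a full system the decoder would attend over the region features at every time step and be trained with teacher forcing; this sketch only illustrates the overall data flow.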
Pages: 105-114
Number of pages: 10