Clustering-based mask recovery for image captioning

Authors
Liang, Xu [1]
Li, Chen [1]
Tian, Lihua [1]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Software Engn, Xian 710000, Peoples R China
Keywords
Transformer; Image captioning; Position information; Self-attention; Reinforcement learning; Cluster; Region feature; Grid feature
DOI
10.1016/j.neucom.2024.128127
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Region features play a major role in image captioning. However, obtaining them requires pre-training an object detector on large object detection datasets, which can make end-to-end training impossible. Moreover, if the distribution of the object detection datasets differs greatly from that of the image captioning datasets, the detector may fail to extract accurate region features, which limits its applicability. In this paper, we propose clustering-based mask recovery for image captioning. In the encoder, pseudo-region features are obtained by clustering grid features extracted with a Swin Transformer. We then feed the grid features together with the pseudo-region features into the decoder and let the model dynamically learn the weights of the two feature types during decoding, to minimize the effect of clustering errors. Generating pseudo-region features by clustering makes training end-to-end and removes the need for additional object detection datasets to train an object detector. In addition, the Transformer decoder suffers from a misplacement problem during decoding: the positional information the model uses when generating a word differs from the positional information it uses when it later reasons with that word. This may negatively affect the model's position encoding. We therefore replace the original decoding method with mask recovery. Furthermore, we propose a masked multi-head attention module with relative position in the decoder to integrate the information in the fused features and reconstruct the relative positional relationships between words. We conduct experiments on the MSCOCO 2014 dataset. The results show that our model obtains CIDEr scores of 144.3% (single model) and 147.0% (ensemble of 4 models) on the 'Karpathy' offline test split, and 143.2% (c40) on the official online test server.
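The encoder step described in the abstract (clustering grid features into pseudo-region features and passing both to the decoder) can be illustrated with a rough sketch. The abstract does not name the clustering algorithm, the number of clusters, or the pooling rule, so plain k-means with mean pooling is an assumption here, and the function name `pseudo_region_features`, the toy 7x7 grid, and the 8-dimensional features are hypothetical. This is not the authors' implementation.

```python
import numpy as np


def pseudo_region_features(grid, k=4, iters=10, seed=0):
    """Cluster grid features with plain k-means (an assumed choice) and
    mean-pool each cluster into one pseudo-region feature.

    grid: (N, D) array of N grid features, e.g. a flattened H*W feature map.
    Returns (regions, labels): regions is (k, D), labels is (N,).
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct grid cells.
    centroids = grid[rng.choice(len(grid), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every grid cell to every centroid.
        dists = ((grid[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = grid[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Toy stand-in for backbone output: a 7x7 grid of 8-dimensional features.
grid = np.random.default_rng(1).normal(size=(49, 8))
regions, labels = pseudo_region_features(grid, k=4)

# Grid features and pseudo-region features are passed on together.
encoder_out = np.concatenate([grid, regions], axis=0)
```

In this sketch the two feature sets are simply concatenated. The learned, dynamic weighting of grid versus pseudo-region features that the abstract describes happens in the decoder and is not modelled here.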
Pages: 8