Fine-Grained Image Captioning With Global-Local Discriminative Objective

Cited by: 48
Authors
Wu, Jie [1 ]
Chen, Tianshui [4 ,5 ]
Wu, Hefeng [2 ]
Yang, Zhi [3 ]
Luo, Guangchun [6 ]
Lin, Liang [4 ,5 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Engn, Guangzhou 515000, Peoples R China
[2] Sun Yat Sen Univ, Sch Data & Comp Sci, Guangzhou 515000, Peoples R China
[3] Sun Yat Sen Univ, Guangzhou 515000, Peoples R China
[4] Sun Yat Sen Univ, Guangzhou 510006, Peoples R China
[5] Dark Matter Res, Guangzhou 510006, Peoples R China
[6] Univ Elect Sci & Technol China, Chengdu 610051, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Visualization; Task analysis; Semantics; Reinforcement learning; Pipelines; Maximum likelihood estimation; Image captioning; Fine-grained captions; Global discriminative constraint; Local discriminative constraint; Self-retrieval; TEXT;
DOI
10.1109/TMM.2020.3011317
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Significant progress has been made in recent years in image captioning, an active topic in the fields of vision and language. However, existing methods tend to yield overly general captions composed of some of the most frequent words/phrases, resulting in inaccurate and indistinguishable descriptions (see Fig. 1). This is primarily due to (i) the conservative characteristic of traditional training objectives, which drives the model to generate correct but hardly discriminative captions for similar images, and (ii) the uneven word distribution of the ground-truth captions, which encourages generating highly frequent words/phrases while suppressing the less frequent but more concrete ones. In this work, we propose a novel global-local discriminative objective, formulated on top of a reference model, to facilitate generating fine-grained descriptive captions. Specifically, from a global perspective, we design a novel global discriminative constraint that pulls the generated sentence to better discern the corresponding image from all others in the entire dataset. From a local perspective, a local discriminative constraint is proposed to emphasize the less frequent but more concrete words/phrases, thus facilitating the generation of captions that better describe the visual details of the given images. We evaluate the proposed method on the widely used MS-COCO dataset, where it outperforms the baseline methods by a sizable margin and achieves competitive performance against existing leading approaches. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.
Pages: 2413-2427 (15 pages)
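The global discriminative constraint described in the abstract can be read as a retrieval-style contrastive objective: the generated caption's embedding should score higher against its own image than against every other image, while the local constraint up-weights the less frequent words. A minimal NumPy sketch of both ideas (the function names, the batch-as-dataset simplification, and the inverse-log rarity weighting are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def global_discriminative_loss(caption_emb, image_emb, temperature=0.1):
    """Contrastive self-retrieval loss: each generated caption should be
    more similar to its own image than to any other image in the set."""
    # L2-normalize so inner products are cosine similarities.
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = c @ v.T / temperature  # (N, N) caption-to-image scores
    # Softmax cross-entropy with the matching image (diagonal) as target.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def local_rarity_weights(token_counts):
    """Up-weight less frequent (more concrete) words, mimicking the local
    discriminative constraint's emphasis on rare words/phrases."""
    counts = np.asarray(token_counts, dtype=float)
    return 1.0 / np.log(counts + np.e)
```

In this toy form the "dataset" is just the batch; the paper's constraint discerns the image from all others in the entire dataset, and the rarity weights would multiply the per-token training loss.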