Fine-Grained Image Captioning With Global-Local Discriminative Objective

Cited by: 48
Authors
Wu, Jie [1]
Chen, Tianshui [4,5]
Wu, Hefeng [2]
Yang, Zhi [3]
Luo, Guangchun [6]
Lin, Liang [4,5]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Engn, Guangzhou 515000, Peoples R China
[2] Sun Yat Sen Univ, Sch Data & Comp Sci, Guangzhou 515000, Peoples R China
[3] Sun Yat Sen Univ, Guangzhou 515000, Peoples R China
[4] Sun Yat Sen Univ, Guangzhou 510006, Peoples R China
[5] Dark Matter Res, Guangzhou 510006, Peoples R China
[6] Univ Elect Sci & Technol China, Chengdu 610051, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Visualization; Task analysis; Semantics; Reinforcement learning; Pipelines; Maximum likelihood estimation; Image captioning; Fine-grained captions; Global discriminative constraint; Local discriminative constraint; Self-retrieval; TEXT;
DOI
10.1109/TMM.2020.3011317
CLC Classification
TP [Automation technology; computer technology];
Subject Classification Code
0812;
Abstract
Significant progress has been made in recent years in image captioning, an active topic in the fields of vision and language. However, existing methods tend to yield overly general captions composed of some of the most frequent words/phrases, resulting in inaccurate and indistinguishable descriptions (see Fig. 1). This is primarily due to (i) the conservative characteristic of traditional training objectives, which drives the model to generate correct but hardly discriminative captions for similar images, and (ii) the uneven word distribution of the ground-truth captions, which encourages generating highly frequent words/phrases while suppressing the less frequent but more concrete ones. In this work, we propose a novel global-local discriminative objective, formulated on top of a reference model, to facilitate generating fine-grained descriptive captions. Specifically, from a global perspective, we design a novel global discriminative constraint that pulls the generated sentence to better discern the corresponding image from all others in the entire dataset. From a local perspective, a local discriminative constraint is proposed to increase the attention paid to less frequent but more concrete words/phrases, thus facilitating the generation of captions that better describe the visual details of the given images. We evaluate the proposed method on the widely used MS-COCO dataset, where it outperforms the baseline methods by a sizable margin and achieves performance competitive with existing leading approaches. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.
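The abstract describes the two constraints only at a high level; the exact losses are given in the paper itself. As a minimal, hypothetical sketch (not the authors' implementation), the global constraint can be read as a contrastive self-retrieval objective that asks the generated caption to rank its own image above all others, and the local constraint as an inverse-frequency weighting that emphasizes rarer, more concrete words. The PyTorch framing, function names, temperature value, and log-frequency weighting below are illustrative assumptions.

```python
# Hypothetical sketch only: illustrates the general idea of a batch-level
# contrastive "global" constraint plus a frequency-based "local" word weighting;
# it does not reproduce the paper's exact formulation.
import torch
import torch.nn.functional as F

def global_discriminative_loss(cap_emb, img_emb, temperature=0.07):
    """Contrastive self-retrieval loss over a batch (stand-in for the whole dataset).

    cap_emb: (B, D) embeddings of generated captions
    img_emb: (B, D) embeddings of the corresponding images
    """
    cap_emb = F.normalize(cap_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    sim = cap_emb @ img_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)  # caption i should retrieve image i
    return F.cross_entropy(sim, targets)

def local_word_weights(token_ids, corpus_freq, smooth=1.0):
    """Inverse-log-frequency weights that emphasize less frequent, more concrete words.

    token_ids: (B, T) generated token indices
    corpus_freq: (V,) word counts over the training captions
    """
    inv = 1.0 / torch.log(corpus_freq.float() + 1.0 + smooth)
    w = inv[token_ids]                                  # (B, T) per-token weights
    return w / w.mean()                                 # normalize to mean 1

# Toy usage with random tensors
B, D, V, T = 8, 256, 1000, 12
loss_g = global_discriminative_loss(torch.randn(B, D), torch.randn(B, D))
weights = local_word_weights(torch.randint(0, V, (B, T)), torch.randint(1, 500, (V,)))
print(loss_g.item(), weights.shape)
```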
Pages: 2413-2427
Page count: 15