Local-to-Global Semantic Supervised Learning for Image Captioning

Cited by: 0
Authors
Wang, Juan [1 ]
Duan, Yiping [1 ]
Tao, Xiaoming [1 ]
Lu, Jianhua [1 ]
Affiliations
[1] Tsinghua Univ, Beijing Natl Res Ctr Informat Sci & Technol BNRis, Dept Elect Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
image caption; semantic supervised learning; attention mechanism; ATTENTION;
DOI
10.1109/icc40277.2020.9149264
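Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification codes
0808; 0809;
Abstract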
Image captioning is a challenging problem owing to the complexity of image content and the diverse ways of describing that content in natural language. Although current methods have made substantial progress in terms of objective metrics (such as BLEU, METEOR, ROUGE-L and CIDEr), some problems remain. Specifically, most of these methods are trained to maximize the log-likelihood or objective metrics. As a result, they often generate rigid and semantically incomplete captions. In this paper, we develop a new model that aims to generate captions conforming to human evaluation. The core idea is to use local-to-global semantic supervised learning by introducing two-level optimization objective functions. At the word level, we match each word to the image regions using the local attention objective function; at the sentence level, we align the entire sentence and the image using the global semantic objective function. Experimentally, we compare the proposed model with current methods on the MSCOCO dataset. Through ablation studies, we show that local attention supervision and global semantic supervision are each a necessary component for the success of our model. Furthermore, combining these two supervision objective functions achieves state-of-the-art performance in terms of both standard evaluation metrics and human judgment.
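The abstract does not give the form of the two objective functions, so the following is only an illustrative sketch of how a word-level and a sentence-level term might be combined with a standard captioning loss. Everything here is an assumption for illustration rather than the paper's formulation: the function names, the cross-entropy form of the local attention term, the cosine-distance form of the global semantic term, the retention of the log-likelihood term, and the weights `lam_local` and `lam_global`.

```python
import math

EPS = 1e-12  # guards against log(0); an illustrative choice


def word_nll(probs, target):
    """Negative log-likelihood of the target word index (standard captioning loss)."""
    return -math.log(probs[target] + EPS)


def local_attention_loss(attn, ref_attn):
    """Word-level term (assumed form): cross-entropy between a reference
    distribution over image regions and the predicted attention weights."""
    return -sum(r * math.log(a + EPS) for a, r in zip(attn, ref_attn) if r > 0)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_semantic_loss(sent_emb, img_emb):
    """Sentence-level term (assumed form): cosine distance between a
    sentence embedding and an image embedding."""
    return 1.0 - cosine(sent_emb, img_emb)


def total_loss(word_probs, targets, attns, ref_attns, sent_emb, img_emb,
               lam_local=1.0, lam_global=1.0):
    """Weighted sum of the captioning loss and the two supervision terms.
    The weights are hypothetical hyperparameters."""
    caption = sum(word_nll(p, t) for p, t in zip(word_probs, targets))
    local = sum(local_attention_loss(a, r) for a, r in zip(attns, ref_attns))
    glob = global_semantic_loss(sent_emb, img_emb)
    return caption + lam_local * local + lam_global * glob


# Toy example: one generated word, two image regions, 2-d embeddings.
loss = total_loss(
    word_probs=[[0.5, 0.5]], targets=[0],
    attns=[[0.5, 0.5]], ref_attns=[[1.0, 0.0]],
    sent_emb=[1.0, 0.0], img_emb=[0.0, 1.0],
)
print(round(loss, 4))
```

Setting `lam_local` or `lam_global` to zero in this sketch removes the corresponding term, which mirrors the kind of ablation the abstract mentions.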
Pages: 6