Rethinking the Reference-based Distinctive Image Captioning

Cited by: 6
Authors:
Mao, Yangjun [1 ]
Chen, Long [2 ]
Jiang, Zhihong [1 ]
Zhang, Dong [3 ]
Zhang, Zhimeng [1 ]
Shao, Jian [1 ]
Xiao, Jun [1 ]
Affiliations:
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Columbia Univ, New York, NY USA
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding:
Natural Science Foundation of Zhejiang Province; National Natural Science Foundation of China
Keywords:
Image Captioning; Distinctiveness; Benchmark; Transformer;
DOI:
10.1145/3503161.3548358
CLC Classification:
TP39 [Applications of Computers]
Subject Classification:
081203; 0835
Abstract:
Distinctive Image Captioning (DIC) - generating distinctive captions that describe the unique details of a target image - has received considerable attention over the last few years. A recent DIC work proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC). The goal is to generate captions that can tell apart the target and reference images. Unfortunately, the reference images used by existing Ref-DIC works are easy to distinguish: they resemble the target image only at the scene level and share few common objects, so a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. For example, if the target image contains the objects "towel" and "toilet" while none of the reference images do, then the simple caption "A bathroom with a towel and a toilet" is distinctive enough to tell apart the target and reference images. To ensure that Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism that strictly controls the similarity between the target and reference images at the object/attribute level (vs. the scene level). Second, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric for Ref-DIC, named DisCIDEr, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that TransDIC can generate distinctive captions, and that it outperforms several state-of-the-art models on the two new benchmarks over different metrics.
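The abstract does not give DisCIDEr's formula, only that it combines caption accuracy with distinctiveness against the reference images. As an illustration only (hypothetical function names; plain unigram overlap standing in for CIDEr's TF-IDF-weighted n-gram similarity), the following toy score shows how such a combination can reward a caption that matches the target image's ground-truth captions while penalizing overlap with the reference images' captions:

```python
from collections import Counter

def ngram_counts(caption, n=1):
    """N-gram counts of a lowercased, whitespace-tokenized caption."""
    tokens = caption.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(c1, c2):
    """Symmetric n-gram overlap in [0, 1] between two count multisets."""
    if not c1 or not c2:
        return 0.0
    shared = sum((c1 & c2).values())  # multiset intersection
    return 2.0 * shared / (sum(c1.values()) + sum(c2.values()))

def distinctive_score(candidate, target_refs, reference_captions, alpha=0.5):
    """Toy accuracy-plus-distinctiveness score (NOT the paper's DisCIDEr):
    accuracy        = mean overlap with the target image's ground-truth captions;
    distinctiveness = 1 - mean overlap with the reference images' captions."""
    cand = ngram_counts(candidate)
    accuracy = sum(overlap(cand, ngram_counts(r)) for r in target_refs) / len(target_refs)
    leakage = sum(overlap(cand, ngram_counts(r)) for r in reference_captions) / len(reference_captions)
    return alpha * accuracy + (1.0 - alpha) * (1.0 - leakage)
```

Under this sketch, a caption naming the target's unique objects ("a bathroom with a towel and a toilet") scores above a generic one ("a bathroom") whenever the reference images' captions lack those objects, which is exactly the behavior the abstract's towel/toilet example describes.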
Pages: 4374-4384 (11 pages)
Related Papers
50 items in total
  • [1] A reference-based model using deep learning for image captioning
    Nogueira, Tiago do Carmo; Vinhal, Cássio Dener Noronha; da Cruz Júnior, Gélson; Ullmann, Matheus Rudolfo Diedrich; Marques, Thyago Carvalho
    [J]. Multimedia Systems, 2023, 29(3): 1665-1681
  • [2] Reference-based model using multimodal gated recurrent units for image captioning
    Nogueira, Tiago do Carmo; Vinhal, Cássio Dener Noronha; da Cruz Júnior, Gélson; Ullmann, Matheus Rudolfo Diedrich
    [J]. Multimedia Tools and Applications, 2020, 79(41-42): 30615-30635
  • [3] Reference Based LSTM for Image Captioning
    Chen, Minghai; Ding, Guiguang; Zhao, Sicheng; Chen, Hui; Han, Jungong; Liu, Qiang
    [J]. Thirty-First AAAI Conference on Artificial Intelligence, 2017: 3981-3987
  • [4] Learning reference-based representation for image categorization
    Li, Qun; Zhang, Honggang; Guo, Jun; Bhanu, Bir
    [J]. Journal of Information and Computational Science, 2012, 9(15): 4261-4269
  • [5] Reference-based JPEG image artifacts removal
    Song, Weigang; Ji, Jiahuan; Zhong, Baojiang
    [J]. 2022 IEEE International Conference on Image Processing (ICIP), 2022: 1681-1685
  • [6] Discriminative Reference-Based Scene Image Categorization
    Li, Qun; Xu, Ding; An, Le
    [J]. IEICE Transactions on Information and Systems, 2014, E97-D(10): 2823-2826
  • [7] Group-based Distinctive Image Captioning with Memory Attention
    Wang, Jiuniu; Xu, Wenjia; Wang, Qingzhong; Chan, Antoni B.
    [J]. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 5020-5028
  • [8] Distinctive-Attribute Extraction for Image Captioning
    Kim, Boeun; Lee, Young Han; Jung, Hyedong; Cho, Choongsang
    [J]. Computer Vision - ECCV 2018 Workshops, Part IV, LNCS 11132, 2019: 133-144