Grounding of Textual Phrases in Images by Reconstruction

Cited by: 249
Authors
Rohrbach, Anna [1]
Rohrbach, Marcus [2,3]
Hu, Ronghang [2]
Darrell, Trevor [2]
Schiele, Bernt [1]
Affiliations
[1] Max Planck Inst Informat, Saarbrucken, Germany
[2] Univ Calif Berkeley, EECS, Berkeley, CA USA
[3] ICSI, Berkeley, CA USA
DOI
10.1007/978-3-319-46448-0_49
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications in human-computer interaction and image-text reference resolution. Few datasets provide ground-truth spatial localization of phrases, so it is desirable to learn from data with little or no grounding supervision. We propose a novel approach that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training, our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available, it can be applied directly via a loss over the attention mechanism. We demonstrate the effectiveness of our approach on the Flickr30k Entities and ReferItGame datasets with different levels of supervision, ranging from none, through partial, to full supervision. Our supervised variant improves over the state of the art on both datasets by a large margin.
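The attend-and-reconstruct idea in the abstract can be sketched as follows. This is a minimal NumPy illustration of the latent attention step only: the phrase encoding, the projection matrix, and all dimensions here are illustrative assumptions, not the authors' actual architecture, and the reconstruction decoder (which would receive the attended feature and be trained to regenerate the phrase) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(phrase_vec, region_feats, W):
    """Score each candidate image region against the encoded phrase and
    return the attention weights (the latent grounding distribution)
    plus the attended visual feature used for phrase reconstruction."""
    scores = region_feats @ (W @ phrase_vec)   # one score per region
    alpha = softmax(scores)                    # attention over candidate boxes
    context = alpha @ region_feats             # weighted sum of region features
    return alpha, context

# Toy setup (all sizes hypothetical): 5 candidate boxes with 16-d visual
# features, an 8-d phrase encoding (e.g. a final recurrent state), and a
# random projection standing in for learned parameters.
phrase_vec = rng.normal(size=8)
region_feats = rng.normal(size=(5, 16))
W = rng.normal(size=(16, 8))

alpha, context = attend(phrase_vec, region_feats, W)
grounded_box = int(np.argmax(alpha))  # at test time, the attended region is the grounding
```

During training, `context` would feed a decoder whose reconstruction loss drives the attention toward the correct region; with box-level supervision, a loss can instead be placed directly on `alpha`, as the abstract describes.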
Pages: 817-834
Number of pages: 18
Related Papers
50 results
  • [1] An Attention-based Regression Model for Grounding Textual Phrases in Images
    Endo, Ko
    Aono, Masaki
    Nichols, Eric
    Funakoshi, Kotaro
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3995 - 4001
  • [2] Temponym Tagging: Temporal Scopes for Textual Phrases
    Kuzey, Erdal
    Stroetgen, Jannik
    Setty, Vinay
    Weikum, Gerhard
    [J]. PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16 COMPANION), 2016, : 841 - 842
  • [3] Modularized Textual Grounding for Counterfactual Resilience
    Fang, Zhiyuan
    Kong, Shu
    Fowlkes, Charless
    Yang, Yezhou
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6371 - 6381
  • [4] A Better Loss for Visual-Textual Grounding
    Rigoni, Davide
    Serafini, Luciano
    Sperduti, Alessandro
    [J]. 37TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, 2022, : 49 - 57
  • [5] Unsupervised Textual Grounding: Linking Words to Image Concepts
    Yeh, Raymond A.
    Do, Minh N.
    Schwing, Alexander G.
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6125 - 6134
  • [6] Weakly-supervised Visual Grounding of Phrases with Linguistic Structures
    Xiao, Fanyi
    Sigal, Leonid
    Lee, Yong Jae
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 5253 - 5262
  • [7] As Time Goes By: Comprehensive Tagging of Textual Phrases with Temporal Scopes
    Kuzey, Erdal
    Setty, Vinay
    Stroetgen, Jannik
    Weikum, Gerhard
    [J]. PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16), 2016, : 915 - 925
  • [8] Nabokov and Textual Images
    Vydra, Anton
    [J]. FILOZOFIA, 2008, 63 (07): : 611 - 618
  • [9] Textual description of images
    Larabi, S.
    [J]. COMPUTATIONAL MODELLING OF OBJECTS REPRESENTED IN IMAGES: FUNDAMENTALS, METHODS AND APPLICATIONS, 2007, : 241 - 246
  • [10] Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding
    Hui, Tianrui
    Ding, Zihan
    Huang, Junshi
    Wei, Xiaoming
    Wei, Xiaolin
    Dai, Jiao
    Han, Jizhong
    Liu, Si
    [J]. PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 893 - 901