An Attention-based Regression Model for Grounding Textual Phrases in Images

Cited by: 0
Authors
Endo, Ko [1]
Aono, Masaki [1]
Nichols, Eric [2]
Funakoshi, Kotaro [2]
Affiliations
[1] Toyohashi Univ Technol, Toyohashi, Aichi, Japan
[2] Honda Res Inst Japan, Wako, Saitama, Japan
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Grounding, or localizing, a textual phrase in an image is a challenging problem that is integral to visual language understanding. Previous approaches to this task typically rely on candidate region proposals, so end performance depends on the quality of the region proposal method, and additional computational costs are incurred. In this paper, we treat grounding as a regression problem and propose a method to directly identify the region referred to by a textual phrase, eliminating the need for external candidate region prediction. Our approach uses deep neural networks to combine image and text representations and refines the target region with attention models over both image subregions and words in the textual phrase. Despite the challenging nature of this task and the sparsity of available data, in evaluation on the ReferIt dataset, our proposed method achieves a new state-of-the-art accuracy of 37.26%, surpassing the previously reported best by over 5 percentage points. We find that both the combination of image and text attention models and an image attention area-sensitive loss function contribute to substantial improvements.
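The idea in the abstract of regressing a box directly from attended image and text features can be illustrated with a small sketch. This is not the authors' architecture: the dot-product attention, the mean-pooled queries, the shared embedding dimension for regions and words, the single linear regression layer with weights `W` and `b`, and the IoU helper are all assumptions made for illustration, and every function name here is hypothetical. The abstract reports accuracy without stating the matching criterion; IoU is included only because it is a common choice in grounding evaluation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def attend(query, items):
    # Dot-product attention (an assumed form): weight each item by its
    # similarity to the query and return the weighted average.
    weights = softmax([dot(query, it) for it in items])
    pooled = [sum(w * it[d] for w, it in zip(weights, items))
              for d in range(len(items[0]))]
    return pooled, weights

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def regress_box(image_regions, word_vectors, W, b):
    # Assumption: regions and words live in the same embedding space.
    # The phrase summary attends over image subregions, and the image
    # summary attends over words.
    img_ctx, img_w = attend(mean(word_vectors), image_regions)
    txt_ctx, txt_w = attend(mean(image_regions), word_vectors)
    fused = img_ctx + txt_ctx  # list concatenation of both contexts
    raw = [dot(row, fused) + bias for row, bias in zip(W, b)]
    # Predict (center x, center y, width, height) in normalized [0, 1] units.
    cx, cy, w, h = (sigmoid(r) for r in raw)
    box = (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
           min(1.0, cx + w / 2), min(1.0, cy + h / 2))
    return box, img_w, txt_w

def iou(a, b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```

In a trained model, `W` and `b` would be learned from annotated phrase-region pairs; here they are plain lists so that the data flow from attention to box coordinates is visible without any region proposal step.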
Pages: 3995-4001
Number of pages: 7