An Attention-based Regression Model for Grounding Textual Phrases in Images

Cited by: 0
Authors
Endo, Ko [1]
Aono, Masaki [1]
Nichols, Eric [2]
Funakoshi, Kotaro [2]
Affiliations
[1] Toyohashi Univ Technol, Toyohashi, Aichi, Japan
[2] Honda Res Inst Japan, Wako, Saitama, Japan
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Grounding, or localizing, a textual phrase in an image is a challenging problem that is integral to visual language understanding. Previous approaches to this task typically make use of candidate region proposals, where end performance depends on that of the region proposal method and additional computational costs are incurred. In this paper, we treat grounding as a regression problem and propose a method to directly identify the region referred to by a textual phrase, eliminating the need for external candidate region prediction. Our approach uses deep neural networks to combine image and text representations and refines the target region with attention models over both image subregions and words in the textual phrase. Despite the challenging nature of this task and sparsity of available data, in evaluation on the ReferIt dataset, our proposed method achieves a new state-of-the-art in performance of 37.26% accuracy, surpassing the previously reported best by over 5 percentage points. We find that combining image and text attention models and an image attention area-sensitive loss function contribute to substantial improvements.
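The idea of regressing a box directly from attended image and text features can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the dot-product attention, the mean-vector queries, the single linear layer (`W`, `b`), the function names, and the corner-based box format are all assumptions made for illustration. The `iou` helper reflects a common evaluation convention (a prediction counts as correct when its overlap with the ground-truth box exceeds a threshold such as 0.5), which the abstract itself does not state.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def attend(features, query):
    """Soft attention: weight each row of `features` by its dot product with `query`."""
    weights = softmax(features @ query)       # shape (n,)
    return weights @ features, weights        # pooled vector (d,), weights (n,)

def regress_box(region_feats, word_feats, W, b):
    """Predict a normalized box (x_min, y_min, x_max, y_max) with no region proposals.

    region_feats: (n_regions, d) image subregion features (assumed given)
    word_feats:   (n_words, d) word features for the phrase (assumed given)
    W, b:         hypothetical linear regression head of shape (4, 2d) and (4,)
    """
    # Assumption: each modality's mean vector serves as the query for the other.
    text_query = word_feats.mean(axis=0)
    image_query = region_feats.mean(axis=0)
    img_vec, img_w = attend(region_feats, text_query)   # attention over image subregions
    txt_vec, txt_w = attend(word_feats, image_query)    # attention over words
    fused = np.concatenate([img_vec, txt_vec])
    raw = 1.0 / (1.0 + np.exp(-(W @ fused + b)))        # sigmoid keeps coordinates in (0, 1)
    # Order the corners so that min <= max on each axis.
    x0, x1 = sorted((raw[0], raw[2]))
    y0, y1 = sorted((raw[1], raw[3]))
    return np.array([x0, y0, x1, y1]), img_w, txt_w

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# Usage with random stand-in features (untrained weights, so the box is arbitrary).
rng = np.random.default_rng(0)
regions = rng.normal(size=(9, 8))   # e.g. a 3x3 grid of subregions, d = 8
words = rng.normal(size=(4, 8))     # a four-word phrase
W, b = rng.normal(size=(4, 16)), np.zeros(4)
box, region_weights, word_weights = regress_box(regions, words, W, b)
```

In a trained model the weights would be learned by minimizing a regression loss against ground-truth boxes; the abstract additionally mentions a loss term sensitive to the image attention area, which this sketch does not attempt to reproduce.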
Pages: 3995-4001
Page count: 7