Visual Grounding via Accumulated Attention

Cited by: 108
Authors
Deng, Chaorui [1 ]
Wu, Qi [2 ]
Wu, Qingyao [1 ]
Hu, Fuyuan [3 ]
Lyu, Fan [3 ]
Tan, Mingkui [1 ]
Institutions
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Guangdong, Peoples R China
[2] Univ Adelaide, Australia Ctr Robot Vis, Adelaide, SA, Australia
[3] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR.2018.00808
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG: 1) what is the main focus in a query; 2) how to understand an image; 3) how to locate an object. Most existing methods combine all the information curtly, which may suffer from the problem of information redundancy (i.e., ambiguous query, complicated image, and a large number of objects). In this paper, we formulate these challenges as three attention problems and propose an accumulated attention (A-ATT) mechanism to reason among them jointly. Our A-ATT mechanism can circularly accumulate the attention for useful information in image, query, and objects, while the noises are ignored gradually. We evaluate the performance of A-ATT on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!), and the experimental results show the superiority of the proposed method in terms of accuracy.
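The circular accumulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature shapes, the additive context guidance, and the number of rounds are all assumptions made for illustration. Each of the three attention modules (query words, image regions, candidate objects) re-attends in turn, conditioned on the attended summaries of the other two, so that over several rounds the attention concentrates on mutually consistent information.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def accumulated_attention(query_feats, image_feats, object_feats, rounds=3):
    """Illustrative sketch of an accumulated-attention loop (assumed form,
    not the A-ATT paper's exact architecture).

    query_feats:  (num_words,   d) word features of the query
    image_feats:  (num_regions, d) image region features
    object_feats: (num_objects, d) candidate object features

    Returns the final attention distribution over candidate objects.
    """
    # Initialize each module's context as a uniform (mean) summary.
    q_ctx = query_feats.mean(axis=0)
    i_ctx = image_feats.mean(axis=0)
    o_ctx = object_feats.mean(axis=0)
    for _ in range(rounds):
        # Attend over query words, guided by image + object context.
        a_q = softmax(query_feats @ (i_ctx + o_ctx))
        q_ctx = a_q @ query_feats
        # Attend over image regions, guided by query + object context.
        a_i = softmax(image_feats @ (q_ctx + o_ctx))
        i_ctx = a_i @ image_feats
        # Attend over candidate objects, guided by query + image context.
        a_o = softmax(object_feats @ (q_ctx + i_ctx))
        o_ctx = a_o @ object_feats
    return a_o
```

The key design point the sketch captures is that no single modality is resolved in isolation: each round sharpens one attention map using the current summaries of the other two, which is how redundant or noisy information is gradually suppressed.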
Pages: 7746-7755
Page count: 10
Related Papers
50 records in total
  • [21] Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
    Ye, Jiabo
    Tian, Junfeng
    Yan, Ming
    Yang, Xiaoshan
    Wang, Xuwu
    Zhang, Ji
    He, Liang
    Lin, Xin
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15481 - 15491
  • [22] Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding
    Zhao, Heng
    Zhou, Joey Tianyi
    Ong, Yew-Soon
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (02) : 1523 - 1533
  • [23] RESIDUAL GRAPH ATTENTION NETWORK AND EXPRESSION-RESPECT DATA AUGMENTATION AIDED VISUAL GROUNDING
    Wang, Jia
    Wu, Hung-Yi
    Chen, Jun-Cheng
    Shuai, Hong-Han
    Cheng, Wen-Huang
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 326 - 330
  • [24] MODELING VISUAL-ATTENTION VIA SELECTIVE TUNING
    TSOTSOS, JK
    CULHANE, SM
    WAI, WYK
    LAI, YH
    DAVIS, N
    NUFLO, F
    [J]. ARTIFICIAL INTELLIGENCE, 1995, 78 (1-2) : 507 - 545
  • [25] Fabric Defects Detection via Visual Attention Mechanism
    Li, Ning
    Zhao, Jianyu
    Jiang, Ping
    [J]. 2017 CHINESE AUTOMATION CONGRESS (CAC), 2017, : 2956 - 2960
  • [26] Paper Defects Detection via Visual Attention Mechanism
    Jiang Ping
    Gao Tao
    [J]. 2011 30TH CHINESE CONTROL CONFERENCE (CCC), 2011, : 5852 - 5856
  • [27] Image Caption via Visual Attention Switch on DenseNet
    Hao, Yanlong
    Xie, Jiyang
    Lin, Zhiqing
    [J]. PROCEEDINGS OF 2018 INTERNATIONAL CONFERENCE ON NETWORK INFRASTRUCTURE AND DIGITAL CONTENT (IEEE IC-NIDC), 2018, : 334 - 338
  • [28] The Application of a Novel Target Region Extraction Model Based on Object-accumulated Visual Attention Mechanism
    Xiao, Jie
    Cai, Chao
    Ding, Mingyue
    Zhou, Chengping
    [J]. ICNC 2008: FOURTH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, VOL 6, PROCEEDINGS, 2008, : 116 - 120
  • [29] Gaze Assisted Visual Grounding
    Johari, Kritika
    Tong, Christopher Tay Zi
    Subbaraju, Vigneshwaran
    Kim, Jung-Jae
    Tan, U-Xuan
    [J]. SOCIAL ROBOTICS, ICSR 2021, 2021, 13086 : 191 - 202
  • [30] Sentence Attention Blocks for Answer Grounding
    Khoshsirat, Seyedalireza
    Kambhamettu, Chandra
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 6057 - 6067