Visual Grounding Via Accumulated Attention

Cited by: 9
Authors
Deng, Chaorui [1 ,2 ]
Wu, Qi [3 ]
Wu, Qingyao [1 ]
Hu, Fuyuan [4 ]
Lyu, Fan [5 ]
Tan, Mingkui [1 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[2] Pazhou Lab, Guangzhou 510335, Peoples R China
[3] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5005, Australia
[4] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[5] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China
Funding
Australian Research Council; National Natural Science Foundation of China;
Keywords
Proposals; Visualization; Training; Feature extraction; Task analysis; Grounding; Cognition; Visual grounding; accumulated attention; noised training strategy; bounding box regression;
DOI
10.1109/TPAMI.2020.3023438
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. Generally, it requires the machine to first understand the query, identify the key concepts in the image, and then locate the target object by specifying its bounding box. However, many real-world visual grounding applications involve ambiguous queries and images with complicated scene structures. Identifying the target from highly redundant and correlated information can be very challenging and often leads to unsatisfactory performance. To tackle this, we exploit an attention module for each kind of information to reduce internal redundancies. We then propose an accumulated attention (A-ATT) mechanism to reason among all the attention modules jointly, so that the relations among the different kinds of information are captured explicitly. Moreover, to improve the performance and robustness of our VG models, we introduce noise into the training procedure to bridge the distribution gap between the human-labeled training data and real-world, poor-quality data. With this "noised" training strategy, we can further learn a bounding box regressor, which is used to refine the bounding box of the target object. We evaluate the proposed methods on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!). The experimental results show that our methods significantly outperform all previous works on every dataset in terms of accuracy.
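To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of a joint attention loop in the spirit of A-ATT, plus a box-jittering helper approximating the "noised" training strategy. All names, dimensions, the number of reasoning rounds, and the noise form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AccumulatedAttention(nn.Module):
    """Hypothetical sketch: attention over query words, image regions, and
    object proposals is refined jointly over several rounds, with each
    modality conditioned on the accumulated summaries of the other two."""

    def __init__(self, dim=512, rounds=3):
        super().__init__()
        self.rounds = rounds
        # One scoring head per information source (query / image / objects).
        self.score_q = nn.Linear(3 * dim, 1)
        self.score_i = nn.Linear(3 * dim, 1)
        self.score_o = nn.Linear(3 * dim, 1)

    @staticmethod
    def _attend(feats, ctx, head):
        # feats: (B, N, D) candidate features; ctx: (B, 2D) accumulated context.
        ctx = ctx.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = head(torch.cat([feats, ctx], dim=-1)).squeeze(-1)  # (B, N)
        attn = F.softmax(logits, dim=-1)
        summary = torch.bmm(attn.unsqueeze(1), feats).squeeze(1)    # (B, D)
        return attn, summary

    def forward(self, q_feats, i_feats, o_feats):
        # Initial summaries: plain means, before any attention is available.
        s_q, s_i, s_o = q_feats.mean(1), i_feats.mean(1), o_feats.mean(1)
        a_o = None
        for _ in range(self.rounds):
            a_q, s_q = self._attend(q_feats, torch.cat([s_i, s_o], -1), self.score_q)
            a_i, s_i = self._attend(i_feats, torch.cat([s_q, s_o], -1), self.score_i)
            a_o, s_o = self._attend(o_feats, torch.cat([s_q, s_i], -1), self.score_o)
        # Final object attention scores rank the candidate proposals.
        return a_o


def jitter_box(box, scale=0.1):
    """Perturb a ground-truth (x, y, w, h) box during training; an assumed
    form of the paper's 'noised' training strategy."""
    x, y, w, h = box
    n = (torch.rand(4) - 0.5) * 2 * scale  # uniform noise in [-scale, scale]
    return (x + n[0].item() * w, y + n[1].item() * h,
            w * (1 + n[2].item()), h * (1 + n[3].item()))
```

Under these assumptions, a bounding box regressor would be trained on such jittered boxes so that, at test time, it learns to pull imperfect proposals back onto the target object.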
Pages: 1670-1684
Page count: 15