Visual Grounding Via Accumulated Attention

Cited by: 9
Authors
Deng, Chaorui [1 ,2 ]
Wu, Qi [3 ]
Wu, Qingyao [1 ]
Hu, Fuyuan [4 ]
Lyu, Fan [5 ]
Tan, Mingkui [1 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[2] Pazhou Lab, Guangzhou 510335, Peoples R China
[3] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5005, Australia
[4] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[5] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China
Funding
Australian Research Council; National Natural Science Foundation of China;
Keywords
Proposals; Visualization; Training; Feature extraction; Task analysis; Grounding; Cognition; Visual grounding; accumulated attention; noised training strategy; bounding box regression;
DOI
10.1109/TPAMI.2020.3023438
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. Generally, it requires the machine to first understand the query, identify the key concepts in the image, and then locate the target object by specifying its bounding box. However, many real-world visual grounding applications involve ambiguous queries and images with complicated scene structures. Identifying the target from highly redundant and correlated information can be very challenging and often leads to unsatisfactory performance. To tackle this, in this paper, we exploit an attention module for each kind of information to reduce internal redundancies. We then propose an accumulated attention (A-ATT) mechanism to reason among all the attention modules jointly. In this way, the relations among the different kinds of information can be captured explicitly. Moreover, to improve the performance and robustness of our VG models, we additionally introduce noise into the training procedure to bridge the distribution gap between the human-labeled training data and real-world, poor-quality data. With this "noised" training strategy, we can further learn a bounding box regressor, which can be used to refine the bounding box of the target object. We evaluate the proposed methods on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!). The experimental results show that our methods significantly outperform all previous works on every dataset in terms of accuracy.
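As a rough illustration of the accumulated attention idea described in the abstract, the sketch below runs one attention module per information stream and lets the modules refine each other over several reasoning rounds, accumulating the attention placed on object proposals. This is a minimal sketch based only on the abstract, not the authors' implementation; the choice of three streams (query words, image regions, object proposals), the feature dimension, and the number of rounds are illustrative assumptions.

```python
# Minimal sketch of an accumulated-attention style grounding head, based only on
# the abstract above. NOT the authors' code: the three streams (query words,
# image regions, object proposals), feature size, and round count are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionModule(nn.Module):
    """Attends over one stream of features, conditioned on a context vector."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, feats, context):
        # feats: (batch, n_items, dim); context: (batch, dim)
        ctx = context.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = self.score(torch.cat([feats, ctx], dim=-1)).squeeze(-1)
        weights = F.softmax(logits, dim=-1)                  # attention over items
        summary = torch.bmm(weights.unsqueeze(1), feats).squeeze(1)
        return weights, summary


class AccumulatedAttention(nn.Module):
    """Jointly reasons over query, image, and object streams for several rounds,
    accumulating the attention placed on the object proposals."""

    def __init__(self, dim, rounds=3):
        super().__init__()
        self.rounds = rounds
        self.query_att = AttentionModule(dim)
        self.image_att = AttentionModule(dim)
        self.object_att = AttentionModule(dim)

    def forward(self, query_feats, image_feats, object_feats):
        # Initial summaries are simple means over each stream.
        q_sum = query_feats.mean(dim=1)
        i_sum = image_feats.mean(dim=1)
        o_sum = object_feats.mean(dim=1)
        accumulated = torch.zeros(object_feats.shape[:2], device=object_feats.device)
        for _ in range(self.rounds):
            # Each module is conditioned on the other two streams' summaries.
            _, q_sum = self.query_att(query_feats, i_sum + o_sum)
            _, i_sum = self.image_att(image_feats, q_sum + o_sum)
            obj_weights, o_sum = self.object_att(object_feats, q_sum + i_sum)
            accumulated = accumulated + obj_weights          # accumulate attention
        # The proposal with the largest accumulated attention is the prediction.
        return accumulated.argmax(dim=-1)


if __name__ == "__main__":
    model = AccumulatedAttention(dim=256)
    query = torch.randn(2, 8, 256)     # 8 query word features
    image = torch.randn(2, 49, 256)    # 7x7 grid of image-region features
    objects = torch.randn(2, 20, 256)  # 20 candidate object proposals
    print(model(query, image, objects))  # index of the selected proposal per example
```

In the paper, a bounding-box regressor trained with the "noised" strategy would then refine the box of the proposal selected this way; that refinement step is omitted from the sketch.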
Pages: 1670-1684
Page count: 15
Related Papers
50 records in total
  • [1] Visual Grounding via Accumulated Attention
    Deng, Chaorui
    Wu, Qi
    Wu, Qingyao
    Hu, Fuyuan
    Lyu, Fan
    Tan, Mingkui
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7746 - 7755
  • [2] Language conditioned multi-scale visual attention networks for visual grounding
    Yao, Haibo
    Wang, Lipeng
    Cai, Chengtao
    Wang, Wei
    Zhang, Zhi
    Shang, Xiaobing
    IMAGE AND VISION COMPUTING, 2024, 150
  • [3] Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining
    Zhang, Yundong
    Niebles, Juan Carlos
    Soto, Alvaro
    2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 349 - 357
  • [4] Countering Language Drift via Visual Grounding
    Lee, Jason
    Cho, Kyunghyun
    Kiela, Douwe
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 4385 - 4395
  • [5] A Visual Attention Grounding Neural Model for Multimodal Machine Translation
    Zhou, Mingyang
    Cheng, Runxiang
    Lee, Yong Jae
    Yu, Zhou
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3643 - 3653
  • [6] Attention-Based Keyword Localisation in Speech using Visual Grounding
    Olaleye, Kayode
    Kamper, Herman
    INTERSPEECH 2021, 2021, : 2991 - 2995
  • [7] Hierarchical cross-modal contextual attention network for visual grounding
    Xu, Xin
    Lv, Gang
    Sun, Yining
    Hu, Yuxia
    Nian, Fudong
    MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083
  • [8] Diagram Visual Grounding: Learning to See with Gestalt-Perceptual Attention
    Hu, Xin
    Zhang, Lingling
    Liu, Jun
    Zhang, Xinyu
    Wu, Wenjun
    Wang, Qianying
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 837 - 845
  • [9] VISUAL GROUNDING
    CUMBOW, RC
    AMERICAN FILM, 1978, 3 (10): : 16 - 16