Visual Grounding via Accumulated Attention

Cited by: 108
Authors
Deng, Chaorui [1 ]
Wu, Qi [2 ]
Wu, Qingyao [1 ]
Hu, Fuyuan [3 ]
Lyu, Fan [3 ]
Tan, Mingkui [1 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Guangdong, Peoples R China
[2] Univ Adelaide, Australia Ctr Robot Vis, Adelaide, SA, Australia
[3] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR.2018.00808
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG: 1) what is the main focus in a query; 2) how to understand an image; 3) how to locate an object. Most existing methods combine all the information crudely, which may suffer from the problem of information redundancy (i.e., an ambiguous query, a complicated image, and a large number of objects). In this paper, we formulate these challenges as three attention problems and propose an accumulated attention (A-ATT) mechanism to reason over them jointly. Our A-ATT mechanism can circularly accumulate the attention for useful information in the image, query, and objects, while noise is gradually ignored. We evaluate the performance of A-ATT on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!), and the experimental results show the superiority of the proposed method in terms of accuracy.
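The circular accumulation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the shared projection `W` is a random placeholder standing in for learned parameters, and the exact conditioning in A-ATT differs; the sketch only shows the idea of each modality's attention being refined from the attended summaries of the other two over several rounds.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def accumulated_attention(query_feats, image_feats, object_feats,
                          rounds=3, seed=0):
    """Sketch of a circular attention loop over query words, image regions,
    and candidate objects. In each round, the attention for one modality is
    computed from the attended summaries of the other two, so useful signals
    accumulate while noise is gradually down-weighted. `W` is a random
    placeholder for a learned projection (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    d = query_feats.shape[1]
    # Attended summaries, initialised to simple means of each modality.
    q = query_feats.mean(axis=0)
    v = image_feats.mean(axis=0)
    o = object_feats.mean(axis=0)
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    for _ in range(rounds):
        a_q = softmax(query_feats @ W @ (v + o))   # query attention, guided by image + objects
        q = a_q @ query_feats
        a_v = softmax(image_feats @ W @ (q + o))   # image attention, guided by query + objects
        v = a_v @ image_feats
        a_o = softmax(object_feats @ W @ (q + v))  # object attention, guided by query + image
        o = a_o @ object_feats
    return a_o  # final distribution over candidate objects

# Toy usage: 5 query words, 7 image regions, 4 candidate objects, dim 8.
rng = np.random.default_rng(1)
scores = accumulated_attention(rng.standard_normal((5, 8)),
                               rng.standard_normal((7, 8)),
                               rng.standard_normal((4, 8)))
```

The returned `scores` form a probability distribution over candidate objects; the grounded object would be the one with the highest score.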
Pages: 7746-7755
Page count: 10
Related papers (50 in total)
  • [1] Visual Grounding Via Accumulated Attention
    Deng, Chaorui
    Wu, Qi
    Wu, Qingyao
    Hu, Fuyuan
    Lyu, Fan
    Tan, Mingkui
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (03) : 1670 - 1684
  • [2] Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining
    Zhang, Yundong
    Niebles, Juan Carlos
    Soto, Alvaro
    [J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 349 - 357
  • [3] Language conditioned multi-scale visual attention networks for visual grounding
    Yao, Haibo
    Wang, Lipeng
    Cai, Chengtao
    Wang, Wei
    Zhang, Zhi
    Shang, Xiaobing
    [J]. IMAGE AND VISION COMPUTING, 2024, 150
  • [4] Countering Language Drift via Visual Grounding
    Lee, Jason
    Cho, Kyunghyun
    Kiela, Douwe
    [J]. 2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 4385 - 4395
  • [5] A Visual Attention Grounding Neural Model for Multimodal Machine Translation
    Zhou, Mingyang
    Cheng, Runxiang
    Lee, Yong Jae
    Yu, Zhou
    [J]. 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3643 - 3653
  • [6] Hierarchical cross-modal contextual attention network for visual grounding
    Xu, Xin
    Lv, Gang
    Sun, Yining
    Hu, Yuxia
    Nian, Fudong
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083
  • [7] Attention-Based Keyword Localisation in Speech using Visual Grounding
    Olaleye, Kayode
    Kamper, Herman
    [J]. INTERSPEECH 2021, 2021, : 2991 - 2995
  • [8] Diagram Visual Grounding: Learning to See with Gestalt-Perceptual Attention
    Hu, Xin
    Zhang, Lingling
    Liu, Jun
    Zhang, Xinyu
    Wu, Wenjun
    Wang, Qianying
    [J]. PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 837 - 845
  • [9] VISUAL GROUNDING
    CUMBOW, RC
    [J]. AMERICAN FILM, 1978, 3 (10) : 16 - 16