Visual Grounding via Accumulated Attention

Cited by: 108

Authors:

Deng, Chaorui [1]

Wu, Qi [2]

Wu, Qingyao [1]

Hu, Fuyuan [3]

Lyu, Fan [3]

Tan, Mingkui [1]

Affiliations:

[1] South China Univ Technol, Sch Software Engn, Guangzhou, Guangdong, Peoples R China

[2] Univ Adelaide, Australia Ctr Robot Vis, Adelaide, SA, Australia

[3] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou, Peoples R China

Source:

2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2018

Funding:

National Natural Science Foundation of China

DOI:

10.1109/CVPR.2018.00808

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG: 1) what is the main focus in a query; 2) how to understand an image; 3) how to locate an object. Most existing methods combine all the information crudely, and may therefore suffer from information redundancy (i.e. an ambiguous query, a complicated image, and a large number of objects). In this paper, we formulate these challenges as three attention problems and propose an accumulated attention (A-ATT) mechanism to reason among them jointly. Our A-ATT mechanism can circularly accumulate the attention for useful information in the image, query, and objects, while noise is gradually ignored. We evaluate the performance of A-ATT on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!), and the experimental results show the superiority of the proposed method in terms of accuracy.
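The abstract's idea of circularly accumulating attention over the query, the image, and the objects can be illustrated with a toy sketch. This is not the paper's architecture: the dot-product scoring, the shared embedding space, the summed guide vector, and all names (`attend`, `accumulated_attention`, `rounds`) are assumptions made purely for illustration. Each round re-attends one modality using the current summaries of the other two, so evidence that all three agree on is reinforced.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def mean(features):
    n = len(features)
    return [sum(f[i] for f in features) / n for i in range(len(features[0]))]

def attend(features, guide):
    """Dot-product attention: returns (weights, weighted sum of features)."""
    weights = softmax([dot(f, guide) for f in features])
    dim = len(features[0])
    summary = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, summary

def accumulated_attention(query, image, objects, rounds=3):
    """Hypothetical cyclic attention over three feature sets.

    query, image, objects: lists of equal-length vectors (assumed to live in
    one shared embedding space). Returns the final weights over `objects`.
    """
    # Start from uninformative (uniform-average) summaries.
    q_sum, i_sum, o_sum = mean(query), mean(image), mean(objects)
    o_weights = [1.0 / len(objects)] * len(objects)
    for _ in range(rounds):
        # Each modality is re-attended using the other two as guidance.
        _, q_sum = attend(query, add(i_sum, o_sum))
        _, i_sum = attend(image, add(q_sum, o_sum))
        o_weights, o_sum = attend(objects, add(q_sum, i_sum))
    return o_weights

if __name__ == "__main__":
    query = [[1.0, 0.0], [0.1, 0.1]]    # two word vectors
    image = [[1.0, 0.0], [0.0, 1.0]]    # two image-region vectors
    objects = [[2.0, 0.0], [0.0, 2.0]]  # two candidate-object vectors
    print(accumulated_attention(query, image, objects))
```

In this toy setup the first candidate object aligns with the dominant query word, and its weight grows as `rounds` increases, which mirrors the "noise is gradually ignored" behaviour the abstract describes. A real model would use learned projections and image, text, and detector features rather than hand-written vectors.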


Pages: 7746-7755

Page count: 10
