Visual Grounding Via Accumulated Attention

Cited by: 9
Authors
Deng, Chaorui [1 ,2 ]
Wu, Qi [3 ]
Wu, Qingyao [1 ]
Hu, Fuyuan [4 ]
Lyu, Fan [5 ]
Tan, Mingkui [1 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[2] Pazhou Lab, Guangzhou 510335, Peoples R China
[3] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5005, Australia
[4] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[5] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China
Funding
Australian Research Council; National Natural Science Foundation of China;
Keywords
Proposals; Visualization; Training; Feature extraction; Task analysis; Grounding; Cognition; Visual grounding; accumulated attention; noised training strategy; bounding box regression;
DOI
10.1109/TPAMI.2020.3023438
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. Generally, it requires the machine to first understand the query, identify the key concepts in the image, and then locate the target object by specifying its bounding box. However, many real-world visual grounding applications involve ambiguous queries and images with complicated scene structures. Identifying the target from highly redundant and correlated information can be very challenging and often leads to unsatisfactory performance. To tackle this, in this paper, we exploit an attention module for each kind of information to reduce internal redundancies. We then propose an accumulated attention (A-ATT) mechanism to reason among all the attention modules jointly. In this way, the relations among the different kinds of information can be captured explicitly. Moreover, to improve the performance and robustness of our VG models, we additionally introduce noise into the training procedure to bridge the distribution gap between the human-labeled training data and real-world, poor-quality data. With this "noised" training strategy, we can further learn a bounding box regressor, which can be used to refine the bounding box of the target object. We evaluate the proposed methods on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!). The experimental results show that our methods significantly outperform all previous works on every dataset in terms of accuracy.
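As a rough illustration of the accumulated attention idea described in the abstract, the sketch below runs one attention module per information stream and lets the modules refine each other over several reasoning rounds, accumulating the attention placed on object proposals. This is a minimal sketch based only on the abstract, not the authors' implementation; the choice of three streams (query words, image regions, object proposals), the feature dimension, and the number of rounds are illustrative assumptions.

```python
# Minimal sketch of an accumulated-attention style grounding head, based only on
# the abstract above. NOT the authors' code: the three streams (query words,
# image regions, object proposals), feature size, and round count are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionModule(nn.Module):
    """Attends over one stream of features, conditioned on a context vector."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, feats, context):
        # feats: (batch, n_items, dim); context: (batch, dim)
        ctx = context.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = self.score(torch.cat([feats, ctx], dim=-1)).squeeze(-1)
        weights = F.softmax(logits, dim=-1)                  # attention over items
        summary = torch.bmm(weights.unsqueeze(1), feats).squeeze(1)
        return weights, summary


class AccumulatedAttention(nn.Module):
    """Jointly reasons over query, image, and object streams for several rounds,
    accumulating the attention placed on the object proposals."""

    def __init__(self, dim, rounds=3):
        super().__init__()
        self.rounds = rounds
        self.query_att = AttentionModule(dim)
        self.image_att = AttentionModule(dim)
        self.object_att = AttentionModule(dim)

    def forward(self, query_feats, image_feats, object_feats):
        # Initial summaries are simple means over each stream.
        q_sum = query_feats.mean(dim=1)
        i_sum = image_feats.mean(dim=1)
        o_sum = object_feats.mean(dim=1)
        accumulated = torch.zeros(object_feats.shape[:2], device=object_feats.device)
        for _ in range(self.rounds):
            # Each module is conditioned on the other two streams' summaries.
            _, q_sum = self.query_att(query_feats, i_sum + o_sum)
            _, i_sum = self.image_att(image_feats, q_sum + o_sum)
            obj_weights, o_sum = self.object_att(object_feats, q_sum + i_sum)
            accumulated = accumulated + obj_weights          # accumulate attention
        # The proposal with the largest accumulated attention is the prediction.
        return accumulated.argmax(dim=-1)


if __name__ == "__main__":
    model = AccumulatedAttention(dim=256)
    query = torch.randn(2, 8, 256)     # 8 query word features
    image = torch.randn(2, 49, 256)    # 7x7 grid of image-region features
    objects = torch.randn(2, 20, 256)  # 20 candidate object proposals
    print(model(query, image, objects))  # index of the selected proposal per example
```

In the paper, a bounding-box regressor trained with the "noised" strategy would then refine the box of the proposal selected this way; that refinement step is omitted from the sketch.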
Pages: 1670-1684
Page count: 15
Related Papers
50 records in total
  • [1] Visual Grounding via Accumulated Attention
    Deng, Chaorui
    Wu, Qi
    Wu, Qingyao
    Hu, Fuyuan
    Lyu, Fan
    Tan, Mingkui
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7746 - 7755
  • [2] Language conditioned multi-scale visual attention networks for visual grounding
    Yao, Haibo
    Wang, Lipeng
    Cai, Chengtao
    Wang, Wei
    Zhang, Zhi
    Shang, Xiaobing
    IMAGE AND VISION COMPUTING, 2024, 150
  • [3] Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining
    Zhang, Yundong
    Niebles, Juan Carlos
    Soto, Alvaro
    2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 349 - 357
  • [4] Countering Language Drift via Visual Grounding
    Lee, Jason
    Cho, Kyunghyun
    Kiela, Douwe
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 4385 - 4395
  • [5] A Visual Attention Grounding Neural Model for Multimodal Machine Translation
    Zhou, Mingyang
    Cheng, Runxiang
    Lee, Yong Jae
    Yu, Zhou
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3643 - 3653
  • [6] Attention-Based Keyword Localisation in Speech using Visual Grounding
    Olaleye, Kayode
    Kamper, Herman
    INTERSPEECH 2021, 2021, : 2991 - 2995
  • [7] Hierarchical cross-modal contextual attention network for visual grounding
    Xu, Xin
    Lv, Gang
    Sun, Yining
    Hu, Yuxia
    Nian, Fudong
    MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083
  • [8] Diagram Visual Grounding: Learning to See with Gestalt-Perceptual Attention
    Hu, Xin
    Zhang, Lingling
    Liu, Jun
    Zhang, Xinyu
    Wu, Wenjun
    Wang, Qianying
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 837 - 845
  • [9] VISUAL GROUNDING
    CUMBOW, RC
    AMERICAN FILM, 1978, 3 (10): : 16 - 16