Visual Grounding Via Accumulated Attention

Cited by: 9
Authors
Deng, Chaorui [1 ,2 ]
Wu, Qi [3 ]
Wu, Qingyao [1 ]
Hu, Fuyuan [4 ]
Lyu, Fan [5 ]
Tan, Mingkui [1 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[2] Pazhou Lab, Guangzhou 510335, Peoples R China
[3] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5005, Australia
[4] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[5] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China
Funding
Australian Research Council; National Natural Science Foundation of China;
Keywords
Proposals; Visualization; Training; Feature extraction; Task analysis; Grounding; Cognition; Visual grounding; accumulated attention; noised training strategy; bounding box regression;
DOI
10.1109/TPAMI.2020.3023438
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. Generally, it requires the machine to first understand the query, identify the key concepts in the image, and then locate the target object by specifying its bounding box. However, many real-world visual grounding applications involve ambiguous queries and images with complicated scene structures. Identifying the target from highly redundant and correlated information can be very challenging and often leads to unsatisfactory performance. To tackle this, we exploit an attention module for each kind of information to reduce internal redundancies. We then propose an accumulated attention (A-ATT) mechanism to reason among all the attention modules jointly, so that the relations among the different kinds of information are captured explicitly. Moreover, to improve the performance and robustness of our VG models, we introduce noise into the training procedure to bridge the distribution gap between the human-labeled training data and real-world, poor-quality data. With this "noised" training strategy, we can further learn a bounding box regressor, which is used to refine the bounding box of the target object. We evaluate the proposed methods on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!). The experimental results show that our methods significantly outperform all previous works on every dataset in terms of accuracy.
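To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of a joint attention loop in the spirit of A-ATT, plus a box-jittering helper approximating the "noised" training strategy. All names, dimensions, the number of reasoning rounds, and the noise form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AccumulatedAttention(nn.Module):
    """Hypothetical sketch: attention over query words, image regions, and
    object proposals is refined jointly over several rounds, with each
    modality conditioned on the accumulated summaries of the other two."""

    def __init__(self, dim=512, rounds=3):
        super().__init__()
        self.rounds = rounds
        # One scoring head per information source (query / image / objects).
        self.score_q = nn.Linear(3 * dim, 1)
        self.score_i = nn.Linear(3 * dim, 1)
        self.score_o = nn.Linear(3 * dim, 1)

    @staticmethod
    def _attend(feats, ctx, head):
        # feats: (B, N, D) candidate features; ctx: (B, 2D) accumulated context.
        ctx = ctx.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = head(torch.cat([feats, ctx], dim=-1)).squeeze(-1)  # (B, N)
        attn = F.softmax(logits, dim=-1)
        summary = torch.bmm(attn.unsqueeze(1), feats).squeeze(1)    # (B, D)
        return attn, summary

    def forward(self, q_feats, i_feats, o_feats):
        # Initial summaries: plain means, before any attention is available.
        s_q, s_i, s_o = q_feats.mean(1), i_feats.mean(1), o_feats.mean(1)
        a_o = None
        for _ in range(self.rounds):
            a_q, s_q = self._attend(q_feats, torch.cat([s_i, s_o], -1), self.score_q)
            a_i, s_i = self._attend(i_feats, torch.cat([s_q, s_o], -1), self.score_i)
            a_o, s_o = self._attend(o_feats, torch.cat([s_q, s_i], -1), self.score_o)
        # Final object attention scores rank the candidate proposals.
        return a_o


def jitter_box(box, scale=0.1):
    """Perturb a ground-truth (x, y, w, h) box during training; an assumed
    form of the paper's 'noised' training strategy."""
    x, y, w, h = box
    n = (torch.rand(4) - 0.5) * 2 * scale  # uniform noise in [-scale, scale]
    return (x + n[0].item() * w, y + n[1].item() * h,
            w * (1 + n[2].item()), h * (1 + n[3].item()))
```

Under these assumptions, a bounding box regressor would be trained on such jittered boxes so that, at test time, it learns to pull imperfect proposals back onto the target object.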
Pages: 1670-1684
Page count: 15