Deconfounded Visual Grounding

Cited by: 0
Authors
Huang, Jianqiang [1 ,2 ]
Qin, Yu [2 ]
Qi, Jiaxin [1 ]
Sun, Qianru [3 ]
Zhang, Hanwang [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Alibaba Grp, Damo Acad, Hangzhou, Peoples R China
[3] Singapore Management Univ, Singapore, Singapore
Keywords
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
We focus on the confounding bias between language and location in the visual grounding pipeline, where we find that this bias is the major bottleneck for visual reasoning. For example, the grounding process often degenerates into a trivial language-location association without visual reasoning, e.g., grounding any language query containing "sheep" to near-central regions, because most training queries about sheep have ground-truth locations at the image center. First, we frame the visual grounding pipeline as a causal graph, which shows the causalities among the image, the query, the target location, and an underlying confounder. Through the causal graph, we know how to break the grounding bottleneck: deconfounded visual grounding. Second, to tackle the challenge that the confounder is unobserved in general, we propose a confounder-agnostic approach called Referring Expression Deconfounder (RED) to remove the confounding bias. Third, we implement RED as a simple language attention module, which can be applied in any grounding method. On popular benchmarks, RED improves various state-of-the-art grounding methods by a significant margin. Code is available at: https://github.com/JianqiangH/Deconfounded_VG.
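The abstract describes a backdoor-style deconfounding over a causal graph with image X, query Q, target location L, and an unobserved confounder g. As general causal-inference background (not necessarily the paper's exact formulation), the standard backdoor adjustment replaces the biased conditional with an intervened one:

P(L | do(X, Q)) = \sum_g P(L | X, Q, g) \, P(g)

Since the abstract states that RED is implemented as a simple language attention that plugs into any grounding method, the following is a minimal PyTorch sketch of such a module. The class name, tensor shapes, and pooling scheme are illustrative assumptions, not the authors' released implementation; see the linked repository for that.

```python
# Minimal sketch of a RED-style language-attention module (assumed design,
# not the released code; see the GitHub link in the abstract for that).
import torch
import torch.nn as nn

class LanguageAttention(nn.Module):
    """Re-weights query token embeddings before vision-language fusion,
    so the pooled query is not dominated by confounder-correlated tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token attention logit

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (batch, num_tokens, dim) query token embeddings
        attn = torch.softmax(self.score(q), dim=1)  # (batch, num_tokens, 1)
        return (attn * q).sum(dim=1)                # (batch, dim) pooled query

# Usage: replace naive mean-pooling of query tokens in a grounding head.
pool = LanguageAttention(dim=256)
tokens = torch.randn(2, 12, 256)  # dummy batch of query token embeddings
pooled = pool(tokens)             # (2, 256), fed to the fusion/localization head
```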
Pages: 998-1006 (9 pages)