Hierarchical cross-modal contextual attention network for visual grounding

Cited by: 0
Authors
Xin Xu
Gang Lv
Yining Sun
Yuxia Hu
Fudong Nian
Institutions
[1] Hefei University,School of Advanced Manufacturing Engineering
[2] Hefei Comprehensive National Science Center,Institute of Artificial Intelligence
[3] Anhui Jianzhu University,Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling
[4] University of Science and Technology of China
[5] Chinese Academy of Sciences,School of Information Science and Technology
Source
Multimedia Systems | 2023, Vol. 29
Keywords
Visual grounding; Transformer; Multi-modal attention; Deep learning;
DOI
Not available
CLC Number
Subject Classification Code
Abstract
This paper explores the task of visual grounding (VG), which aims to localize regions of an image through sentence queries. The development of VG has significantly advanced with Transformer-based frameworks, which can capture image and text contexts without proposals. However, previous research has rarely explored hierarchical semantics and cross-interactions between two uni-modal encoders. Therefore, this paper proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. The HCCAN model utilizes a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This approach not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also captures the hierarchical semantics of textual and visual content in a common space. Experiments conducted on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
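The guided contextual attention modules named in the abstract build on scaled dot-product cross-attention, where queries come from one modality and keys/values from the other. The sketch below is illustrative only, not the authors' HCCAN implementation; the feature dimensions, the NumPy formulation, and the single-head form are assumptions for clarity.

```python
# Hedged sketch of text-guided visual cross-attention (not the authors' code):
# each text token attends over all visual region features.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries from one modality,
    keys/values from the other (e.g. text queries over image regions)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_kv) affinity matrix
    weights = softmax(scores, axis=-1)       # each query attends over all keys
    return weights @ values                  # (n_q, d) attended features

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 64))     # 5 word tokens, 64-d (illustrative sizes)
visual_feats = rng.normal(size=(49, 64))  # a 7x7 image feature grid, flattened

# Text-guided visual attention: each word gathers relevant visual context.
attended = cross_attention(text_feats, visual_feats, visual_feats)
print(attended.shape)  # (5, 64)
```

Swapping the roles of the two inputs gives the visual-guided text direction; in a full Transformer model this operation would be multi-headed, learned via projection matrices, and stacked with self-attention and feed-forward layers.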
Pages: 2073-2083
Page count: 10
Related Papers
50 items in total
  • [1] Hierarchical cross-modal contextual attention network for visual grounding
    Xu, Xin
    Lv, Gang
    Sun, Yining
    Hu, Yuxia
    Nian, Fudong
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083
  • [2] Semantic-Aligned Cross-Modal Visual Grounding Network with Transformers
    Zhang, Qianjun
    Yuan, Jin
    [J]. APPLIED SCIENCES-BASEL, 2023, 13 (09):
  • [3] CHAN: Cross-Modal Hybrid Attention Network for Temporal Language Grounding in Videos
    Wang, Wen
    Zhong, Ling
    Gao, Guang
    Wan, Minhong
    Gu, Jason
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1499 - 1504
  • [4] Cross-modal orienting of visual attention
    Hillyard, Steven A.
    Stoermer, Viola S.
    Feng, Wenfeng
    Martinez, Antigona
    McDonald, John J.
    [J]. NEUROPSYCHOLOGIA, 2016, 83 : 170 - 178
  • [5] Learning Cross-Modal Context Graph for Visual Grounding
    Liu, Yongfei
    Wan, Bo
    Zhu, Xiaodan
    He, Xuming
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11645 - 11652
  • [6] Cross-modal contextual memory guides selective attention in visual-search tasks
    Chen, Siyi
    Shi, Zhuanghua
    Zinchenko, Artyom
    Mueller, Hermann J.
    Geyer, Thomas
    [J]. PSYCHOPHYSIOLOGY, 2022, 59 (07)
  • [7] Cross-modal exogenous visual selective attention
    Zhao, C
    Yang, H
    Zhang, K
    [J]. INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2000, 35 (3-4) : 100 - 100
  • [8] Utilizing visual attention for cross-modal coreference interpretation
    Byron, D
    Mampilly, T
    Sharma, V
    Xu, TF
    [J]. MODELING AND USING CONTEXT, PROCEEDINGS, 2005, 3554 : 83 - 96
  • [9] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    [J]. IEEE ACCESS, 2018, 6 : 31516 - 31524
  • [10] Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
    Xuan, Hanyu
    Zhang, Zhenyu
    Chen, Shuo
    Yang, Jian
    Yan, Yan
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 279 - 286