Hierarchical cross-modal contextual attention network for visual grounding

Cited by: 0
Authors
Xin Xu
Gang Lv
Yining Sun
Yuxia Hu
Fudong Nian
Institutions
[1] Hefei University,School of Advanced Manufacturing Engineering
[2] Hefei Comprehensive National Science Center,Institute of Artificial Intelligence
[3] Anhui Jianzhu University,Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling
[4] University of Science and Technology of China
[5] Chinese Academy of Sciences,School of Information Science and Technology
Source
Multimedia Systems | 2023, Vol. 29
Keywords
Visual grounding; Transformer; Multi-modal attention; Deep learning;
DOI
Not available
CLC Number
Subject Classification Code
Abstract
This paper explores the task of visual grounding (VG), which aims to localize regions of an image through sentence queries. The development of VG has significantly advanced with Transformer-based frameworks, which can capture image and text contexts without proposals. However, previous research has rarely explored hierarchical semantics and cross-interactions between two uni-modal encoders. Therefore, this paper proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. The HCCAN model utilizes a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This approach not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also captures the hierarchical semantics of textual and visual content in a common space. Experiments conducted on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
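The guided contextual attention modules named in the abstract build on scaled dot-product cross-attention, where queries come from one modality and keys/values from the other. The sketch below is illustrative only, not the authors' HCCAN implementation; the feature dimensions, the NumPy formulation, and the single-head form are assumptions for clarity.

```python
# Hedged sketch of text-guided visual cross-attention (not the authors' code):
# each text token attends over all visual region features.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries from one modality,
    keys/values from the other (e.g. text queries over image regions)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_kv) affinity matrix
    weights = softmax(scores, axis=-1)       # each query attends over all keys
    return weights @ values                  # (n_q, d) attended features

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 64))     # 5 word tokens, 64-d (illustrative sizes)
visual_feats = rng.normal(size=(49, 64))  # a 7x7 image feature grid, flattened

# Text-guided visual attention: each word gathers relevant visual context.
attended = cross_attention(text_feats, visual_feats, visual_feats)
print(attended.shape)  # (5, 64)
```

Swapping the roles of the two inputs gives the visual-guided text direction; in a full Transformer model this operation would be multi-headed, learned via projection matrices, and stacked with self-attention and feed-forward layers.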
Pages: 2073-2083
Page count: 10
Related Papers
50 items in total
  • [1] Hierarchical cross-modal contextual attention network for visual grounding
    Xu, Xin
    Lv, Gang
    Sun, Yining
    Hu, Yuxia
    Nian, Fudong
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083
  • [2] Semantic-Aligned Cross-Modal Visual Grounding Network with Transformers
    Zhang, Qianjun
    Yuan, Jin
    [J]. APPLIED SCIENCES-BASEL, 2023, 13 (09):
  • [3] CHAN: Cross-Modal Hybrid Attention Network for Temporal Language Grounding in Videos
    Wang, Wen
    Zhong, Ling
    Gao, Guang
    Wan, Minhong
    Gu, Jason
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1499 - 1504
  • [4] Cross-modal orienting of visual attention
    Hillyard, Steven A.
    Stoermer, Viola S.
    Feng, Wenfeng
    Martinez, Antigona
    McDonald, John J.
    [J]. NEUROPSYCHOLOGIA, 2016, 83 : 170 - 178
  • [5] Learning Cross-Modal Context Graph for Visual Grounding
    Liu, Yongfei
    Wan, Bo
    Zhu, Xiaodan
    He, Xuming
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11645 - 11652
  • [6] Cross-modal contextual memory guides selective attention in visual-search tasks
    Chen, Siyi
    Shi, Zhuanghua
    Zinchenko, Artyom
    Mueller, Hermann J.
    Geyer, Thomas
    [J]. PSYCHOPHYSIOLOGY, 2022, 59 (07)
  • [7] Cross-modal exogenous visual selective attention
    Zhao, C
    Yang, H
    Zhang, K
    [J]. INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2000, 35 (3-4) : 100 - 100
  • [8] Utilizing visual attention for cross-modal coreference interpretation
    Byron, D
    Mampilly, T
    Sharma, V
    Xu, TF
    [J]. MODELING AND USING CONTEXT, PROCEEDINGS, 2005, 3554 : 83 - 96
  • [9] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    [J]. IEEE ACCESS, 2018, 6 : 31516 - 31524
  • [10] Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
    Xuan, Hanyu
    Zhang, Zhenyu
    Chen, Shuo
    Yang, Jian
    Yan, Yan
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 279 - 286