Hierarchical cross-modal contextual attention network for visual grounding

Cited by: 0
Authors
Xin Xu
Gang Lv
Yining Sun
Yuxia Hu
Fudong Nian
Affiliations
[1] Hefei University, School of Advanced Manufacturing Engineering
[2] Hefei Comprehensive National Science Center, Institute of Artificial Intelligence
[3] Anhui Jianzhu University, Anhui International Joint Research Center for Ancient Architecture Intelli-Sensing and Multi-Dimensional Modeling
[4] University of Science and Technology of China, School of Information Science and Technology
[5] Chinese Academy of Sciences
Source
Multimedia Systems | 2023, Vol. 29
Keywords
Visual grounding; Transformer; Multi-modal attention; Deep learning
DOI
Not available
Abstract
This paper explores visual grounding (VG), which aims to localize the image region referred to by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous research has rarely explored hierarchical semantics or the cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN combines a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This design not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
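The abstract names three components: two cross-guided attention modules and a Transformer fusion module. The following is a minimal PyTorch sketch of that flow, assuming standard multi-head attention; all module names, dimensions, and the box-regression head are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code).

# Hypothetical sketch of the cross-guided attention flow described in the
# abstract; the wiring and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

class CrossGuidedAttention(nn.Module):
    # One guided block: queries from one modality attend to the other.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, context):
        # Cross-attention: e.g. visual tokens (queries) gather text context.
        attended, _ = self.attn(queries, context, context)
        return self.norm(queries + attended)  # residual connection + LayerNorm

class HCCANSketch(nn.Module):
    # Hypothetical wiring of the three modules named in the abstract.
    def __init__(self, dim=256, heads=8, fusion_layers=4):
        super().__init__()
        self.text_guided_visual = CrossGuidedAttention(dim, heads)
        self.visual_guided_text = CrossGuidedAttention(dim, heads)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=fusion_layers)
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward(self, visual_tokens, text_tokens):
        # Each modality is refined under guidance from the other.
        v = self.text_guided_visual(visual_tokens, text_tokens)
        t = self.visual_guided_text(text_tokens, visual_tokens)
        # Joint Transformer fusion over the concatenated token sequence.
        fused = self.fusion(torch.cat([v, t], dim=1))
        # Regress one box from the first fused token (an assumption here).
        return self.box_head(fused[:, 0]).sigmoid()

# Toy usage: 196 visual tokens (a 14x14 grid) and 20 text tokens.
model = HCCANSketch()
boxes = model(torch.randn(2, 196, 256), torch.randn(2, 20, 256))
print(boxes.shape)  # torch.Size([2, 4])

The key design point the abstract emphasizes is the symmetry: each uni-modal stream is conditioned on the other before a shared fusion stage, rather than fusing raw, independently encoded features.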
Pages: 2073-2083
Page count: 10
Related papers
50 results
  • [31] The Neural Correlates of Visual and Auditory Cross-Modal Selective Attention in Aging
    Rienacker, Franziska
    Van Gerven, Pascal W. M.
    Jacobs, Heidi I. L.
    Eck, Judith
    Van Heugten, Caroline M.
    Guerreiro, Maria J. S.
    FRONTIERS IN AGING NEUROSCIENCE, 2020, 12
  • [32] Cross-modal event extraction via Visual Event Grounding and Semantic Relation Filling
    Liu, Maofu
    Zhou, Bingying
    Hu, Huijun
    Qiu, Chen
    Zhang, Xiaokang
INFORMATION PROCESSING AND MANAGEMENT, 2025, 62 (03):
  • [33] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu
    Liu, Ruifang
    Peng, Bo
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3939 - 3948
  • [34] Visual attention in a visual-haptic, cross-modal matching task in children and adults
    Cote, Carol Ann
    PERCEPTUAL AND MOTOR SKILLS, 2015, 120 (02) : 381 - 396
  • [35] Supervised Hierarchical Cross-Modal Hashing
    Sun, Changchang
    Song, Xuemeng
    Feng, Fuli
    Zhao, Wayne Xin
    Zhang, Hao
    Nie, Liqiang
    PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019, : 725 - 734
  • [36] Cross-modal links in spatial attention
    Driver, J
    Spence, C
    PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 1998, 353 (1373) : 1319 - 1331
  • [37] Cross-modal decoupling in temporal attention
    Muehlberg, Stefanie
    Oriolo, Giovanni
    Soto-Faraco, Salvador
    EUROPEAN JOURNAL OF NEUROSCIENCE, 2014, 39 (12) : 2089 - 2097
  • [38] Cross-modal synergies in spatial attention
    Driver, J
    Eimer, M
    Macaluso, E
    Van Velzen, J
    PERCEPTION, 2003, 32 : 15 - 15
  • [39] Cross-modal attention and letter recognition
    Wesner, Michael
    Miller, Lisa
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2008, 43 (3-4) : 343 - 343
  • [40] CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network
    Peng, Yuxin
    Qi, Jinwei
    Huang, Xin
    Yuan, Yuxin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (02) : 405 - 420