Hierarchical cross-modal contextual attention network for visual grounding

Cited: 0
Authors
Xin Xu
Gang Lv
Yining Sun
Yuxia Hu
Fudong Nian
Affiliations
[1] Hefei University, School of Advanced Manufacturing Engineering
[2] Hefei Comprehensive National Science Center, Institute of Artificial Intelligence
[3] Anhui Jianzhu University, Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling
[4] University of Science and Technology of China
[5] Chinese Academy of Sciences, School of Information Science and Technology
Source
Multimedia Systems | 2023 / Vol. 29
Keywords
Visual grounding; Transformer; Multi-modal attention; Deep learning;
DOI: not available
Abstract
This paper explores the task of visual grounding (VG), which aims to localize the region of an image referred to by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous work has rarely explored hierarchical semantics or cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN comprises a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This design not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
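The abstract describes text-guided visual attention and visual-guided text attention, i.e. cross-modal attention where features of one modality form the queries and the other modality supplies keys and values. As a rough illustration only (not the authors' implementation; all function names, shapes, and feature dimensions below are hypothetical), the core scaled dot-product cross-attention step can be sketched as:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: tokens of one modality (queries)
    attend to tokens of the other modality (keys/values)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (Nq, Nk) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the other modality
    return weights @ values                          # (Nq, d) attended context

# Hypothetical sizes: 5 text tokens, 9 image regions, 16-d shared features.
rng = np.random.default_rng(0)
text = rng.standard_normal((5, 16))
visual = rng.standard_normal((9, 16))

# "Text-guided visual attention": text queries gather visual context.
attended = cross_attention(text, visual, visual)
print(attended.shape)  # (5, 16)
```

In the full model this step would be wrapped with learned projections, multiple heads, and residual connections inside the Transformer fusion module; the sketch only shows the direction of the cross-modal query/key pairing.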
Pages: 2073–2083 (10 pages)