Cross-Modality Pyramid Alignment for Visual Intention Understanding

Cited by: 2
Authors
Ye, Mang [1]
Shi, Qinghongya [1]
Su, Kehua [1]
Du, Bo [1]
Affiliations
[1] Wuhan Univ, Natl Engn Res Ctr Multimedia Software, Sch Comp Sci, Hubei Luojia Lab, Wuhan 430072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Task analysis; Semantics; Feature extraction; Training; Image segmentation; Image color analysis; Visual intention understanding; cross modality; hierarchical relation
DOI
10.1109/TIP.2023.3261743
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Visual intention understanding is the task of exploring the potential and underlying meaning expressed in images. Modeling only the objects or backgrounds within the image content leads to unavoidable comprehension bias. To alleviate this problem, this paper proposes Cross-modality Pyramid Alignment with Dynamic optimization (CPAD), which enhances the global understanding of visual intention through hierarchical modeling. The core idea is to exploit the hierarchical relationship between visual content and textual intention labels. For the visual hierarchy, we formulate visual intention understanding as a hierarchical classification problem, capturing features at multiple granularities in different layers that correspond to hierarchical intention labels. For the textual hierarchy, we directly extract semantic representations from the intention labels at different levels, which supplements the visual content modeling without extra manual annotation. Moreover, to further narrow the domain gap between the two modalities, a cross-modality pyramid alignment module is designed to dynamically optimize visual intention understanding in a joint learning manner. Comprehensive experiments demonstrate the superiority of the proposed method, which outperforms existing visual intention understanding methods.
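The abstract describes the CPAD mechanism only in prose; the following is a minimal, hypothetical PyTorch sketch of that core idea: project pooled visual features from several pyramid levels and text embeddings of the corresponding intention labels into a shared space, align them per level, and combine the per-level losses with learnable weights as a stand-in for the paper's dynamic optimization. The class and parameter names, the feature dimensions (256/512/1024 visual, 300-d text, label counts 9/28/83), and the 0.07 temperature are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of cross-modality pyramid alignment; all names,
# dimensions, and the loss form are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidAlignmentSketch(nn.Module):
    """Aligns multi-granularity visual features with per-level label
    embeddings, weighting the per-level losses with learnable coefficients."""

    def __init__(self, vis_dims=(256, 512, 1024), txt_dim=300, embed_dim=128):
        super().__init__()
        # One projection head per pyramid level, for each modality.
        self.vis_proj = nn.ModuleList([nn.Linear(d, embed_dim) for d in vis_dims])
        self.txt_proj = nn.ModuleList([nn.Linear(txt_dim, embed_dim) for _ in vis_dims])
        # Learnable per-level weights standing in for "dynamic optimization".
        self.level_logits = nn.Parameter(torch.zeros(len(vis_dims)))

    def forward(self, vis_feats, label_embeds, targets):
        # vis_feats:    list of (B, d_l) pooled visual features, coarse -> fine
        # label_embeds: list of (n_l, txt_dim) label-text embeddings per level
        # targets:      list of (B,) ground-truth label indices per level
        weights = torch.softmax(self.level_logits, dim=0)
        total = vis_feats[0].new_zeros(())
        for l, (v, t, y) in enumerate(zip(vis_feats, label_embeds, targets)):
            v = F.normalize(self.vis_proj[l](v), dim=-1)   # (B, embed_dim)
            t = F.normalize(self.txt_proj[l](t), dim=-1)   # (n_l, embed_dim)
            logits = v @ t.t() / 0.07                      # scaled cosine similarity
            total = total + weights[l] * F.cross_entropy(logits, y)
        return total

# Toy usage with random tensors: a batch of 4 images, three label levels.
model = PyramidAlignmentSketch()
vis = [torch.randn(4, d) for d in (256, 512, 1024)]
txt = [torch.randn(n, 300) for n in (9, 28, 83)]
ys = [torch.randint(0, n, (4,)) for n in (9, 28, 83)]
loss = model(vis, txt, ys)
loss.backward()
```

Treating each level as a cross-entropy over image-to-label similarities lets coarse levels correct fine-level mistakes during joint training; how the paper actually parameterizes the dynamic weighting is not specified in the abstract.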
Pages: 2190-2201
Page count: 12