Hierarchical Vision and Language Transformer for Efficient Visual Dialog

Cited by: 0
Authors
He, Qiangqiang [1 ]
Zhang, Mujie [1 ]
Zhang, Jie [1 ]
Yang, Shang [1 ]
Wang, Chongjun [1 ]
Institutions
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual Dialog; Hierarchical Transformer; Multi-Modal;
DOI
10.1007/978-3-031-44223-0_34
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
The visual dialog task requires a deep understanding of an image and a dialog history to answer multiple consecutive questions. Existing research focuses on enhancing cross-modal interaction and fusion but often overlooks the computational complexity and higher-level interaction between the two modalities. This paper proposes a hierarchical vision and language Transformer (HVLT) to address these issues. Specifically, HVLT employs a convolution-like design to learn the interaction and fusion of images and text at different levels. We employ a token merging module to aggregate four spatially adjacent image tokens and four temporally adjacent text tokens into one token and use the expanded [CLS] token to fuse image and text information in a new dimension. This hierarchical architecture allows the model to focus on feature maps of different sizes and dialog history at word, phrase, and sentence levels and reduces the time overhead. We tailor two training objectives for HVLT: masked language regression (MLR) and next sentence prediction (NSP), which help the model understand images and language and learn their relationships. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the competitive performance of HVLT. Finally, we visualize the attention to gain insights into how HVLT works in practice, shedding light on its interpretability.
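The token-merging mechanism described in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering of the idea, assuming the merge concatenates the features of four neighboring tokens and applies a linear projection (in the style of Swin-like patch merging); the paper's actual projection sizes, normalization, and handling of the expanded [CLS] token may differ.

```python
# Minimal sketch of the token-merging idea from the abstract.
# Assumption: 4 neighboring tokens are concatenated and linearly
# projected to 2x the channel dim. This is NOT the authors' code;
# the expanded-[CLS] token is assumed to be handled separately.
import torch
import torch.nn as nn

class TokenMerging(nn.Module):
    """Merge 2x2 spatially adjacent image tokens, or 4 temporally
    adjacent text tokens, into one token, shrinking the sequence
    and reducing attention cost at deeper stages."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.proj = nn.Linear(4 * dim, 2 * dim, bias=False)

    def merge_image(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of image tokens; H and W assumed even.
        B, H, W, C = x.shape
        x = x.reshape(B, H // 2, 2, W // 2, 2, C)
        # Gather each 2x2 neighborhood into the channel dimension.
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H // 2, W // 2, 4 * C)
        return self.proj(self.norm(x))          # (B, H/2, W/2, 2C)

    def merge_text(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B, L, C) text tokens; L assumed divisible by 4.
        B, L, C = t.shape
        t = t.reshape(B, L // 4, 4 * C)          # group 4 neighbors
        return self.proj(self.norm(t))           # (B, L/4, 2C)

if __name__ == "__main__":
    m = TokenMerging(dim=96)
    img = torch.randn(2, 14, 14, 96)
    txt = torch.randn(2, 40, 96)
    print(m.merge_image(img).shape)  # torch.Size([2, 7, 7, 192])
    print(m.merge_text(txt).shape)   # torch.Size([2, 10, 192])
```

Halving the token grid along each spatial axis (and quartering the text length) at every stage is what yields the convolution-like hierarchy and the reduced time overhead the abstract refers to.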
Pages: 421-432
Page count: 12