Hierarchical Vision and Language Transformer for Efficient Visual Dialog

Cited by: 0
Authors
He, Qiangqiang [1 ]
Zhang, Mujie [1 ]
Zhang, Jie [1 ]
Yang, Shang [1 ]
Wang, Chongjun [1 ]
Institutions
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual Dialog; Hierarchical Transformer; Multi-Modal;
DOI
10.1007/978-3-031-44223-0_34
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The visual dialog task requires a deep understanding of an image and a dialog history to answer multiple consecutive questions. Existing research focuses on enhancing cross-modal interaction and fusion but often overlooks the computational complexity and the higher-level interaction between the two modalities. This paper proposes a hierarchical vision and language Transformer (HVLT) to address these issues. Specifically, HVLT employs a convolution-like design to learn the interaction and fusion of images and text at different levels. A token merging module aggregates four spatially adjacent image tokens and four temporally adjacent text tokens into one token, and an expanded [CLS] token fuses image and text information in a new dimension. This hierarchical architecture lets the model attend to feature maps of different sizes and to the dialog history at the word, phrase, and sentence levels, while reducing time overhead. We tailor two training objectives for HVLT: masked language regression (MLR) and next sentence prediction (NSP), which help the model understand images and language and learn the relationships between them. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the competitive performance of HVLT. Finally, we visualize the attention to gain insight into how HVLT works in practice, shedding light on its interpretability.
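The spatial token merging described in the abstract can be sketched as follows. The 2×2 window follows the abstract's "four spatially adjacent image tokens"; the tensor shapes and the learned projection `proj` are illustrative assumptions (a plain-NumPy stand-in for what would be a linear layer in the actual model), not the authors' exact implementation:

```python
import numpy as np

def merge_image_tokens(tokens, proj):
    """Merge each 2x2 window of image tokens into one token.

    tokens: (H, W, C) grid of image tokens
    proj:   (4*C, C_out) projection matrix (learned in a real model)
    returns: (H/2, W/2, C_out) merged token grid
    """
    H, W, C = tokens.shape
    assert H % 2 == 0 and W % 2 == 0, "token grid must be divisible by 2"
    # Group each 2x2 spatial window: (H/2, 2, W/2, 2, C) -> (H/2, W/2, 4*C)
    grouped = tokens.reshape(H // 2, 2, W // 2, 2, C)
    grouped = grouped.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)
    # Project the concatenated window down to a single token
    return grouped @ proj

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))   # 64 image tokens, 16-dim each (assumed)
w = rng.standard_normal((64, 32))     # 4*16 -> 32 projection (assumed)
merged = merge_image_tokens(x, w)
print(merged.shape)  # (4, 4, 32): one token per 2x2 window
```

The text side would merge four temporally adjacent tokens analogously with a 1-D window; at each level the token count drops by a constant factor, which is where the reduced time overhead comes from.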
Pages: 421-432
Page count: 12
Related Papers
50 records total
  • [1] Aligning vision-language for graph inference in visual dialog. Jiang, Tianling; Shao, Hailin; Tian, Xin; Ji, Yi; Liu, Chunping. IMAGE AND VISION COMPUTING, 2021, 116
  • [2] Hierarchical attention vision transformer for fine-grained visual classification. Hu, Xiaobin; Zhu, Shining; Peng, Taile. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2023, 91
  • [3] Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog. Chen, Feilong; Zhang, Duzhen; Chen, Xiuyi; Shi, Jing; Xu, Shuang; Xu, Bo. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 4142-4153
  • [4] Adaptively bypassing vision transformer blocks for efficient visual tracking. Yang, Xiangyang; Zeng, Dan; Wang, Xucheng; Wu, You; Ye, Hengzhou; Zhao, Qijun; Li, Shuiwang. PATTERN RECOGNITION, 2025, 161
  • [5] Hierarchical Transformer for Task Oriented Dialog Systems. Santra, Bishal; Anusha, Potnuru; Goyal, Pawan. 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021: 5649-5658
  • [6] Long-Short Transformer: Efficient Transformers for Language and Vision. Zhu, Chen; Ping, Wei; Xiao, Chaowei; Shoeybi, Mohammad; Goldstein, Tom; Anandkumar, Anima; Catanzaro, Bryan. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [7] Vision transformer-based visual language understanding of the construction process. Yang, Bin; Zhang, Binghan; Han, Yilong; Liu, Boda; Hu, Jiniming; Jin, Yiming. ALEXANDRIA ENGINEERING JOURNAL, 2024, 99: 242-256
  • [8] Vision-Language Transformer for Interpretable Pathology Visual Question Answering. Naseem, Usman; Khushi, Matloob; Kim, Jinman. IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2023, 27 (04): 1681-1690
  • [9] Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking. Kang, Ben; Chen, Xin; Wang, Dong; Peng, Houwen; Lu, Huchuan. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 9578-9587
  • [10] Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training. Kong, Zhenglun; Ma, Haoyu; Yuan, Geng; Sun, Mengshu; Xie, Yanyue; Dong, Peiyan; Meng, Xin; Shen, Xuan; Tang, Hao; Qin, Minghai; Chen, Tianlong; Ma, Xiaolong; Xie, Xiaohui; Wang, Zhangyang; Wang, Yanzhi. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 7, 2023: 8360-8368